Abstract
Genomic research that targets large-scale, prospective birth cohorts constitutes an essential strategy for understanding the influence of genetics and environment on human health1. Nonetheless, such studies remain scarce, particularly in Asia. Here we present the phase I genome study of the Born in Guangzhou Cohort Study2 (BIGCS), which encompasses the sequencing and analysis of 4,053 Chinese individuals, primarily composed of trios or mother–infant duos residing in South China. Our analysis reveals novel genetic variants, a high-quality reference panel, and fine-scale local genetic structure within BIGCS. Notably, we identify previously unreported East Asian-specific genetic associations with maternal total bile acid, gestational weight gain and infant cord blood traits. Additionally, we observe prevalent age-specific genetic effects on lipid levels in mothers and infants. In an exploratory intergenerational Mendelian randomization analysis, we estimate the maternal putatively causal and fetal genetic effects of seven adult phenotypes on seven fetal growth-related measurements. These findings illuminate the genetic links between maternal and early-life traits in an East Asian population and lay the groundwork for future research into the intricate interplay of genetics, intrauterine exposures and early-life experiences in shaping long-term health.
Main
Human genetics seeks to decipher the intricate relationship between DNA sequence variants and biomedical traits, particularly those relevant to the initiation and progression of human diseases3. Recent advancements in sequencing technology and analytical methodologies, and the establishment of large biobanks and genomics consortia have propelled this pursuit4,5,6. However, the predominant focus on unrelated adult individuals in genomic studies has constrained our comprehension of the genetic underpinnings of traits that manifest early in life, and how in utero and early-life exposures interact with genetics in shaping trait variability and disease susceptibility later in life7,8. Furthermore, the underrepresentation of non-European populations in genomic research has hampered the generalizability of findings to global populations9.
To address these gaps, prospective birth cohorts that recruit and sequence families in trios or parent–child duos provide an effective and systematic approach. So far, some European birth cohorts10,11,12,13 and consortia14,15 have made some progress in elucidating the genetic architecture of maternal and infant traits and the potential causal effects of early-life exposures on health outcomes. However, additional sequencing efforts are needed to explore low-frequency and population-specific variants, as well as broader phenotypic spectra that are not covered in these studies. In Asia, several birth cohort studies have been initiated16,17,18,19. These studies have primarily used traditional epidemiological approaches with limited genetic data, limiting opportunities to explore potentially causal relationships between multifactorial risk factors and health outcomes. Given the lifestyle differences and unique population characteristics between Asian and European populations, comprehensive population-specific genome studies in birth cohorts are imperative.
Launched in South China, the Born in Guangzhou Cohort Study (BIGCS) represents one of the largest prospective birth cohorts in Asia2. By 2021, BIGCS had recruited and deeply phenotyped more than 50,000 trio or duo families from Guangzhou city, a southern Chinese megacity with a Cantonese heritage. BIGCS aims to track diverse physical and biochemical measurements in families from gestation to the age of 18 to understand the influences of early-life traits on developmental health (Supplementary Fig. 1). In this Article, we present findings from phase I of the genome study, which involved whole-genome sequencing (WGS) analysis of 4,053 healthy participants in trios or duos (332 trios, 1,406 duos and 245 unrelated individuals) using a low-coverage WGS design20, with an average coverage of approximately 6.63x. We report a high-quality variation dataset and haplotype reference panel that enable more accurate genotype imputation for individuals of Chinese ancestries. We unveil a fine-scale genetic structure among BIGCS participants, characterized by linguistic affiliations and an ancient northern–southern admixture. In genome-wide association studies (GWAS) of 18 adult and infant quantitative traits, we identify East Asian-specific genetic associations with total bile acid (TBA) and gestational weight gain (GWG), suggesting genetic associations with previously understudied infant cord blood traits, and prevalent age-specific genetic effects on lipid levels. Finally, we conducted an intergenerational Mendelian randomization analysis to explore and provide estimates of putative maternal causal non-inherited intrauterine effects on seven fetal growth measurements for seven maternal phenotypes. The BIGCS website is available at http://bigcs.com.cn/.
The BIGCS phase I genomic dataset
The phase I genome study of BIGCS includes 4,053 participants from the BIGCS programme at Guangzhou Women and Children’s Medical Center (GWCMC), representing 13 ethnic groups from 30 out of 34 administrative divisions in China (Fig. 1a, Supplementary Fig. 2 and Supplementary Tables 1 and 2). Participants comprised 332 parent–offspring trios (father–mother–offspring), 1,406 parent–offspring duos (14 father–offspring and 1,392 mother–offspring duos) and 245 unrelated individuals (Fig. 1b). Peripheral blood samples from parents and cord blood samples from infants were sequenced to an average depth of 6.63x (Supplementary Fig. 3a,b), covering approximately 98.56% ± 0.64% of the non-N sequences in the human genome (Supplementary Table 3). Further details regarding the study’s sampling framework can be found in the Methods and Supplementary Notes.
Fig. 1: Characteristics of phase I participants and genetic variants.
a, Geographic distribution and demographic statistics of the 4,053 participants involved in the BIGCS phase I genome study categorized by birthplace. The map was sourced from an approved standard map service (http://bzdt.ch.mnr.gov.cn) endorsed by the Ministry of National Resources of the People’s Republic of China (GS YUE(2023)1422). Numbers on the map indicate the number of participants in each region. b, Composition of samples within the study. c, Allele frequency distribution of known and novel variants, with SNPs and indels displayed. d, Length distribution and the proportion of indel variants in noncoding and coding regions, ranging from −20 bp (deletion) to +20 bp (insertion) in size. e, Venn diagram illustrating variants identified in BIGCS compared with other data resources, including CMDB, ChinaMAP, gnomAD EAS and 1KGP3 CHN. Each coloured oval represents a specific resource, featuring its name, variant number and sample size. The percentage within each oval indicates the proportion of variants within that resource relative to the total number of variants in the combined datasets (n = 197,403,927). f, Average imputation accuracy is shown for variants within each non-reference allele frequency bin. Each coloured line corresponds to a reference panel. Imputation accuracy is measured using R2, computed between genotype dosage imputed by each reference panel (1KGP3, HRC, GAsP, BIGCS, TOPMed and the Meta-BIGCS-TOPMed) and true genotypes in 50 high-coverage WGS samples per variant. Variants present in the true set but not in the reference panel are assigned an R2 value of zero. The allele frequency (x axis) is determined based on the allele frequency estimated from the BIGCS dataset. For variants not available in BIGCS, allele frequencies were estimated from the 50 WGS samples.
After quality control and bioinformatics analysis (Extended Data Fig. 1 and Methods), we identified a set of 56,230,613 biallelic variants, comprising 51,052,456 single nucleotide variants (SNVs) and 5,178,157 small insertion–deletion mutations (indels) (Extended Data Table 1). The overall transition/transversion (Ts/Tv) ratio was 2.09 and the heterozygous-to-homozygous (Het/Hom) proportion was 1.48, consistent with statistical expectations21. We categorized the variants on the basis of non-reference allele frequency in the BIGCS population and their presence in NCBI dbSNP (Fig. 1c). Approximately 32.56% of the variants (18.3 million, Ts/Tv = 1.46) were not reported in dbSNP (build 154), with 93.4% classified as singletons or doubletons (allelic count AC ≤ 2). Indels ranged in size from −20 bp to 20 bp (Supplementary Fig. 4), with a 3-bp enrichment in coding regions (Fig. 1d). Compared with four Chinese genetic resources: CMDB22 (v1.0; 0.06x; 141,431 genomes), ChinaMAP23 (v1.0; 40.8x; 10,588 genomes), gnomAD EAS24 (v3.0; 1,567 genomes) and 1KGP3 CHN (ref. 4; a combination of CHB, CHS and CDX; 301 genomes), the BIGCS resource contributed specifically to 10% of the union set of 197,403,927 variants from the 5 datasets (Fig. 1e). The number of variants per individual exhibited geographical, ethnic and linguistic patterns (Supplementary Fig. 5). Compared with genotypes from a subset of 240 individuals with Illumina SNP arrays, we achieved a genotype concordance rate of 0.99 for variants with minor allele frequency (MAF) greater than 0.05, and 0.98 for low-frequency variants with MAF between 0.01 and 0.05 (Supplementary Fig. 6a). Notably, genotype refinement using BEAGLE tools markedly reduced the discordance rate, highlighting the importance of utilizing family-relatedness information in birth cohorts to improve variant quality (Supplementary Fig. 6b).
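As an illustration of the Ts/Tv quality metric quoted above, the following minimal Python sketch counts transitions (A↔G, C↔T) and transversions in a list of single nucleotide variants. The function name `ts_tv_ratio` and the toy variant list are hypothetical and are not part of the BIGCS pipeline.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Return the transition/transversion ratio for (ref, alt) base pairs."""
    ts = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = sum(
        1 for ref, alt in snvs
        if ref != alt and (ref, alt) not in TRANSITIONS
    )
    if tv == 0:
        raise ValueError("no transversions in input; ratio is undefined")
    return ts / tv


# Toy input (hypothetical): four transitions and two transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"),
            ("A", "C"), ("G", "T")]
toy_ratio = ts_tv_ratio(toy_snvs)
```

On real call sets the ratio would be computed per variant class (for example known versus novel), as in the comparison of 2.09 and 1.46 reported above.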
We further constructed a haplotype reference panel comprising 2,245 unrelated parental individuals and 43,055,086 high-quality SNPs plus 4,184,387 indels spanning the 22 autosomes and the X chromosome (Methods). To assess the utility of the BIGCS panel in imputing genotypes for individuals of Chinese ancestry, we employed an independent high-coverage WGS dataset comprising 50 Chinese individuals (40x coverage, 11,174,603 biallelic variants). We mimicked a standard imputation process using the 930,000 SNP sites from the Affymetrix Genome-Wide Human SNP Array 6.0 and estimated the imputation accuracy (R2) by comparing the imputed genotype dosage with true genotypes derived from high-coverage sequencing at sites absent on the array (Methods). Across all allele frequency ranges, the BIGCS panel consistently exhibited superior R2 and true variant coverage when compared with commonly used reference panels for individuals of Chinese ancestry, including the 1KGP3 reference panel (n = 2,504), the haplotype reference consortium panel25 (HRC) (n = 32,470), the GenomeAsia100K Project reference panel26 (GAsP) (n = 1,654) and the multi-ethnic TOPMed reference panel27 (TOPMed) (n = 97,256) (Fig. 1f, Extended Data Fig. 2 and Supplementary Tables 4 and 5). Specifically, the BIGCS reference panel achieved an average R2 of 0.715 for low-frequency variants (1% ≤ MAF < 5%) and 0.935 for common variants (MAF ≥ 5%), surpassing other panels (Fig. 1f). For rare variants with MAF ≤ 1%, the BIGCS panel also exhibited the highest average R2 (R2 = 0.466) compared to the 1KGP3 (R2 = 0.365), HRC (R2 = 0.294), GAsP (R2 = 0.321) and TOPMed (R2 = 0.465) panels. Combining the BIGCS and TOPMed panel through meta-imputation further enhanced accuracy for low-frequency (R2 = 0.723) and rare variants (R2 = 0.495) (Fig. 1f). 
Filtering variants on the basis of the R2 values estimated by Minimac3 improved the imputation accuracy of the BIGCS reference panel (rare variants: R2 ≥ 0.824, low-frequency variants: R2 ≥ 0.880, common variants: R2 ≥ 0.931; Supplementary Fig. 7 and Supplementary Table 4). The higher imputation accuracy of the BIGCS panel reflects the need to address the underrepresentation of Chinese genetic diversity worldwide and highlights the benefits of a family design in constructing population haplotype references.
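The accuracy measure described above can be sketched as follows. This is a minimal Python illustration, assuming per-variant arrays of true genotypes (0/1/2) and imputed dosages; the function names and toy data are hypothetical. It uses the squared Pearson correlation and assigns zero to true variants that a panel fails to impute, as stated in the Fig. 1f legend.

```python
import numpy as np


def dosage_r2(imputed, truth):
    """Squared Pearson correlation between imputed dosage and true genotype."""
    imputed = np.asarray(imputed, dtype=float)
    truth = np.asarray(truth, dtype=float)
    if imputed.std() == 0 or truth.std() == 0:
        return 0.0  # correlation is undefined for a constant vector
    return float(np.corrcoef(imputed, truth)[0, 1] ** 2)


def mean_r2(truth_by_variant, imputed_by_variant):
    """Average R2 over all true variants; variants missing from the panel count as 0."""
    scores = [
        dosage_r2(imputed_by_variant[vid], truth)
        if vid in imputed_by_variant else 0.0
        for vid, truth in truth_by_variant.items()
    ]
    return sum(scores) / len(scores)


# Toy data (hypothetical): v1 is imputed perfectly, v2 is absent from the panel.
truth = {"v1": [0, 1, 2, 1], "v2": [0, 0, 1, 2]}
imputed = {"v1": [0.0, 1.0, 2.0, 1.0]}
toy_mean = mean_r2(truth, imputed)
```

Counting absent variants as zero penalizes panels with poor variant coverage, so the averaged value reflects both accuracy and coverage.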
Fine-scale genetic structure
The BIGCS cohort features a substantial representation of individuals from Guangdong province in southern China, the place of origin and evolution of Cantonese, which distinguishes it from the CHB, CHS and CDX populations recruited in 1KGP (ref. 4), as well as recently investigated populations from east, north and central China23,28,29 and the Chinese groups collected in the Allen Ancient DNA Resource30 (AADR) (Extended Data Fig. 3a,b and Supplementary Tables 1 and 6a). To uncover the population genetic structure and history of the BIGCS cohort, we used diverse population genetic methodologies, including principal component analysis (PCA), admixture, F statistics, multiple sequentially Markovian coalescent, qpAdm and ChromoPainter, along with Globetrotter analysis (Methods).
In a PCA conducted solely on BIGCS participants, we observed a strong correlation between individual genetic ancestry and linguistic affiliation (PC1 versus dialect: Spearman’s r = 0.69, P = 6.7 × 10⁻²⁹⁶; PC2 versus dialect: Spearman’s r = 0.19, P = 4.83 × 10⁻¹⁸), surpassing the correlation with geographic birthplace and ethnicity (Fig. 2a,b and Supplementary Fig. 8a,b). Unlike the predominant Mandarin-speaking group in China, the dominant Cantonese-speaking group in BIGCS forms a representative southern Chinese cluster. For the other two widely spoken languages in Guangdong, the Min group displayed substantial genetic distance from the Cantonese group, and the Hakka group occupied an intermediate position between the Cantonese and Min groups in the PCA space (Fig. 2b). In the admixture analysis of the five linguistic groups with a sample size greater than 50 (Mandarin, Cantonese, Hakka, Min and Xiang), the optimal number of ancestral components was inferred to be K = 3 (Extended Data Fig. 4). These three ancestral components corresponded to Cantonese-, Min- and Mandarin-enriched components, delineating the structure of the major linguistic groups within BIGCS. Despite smaller sample sizes, considerable genetic differentiation was observed among the Gan, Hui, Wu, Dianbai-Li and Zhuang dialect groups (Fig. 2c). Multiple sequentially Markovian coalescent analysis suggested similar population dynamics among the linguistic groups, with the lowest effective population size occurring at around 25 thousand years ago (ka) (Supplementary Fig. 9a), corresponding to the peak of the Last Glacial Maximum. Divergence among the 5 linguistic groups began around 6 to 7 ka (Supplementary Fig. 9b), coinciding with the Neolithic Period in China (8000 BC to 2000 BC), when the ancient Chinese agricultural system steadily formed and developed (Supplementary Notes).
a, Geographical distribution of the eight major Chinese dialect groups (Mandarin, Cantonese, Min, Hakka, Gan, Xiang, Wu and Hui), and the two minority dialects (Li and Zhuang Chinese) within the BIGCS study, according to the Language Atlas of China49. The map was sourced from an approved standard map service (http://bzdt.ch.mnr.gov.cn) endorsed by the Ministry of National Resources of the People’s Republic of China (GS YUE(2023)1422). b, PCA of all 2,245 unrelated parental participants in the BIGCS, coloured by their spoken dialects and shaped by the geographical region of their birthplaces. Each point represents one participant and is positioned according to their principal component values. c, The same PCA plot is colour-coded by specific dialects, with the remaining individuals shown in grey in each subplot. The title of each subplot indicates the dialect name, along with the sample size in parentheses. d, ADMIXTURE analysis comprising the 10 linguistic groups (n = 2,245, red labels) from the BIGCS cohort and 56 representative present-day (n = 1,422, black labels) and ancient Asian groups (n = 210, blue labels) from the AADR, for k = 11 ancestral components. ADMIXTURE results for k ancestral components ranging from 2 to 15 are presented in Supplementary Fig. 12. Each individual is represented by a small bar, with colour proportions indicating the proportion of specific ancestral components. Mandarin speakers are further categorized into seven groups on the basis of their birthplace: Mandarin_SC (South China), Mandarin_SWC (Southwest China), Mandarin_CC (Central China), Mandarin_EC (East China), Mandarin_NWC (Northwest China), Mandarin_NEC (Northeast China), Mandarin_NC (North China). Further information about the 56 representative Asian groups in AADR is available in Supplementary Table 6b.
Comparisons with present-day global and Asian ancestral groups from AADR using PCA, admixture and fixation index (Fst) demonstrated that all ten BIGCS linguistic groups are related to both Sino-Tibetan-speaking and Tai-Kadai-speaking populations (Extended Data Fig. 3c,d and Supplementary Figs. 10 and 11). In the admixture clustering analysis involving 56 present-day and ancient Asian groups from AADR (Supplementary Table 6b), the optimal number of clusters was inferred as K = 11, revealing two primary ancestral components within the BIGCS population (Fig. 2d and Supplementary Fig. 12). The first ancestral component was ubiquitously found among present-day and historic southern East Asians, particularly in present-day Tai-Kadai-speaking and Austroasiatic populations and the ancient populations from Guangxi (GaoHuaHua and BaBanQinCen) and Vietnam (blue). The second component was most prevalent and maximized in the ancient and present-day northern East Asians including Neolithic and pre-Neolithic northern East Asians (nEA), CHB and JPT population groups (orange). We further examined the genetic relationship of BIGCS groups with representative nEA and Neolithic and pre-Neolithic southern East Asians (sEA) using f4 statistics (Methods). The majority of the linguistic groups within BIGCS, except for Hui, exhibited genetic relatedness to both nEA from the Amur River, Yellow River and Shandong31,32,33 spanning 19 ka to 4 ka and sEA from Fujian, dating back 12 ka to 8.5 ka32,34 (f4(nEA/sEA, BIGCS populations, sEA/nEA, Mbuti) < 0, Z < −3, Supplementary Table 7a). Notably, when comparing the linguistic groups to the sEA represented by the Qihe3 and Qihe from Fujian, the genetic affinity of the linguistic groups with nEA dating within the last 10,000 years was more pronounced, with f4(nEA < 10 ka, Qihe3/Qihe; BIGCS pops, Mbuti) > 0 (Supplementary Table 7a).
These results imply a closer genetic affinity between the BIGCS linguistic groups and ancient northern East Asian populations compared with their relationships with ancient southern East Asian populations. This observation aligns with previous findings among present-day East Asians32, and is further substantiated by results from the qpAdm (Supplementary Table 7c) as well as ChromoPainter and Globetrotter analyses (Supplementary Fig. 13 and Supplementary Table 8). In a symmetric f4 test including six representative deep ancient lineages, we did not find deep ancestral components for the BIGCS linguistic groups (Supplementary Table 7b). A more detailed interpretation of the f4 test, qpAdm and the ChromoPainter analytical results is presented in Supplementary Notes.
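For readers unfamiliar with the f4 statistic used above, the following Python sketch shows its standard definition, the average over SNPs of (pA − pB)(pC − pD) for population allele frequencies p, together with a simple Z score. The unweighted delete-one-block jackknife here is a simplifying assumption for illustration; the study's actual computation is described in its Methods, and the function names and toy frequencies are hypothetical.

```python
import numpy as np


def f4_terms(p_a, p_b, p_c, p_d):
    """Per-SNP products (pA - pB) * (pC - pD) of allele-frequency differences."""
    p_a, p_b, p_c, p_d = (np.asarray(p, dtype=float)
                          for p in (p_a, p_b, p_c, p_d))
    return (p_a - p_b) * (p_c - p_d)


def f4(p_a, p_b, p_c, p_d):
    """f4(A, B; C, D): mean of the per-SNP products."""
    return float(f4_terms(p_a, p_b, p_c, p_d).mean())


def f4_z(p_a, p_b, p_c, p_d, n_blocks=10):
    """Z score from an unweighted delete-one-block jackknife over contiguous SNP blocks."""
    terms = f4_terms(p_a, p_b, p_c, p_d)
    blocks = np.array_split(np.arange(terms.size), n_blocks)
    # Leave-one-block-out estimates of f4.
    loo = np.array([np.delete(terms, idx).mean() for idx in blocks])
    se = np.sqrt((n_blocks - 1) / n_blocks * np.sum((loo - loo.mean()) ** 2))
    return float(terms.mean() / se)


# Toy frequencies (hypothetical) at two SNPs for four populations.
toy_f4 = f4([0.1, 0.5], [0.3, 0.5], [0.2, 0.4], [0.6, 0.1])
```

A value near zero is expected when A and B are symmetrically related to C and D; a significantly non-zero value (commonly |Z| > 3) indicates excess allele sharing along one pairing, which is how the sign conventions in Supplementary Table 7a are read.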
East Asian-specific genetic association
The prevailing focus of GWAS on European populations leaves a gap in our understanding of the genetic underpinnings of a broader spectrum of maternal and infant traits within the East Asian population. In this investigation, we performed GWAS on 12 adult traits involving 2,245 parents, and 6 infant traits among 1,808 children of Chinese ancestries from the BIGCS cohort (Extended Data Table 2, Supplementary Fig. 14 and Methods).
In total, we identified ten loci associated with eight adult traits and three loci associated with two infant traits, reaching study-wide genome-wide significance (P < 2.78 × 10⁻⁹) (Fig. 3, Extended Data Fig. 5 and Supplementary Table 9a). Replication of 6 of the signals in an independent GWAS involving 21,022 to 26,103 Chinese pregnancies sequenced by noninvasive prenatal testing22 (NIPT) demonstrated a 100% replication rate (Supplementary Table 9a and Methods). Among the 13 signals, 9 had been previously associated with the same or a similar trait in studies encompassing various population ancestries in the GWAS catalogue35 or PhenoScanner36 (Methods), and 1 further signal (rs10830963) had been reported in the literature37 (Supplementary Table 9a). Of the remaining three genetic loci that were not previously reported, two were associated with two maternal traits (TBA and GWG) and one influenced early-life low density lipoprotein (LDL) levels (Fig. 3). An additional 4 loci that did not meet the study-wide significance threshold but surpassed the genome-wide significance level (P < 5 × 10⁻⁸) are presented in Supplementary Table 9b and Supplementary Fig. 15.
a, Manhattan plot displaying all variants with a P value of less than 10⁻³. The x axis represents the chromosomal position and the y axis represents the negative logarithmic transformation of the P value derived from the GWAS linear mixed regression model. The red dashed horizontal line represents the genome-wide significance threshold at P = 5.0 × 10⁻⁸. Each association locus is annotated with the nearest gene symbol, accompanied by the corresponding trait in parentheses. Gene symbols in red indicate genetic association loci (1-Mb window centred on the lead SNP) that were not previously identified in the GWAS catalogue or PubMed. Gene symbols in grey indicate genetic associations that surpassed the genome-wide association significance threshold (P < 5 × 10⁻⁸) but did not meet the study-wide association significance threshold (P < 2.78 × 10⁻⁹). The vertical red dots represent loci with age-specific maternal and infant genetic effects on the same trait. OGTT, oral glucose tolerance test; TC, total cholesterol; TG, triglycerides. b, A forest plot displaying the 17 independent loci that reached a genome-wide significance level of P < 5 × 10⁻⁸. The effect size (β value) of genetic associations with maternal traits is depicted in orange, and the effect size of genetic associations with infant cord blood traits is shown in blue. For each locus, the effect allele corresponds to the non-reference allele. The effect is quantified in terms of the number of standard deviation changes in the trait per unit increase in genotype dosage.
The lead SNP associated with TBA is a missense variant (rs2296651-A, c.800C>T/p.Ser267Phe) in the coding region of SLC10A1 (also known as NTCP) (beta = 0.88, 95% confidence interval: 0.75 to 1.01, P = 2.16 × 10⁻³⁹) (Figs. 3 and 4a,b). This association was replicated in the NIPT dataset (beta = 0.47, 95% confidence interval: 0.33 to 0.61, P = 5.37 × 10⁻¹¹) (Supplementary Table 9). Notably, the rs2296651-A allele was also associated with a 4.55-fold increased risk of cholestasis during pregnancy in the BIGCS cohort (odds ratio (OR) = 4.55, 95% confidence interval: 3.12 to 5.93, P = 4.2 × 10⁻¹⁰) (Supplementary Fig. 16). The rs2296651-A allele frequency was higher among the southern Chinese (9.96%) compared with those in northern and northeastern China (2.46% and 2.33%) (Fig. 4c), and was absent in non-East Asian populations (Fig. 4d). This discrepancy may be attributed to a potential role of the p.Ser267Phe variant in conferring resistance against chronic hepatitis B virus infections38. Our study provides initial evidence of an East Asian-specific association between the p.Ser267Phe variant and increased levels of TBA, which has been linked to an increased risk of perinatal morbidity and mortality39.
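The study-wide threshold quoted above is numerically consistent with dividing the conventional genome-wide threshold of 5 × 10⁻⁸ by the 18 traits analysed (12 adult and 6 infant). The short Python sketch below reproduces that arithmetic and the resulting three-tier classification; treating the threshold as a Bonferroni correction over traits is an inference from the numbers, and the Methods remain the authoritative source. The helper `classify` is hypothetical.

```python
GENOME_WIDE = 5e-8          # conventional genome-wide threshold
N_TRAITS = 12 + 6           # adult traits + infant traits analysed in the GWAS

# Assumed derivation: Bonferroni correction across traits.
STUDY_WIDE = GENOME_WIDE / N_TRAITS


def classify(p_value):
    """Label a lead-SNP P value by the strictest threshold it passes."""
    if p_value < STUDY_WIDE:
        return "study-wide"
    if p_value < GENOME_WIDE:
        return "genome-wide"
    return "not significant"


threshold_text = f"{STUDY_WIDE:.2e}"  # formatted for comparison with the text
```

Under this reading, the 13 loci reported in the main text fall in the "study-wide" tier and the 4 additional loci in the "genome-wide" tier.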
a, LocusZoom plots of the SLC10A1 locus associated with TBA. The missense lead SNP rs2296651 is represented by a purple circle. GWAS annotations were derived from the GWAS catalogue (e107_r2022-09-30). Linkage disequilibrium r2 values were calculated using the East Asian population from the 1000 Genomes Project. b, Comparison of TBA based on the three genotypes of the missense lead SNP rs2296651. The P value obtained from linear regression analysis is shown for each pairwise comparison between the genotype categories. c,d, Frequencies of the non-reference allele A of rs2296651 (p.Ser267Phe) in China (c) and globally (d), respectively. The data for China were obtained from BIGCS, and the global data were acquired from the Chicago allele frequency website, utilizing information from the 1000 Genomes Project. The map in c was sourced from an approved standard map service (http://bzdt.ch.mnr.gov.cn) endorsed by the Ministry of National Resources of the People’s Republic of China (GS YUE(2023)1422).
The second novel locus was associated with the rate of maternal GWG in kg per week (GWG rate) (Fig. 3). The 4-bp lead rs3840091-CCAGA deletion, which is associated with accelerated weight gain during gestation (beta = 0.33, 95% confidence interval: 0.23 to 0.44, P = 5.45 × 10⁻¹⁰, Fig. 3), is common in the BIGCS cohort (allele frequency (AF) = 12.18%) and the East Asian population (AF = 13.04%) but less common in European (AF = 4.15%) or African (AF = 2.17%) populations, according to gnomAD24. This may explain why its genetic effect on GWG was not identified in previous array-based GWAS studies targeting European populations40. Five nearby genetic variants also demonstrated a genome-wide significant effect (P < 5 × 10⁻⁸) (Supplementary Table 10). All six of these genetic associations were whole-blood expression quantitative trait loci (eQTL) for PITPNB, which is involved in lipid binding41. One copy of the 4-bp deletion is associated with both increased PITPNB expression and the GWG rate (Supplementary Fig. 17). Evidence from the rat genome database42 suggested that the rat orthologous genes of both the human PITPNB and TTC28 genes are eQTL that determine body weight gain in the rat population43. This underscores the potential causal role of this locus in GWG rate in the East Asian population, warranting further replication in other maternal cohorts and validation of its functions.
Three loci were associated with two infant lipid traits (P < 2.78 × 10⁻⁹), representing genetic influences on early-life lipid levels. The association of the APOE locus with high density lipoprotein (HDL) (rs7412-T, beta = 0.42, 95% confidence interval: 0.29 to 0.55, P = 2.19 × 10⁻¹⁰) and LDL (rs72654473-A, beta = −0.71, 95% confidence interval: −0.83 to −0.59, P = 1.98 × 10⁻³⁰) was previously known in adults (Supplementary Table 9). The remaining SOAT2 locus associated with infant LDL levels has not been reported previously and demonstrates an age-specific genetic effect, as detailed below.
Age-specific genetic effects
We observed a notable discrepancy in the genetic effect of 4 out of the 12 lipid-associated loci on the same lipid trait between mothers and infants through a two-sample t-test (P < 0.004 adjusted for multiple comparisons; Fig. 5 and Supplementary Notes). These included one locus related to LDL levels in infant cord blood and three loci related to total cholesterol, LDL and triglyceride levels in maternal peripheral blood (Fig. 5 and Supplementary Table 9). Despite similar sample sizes, the chromosome 12 locus associated with LDL level in infant cord blood (rs137994041-AGTTT located in the intron of SOAT2, beta = 0.34, 95% confidence interval: 0.23 to 0.45, P = 1.66 × 10⁻⁹) did not show a genetic association with LDL levels in maternal peripheral blood (beta = −0.007, 95% confidence interval: −0.111 to 0.096, P = 0.892). Conversely, three loci, including the APOE locus associated with total cholesterol levels, the CELSR2 locus associated with LDL levels, and the APOA5 locus associated with triglyceride levels in maternal peripheral blood, did not manifest a genetic effect on the same lipid traits in infant cord blood (P > 0.05) or demonstrated an opposite genetic effect (Fig. 5 and Supplementary Table 9). Two additional loci connected with HDL and triglyceride levels in infant cord blood that met the genome-wide significance threshold also exhibited distinct genetic effects (Supplementary Fig. 18).
a, Histograms and density plots illustrating the distribution of each of the four lipid traits for both mothers and infants. b, Forest plot displaying the genetic effect differences between mothers and infants at four loci surpassing the study-wide genome-wide association significance (P < 2.78 × 10⁻⁹). The effect size (β value) of the genetic associations with maternal traits is represented in orange, and the effect sizes of the genetic associations with infant cord blood traits are represented in blue. The non-reference allele was considered as the effect allele for each locus. The effect is measured in terms of the number of standard deviation changes in the trait per unit increase in genotype dosage. The statistical test of the difference, a two-sided two-sample t-test, is detailed in the Supplementary Notes.
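To make the comparison of maternal and infant effects concrete, the Python sketch below applies a simple difference-of-effects test to the SOAT2 estimates quoted above. It is an approximation under stated assumptions: standard errors are back-calculated from the reported 95% confidence intervals, a normal reference distribution is used, and the two estimates are treated as independent even though mothers and infants are related. The exact test used in the study is described in its Supplementary Notes; the function names here are hypothetical.

```python
import math


def se_from_ci(lower, upper, z_crit=1.96):
    """Approximate standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * z_crit)


def effect_difference_test(beta_1, se_1, beta_2, se_2):
    """Z statistic and two-sided normal P value for beta_1 - beta_2."""
    z = (beta_1 - beta_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p


# SOAT2 locus, LDL: infant cord blood versus maternal peripheral blood.
infant_se = se_from_ci(0.23, 0.45)
maternal_se = se_from_ci(-0.111, 0.096)
z_stat, p_diff = effect_difference_test(0.34, infant_se, -0.007, maternal_se)
```

Under these assumptions the difference lies well beyond the multiple-comparison threshold of P < 0.004 stated in the text, in line with the reported age-specific effect at this locus.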
To better understand the reasons underlying these differences, we compared the distribution of the four lipid traits between maternal peripheral blood and infant cord blood, assayed using the same experimental approach. Notably, we observed substantial disparities in lipid levels between mothers and infants (Fig. 5a) (triglyceride: Pearson’s r = −0.014, total cholesterol: r = 0.081, LDL: r = 0.096, HDL: r = 0.17) (Supplementary Fig. 19). Furthermore, we disentangled the haplotype genetic effect and conducted a multivariable regression of the lipid traits on the phased allelic dosage of the maternal transmitted allele (h1), the maternal non-transmitted allele (h2), and the paternal transmitted allele (h3) for the 13 genetic loci (Supplementary Table 9). The genetic effects on maternal traits were restricted to the h1 and h2 alleles regardless of their transmission status. The genetic effects on infant cord blood traits were restricted to the h1 and h3 alleles but not the non-transmitted h2 alleles. No parent-of-origin effect44 was observed for the genetic association signals reported in our study. A search in the GTEx portal revealed that the lead SNP of four out of the five loci with distinct genetic effects were eQTL in multiple tissues (Supplementary Fig. 20), suggesting the functional relevance of these genetic loci. The distinct effects identified in this study may therefore be driven by an effect modification by aging and we refer to it as an age-specific genetic effect. These preliminary findings regarding age-specific genetic effects on lipid traits warrant further replication and validation studies to understand their mechanisms and implications.
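The haplotype-level regression described above can be illustrated on simulated data. In this Python sketch, h1, h2 and h3 follow the paragraph's definitions (maternal transmitted, maternal non-transmitted and paternal transmitted alleles); the allele frequency, effect size, sample size and function name are arbitrary assumptions chosen for illustration, and ordinary least squares stands in for whatever model the study fitted.

```python
import numpy as np


def haplotype_effects(trait, h1, h2, h3):
    """OLS coefficients of a trait on h1, h2 and h3 (intercept fitted, then dropped)."""
    design = np.column_stack([np.ones(len(trait)), h1, h2, h3])
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return coef[1:]


# Simulated cohort (all values hypothetical).
rng = np.random.default_rng(42)
n, freq, beta = 20_000, 0.3, 0.5
h1 = rng.binomial(1, freq, n)  # maternal transmitted allele
h2 = rng.binomial(1, freq, n)  # maternal non-transmitted allele
h3 = rng.binomial(1, freq, n)  # paternal transmitted allele

# The infant carries h1 and h3; the mother carries h1 and h2.
infant_trait = beta * (h1 + h3) + rng.normal(0, 1, n)
maternal_trait = beta * (h1 + h2) + rng.normal(0, 1, n)

infant_fit = haplotype_effects(infant_trait, h1, h2, h3)
maternal_fit = haplotype_effects(maternal_trait, h1, h2, h3)
```

In this simulation the infant trait loads on h1 and h3 with a coefficient near zero for h2, and the maternal trait loads on h1 and h2, which mirrors the pattern reported for the 13 loci. A non-zero h2 coefficient on an infant trait would instead point to an effect acting through the maternal environment.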
For comparison, we also conducted a family-based association study using the linear mixed model implemented in SAIGE, which incorporates both within-family and between-family information, for each of the four lipid traits using both parent and infant samples (Methods). As expected, the inclusion of all family members in the model increased the power compared to the GWAS analyses on parents and infants separately (Supplementary Table 11). All six lipid-associated loci without age-specific effects between adults and infants were consistently identified in this joint-sample analysis with smaller P values (Supplementary Table 11). However, this approach missed information in cases where age-specific genetic effects occur. For example, except for the CELSR2 locus associated with LDL and the APOA5 locus associated with triglyceride, the remaining three adult- or infant-specific genetic associations that surpassed the genome-wide significance threshold were not detected in the joint-sample analysis.
Intergenerational Mendelian randomization
Fetal growth, which has been related to long-term health outcomes8, has been hypothesized to be influenced by maternal phenotypes15,45,46,47. To investigate potential causal relationships between maternal phenotypes and fetal growth measurements, we created haplotype genetic scores for both transmitted and non-transmitted alleles. Utilizing these genetic scores and their linear combinations, we performed an intergenerational Mendelian randomization analysis within the BIGCS cohort (Methods and Supplementary Notes). We explored the putative maternal causal effects of seven maternal quantitative measurements (maternal height, pre-pregnancy body mass index (BMI), blood pressure, fasting plasma glucose (FPG), TBA, triglyceride and total cholesterol levels) and fetal genetic effects on three extensively studied birth outcomes (birth weight, birth length and gestational duration at birth) and four quantitative metabolic factors in cord blood (HDL, LDL, triglyceride and total cholesterol levels) (Extended Data Table 3, Extended Data Fig. 6 and Supplementary Table 12a). To facilitate interpretation of the results, we compared our findings with those from a study including multiple European-ancestry populations48 and an Icelandic study44, provided a power analysis incorporating instrument strength and sample size (Extended Data Table 4, Supplementary Table 13 and Supplementary Fig. 21), and presented a secondary analysis excluding specific SNPs (Supplementary Table 12b).
We confirmed several known putative causal relationships observed in prior studies of European-ancestry populations44,48. For example, we found a positive causal effect of maternal height on birth weight, with an estimated increase of 14.99 g (95% confidence interval: 1.09 to 28.88, P = 3.45 × 10−2) per 1 cm increase in maternal height. We also found an increase in birth weight of 561.2 g (95% confidence interval: 75.9 to 1046.5, P = 2.34 × 10−2) per 1 mmol l−1 increase in maternal FPG (Extended Data Table 3). These results align with prior findings from European populations48 (7.89 g, 95% confidence interval: 4.53 to 11.25 for maternal height and 408.33 g, 95% confidence interval: 41.67 to 775.00 for FPG). Further, we identified a negative fetal genetic effect on birth weight, indicating a decrease of 131.22 g (95% confidence interval: −238.84 to −23.60 g, P = 1.69 × 10−2) per unit increase in genetic scores associated with a 1 mmHg increase in blood pressure, consistent with both the multi-ancestral and Icelandic studies44,48. Concerning birth length, we observed a negative fetal genetic effect, with a reduction of 0.77 cm (95% confidence interval: −1.26 to −0.28, P = 1.78 × 10−3) per unit increase in genetic scores associated with a 1 mmHg increase in blood pressure, consistent with the Icelandic study44.
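As a rough consistency check on estimates of this kind, the standard error and two-sided P value can be approximately recovered from a reported 95% confidence interval. The sketch below assumes a symmetric, normal-approximation (Wald) interval; that is an assumption made here for illustration, not a statement of the test used in the study, and the function name `wald_from_ci` is hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def wald_from_ci(estimate, lower, upper):
    """Recover (standard error, z score, two-sided P) from a symmetric 95% CI.

    Assumes the interval was built as estimate +/- Z_95 * SE.
    """
    standard_error = (upper - lower) / (2.0 * Z_95)
    z_score = estimate / standard_error
    # Two-sided normal tail probability: P = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return standard_error, z_score, p_value


# Maternal height on birth weight, values as reported in the text
se, z, p = wald_from_ci(14.99, 1.09, 28.88)
print(f"SE = {se:.2f} g, z = {z:.2f}, P = {p:.3g}")
```

Under these assumptions the recovered P value for the maternal height estimate is close to the reported 3.45 × 10−2.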
Additionally, we identified previously unreported effects, including a negative maternal effect, leading to a 0.42 cm reduction in birth length (95% confidence interval: −0.69 to −0.15, P = 2.27 × 10−3) for each 1 mmol l−1 increase in maternal TBA. We also identified a negative fetal genetic effect of total cholesterol on cord blood LDL and triglyceride levels, resulting in an average decrease of 0.77 mmol l−1 in LDL (95% confidence interval: −1.30 to −0.24, P = 4.57 × 10−3) and 0.33 mmol l−1 in triglyceride (95% confidence interval: −0.61 to −0.05, P = 2.21 × 10−2) per genetic score associated with a 1 mmol l−1 increase in maternal total cholesterol. Additionally, we observed a negative fetal genetic effect of blood pressure and a positive fetal genetic effect of TBA on total cholesterol level in cord blood. These effects remained consistent in the secondary analysis (Supplementary Table 12).
A detailed interpretation of our methods and the potential biological mechanisms underlying the findings is provided in the Supplementary Notes. Although these estimated effects provide insights into the potential causal relationship between maternal phenotypes and fetal growth, the causal effects observed in this study are suggestive and require further investigation through mechanistic experiments and replication studies. Furthermore, power constraints may have limited the detection of more subtle causal effects, including several known causal relationships reported in prior multi-ancestral and Icelandic studies44,48. As more comprehensive phenotypic assessments, such as molecular phenotypes, become available, this pilot study will lay the foundation for a deeper comprehension of the mechanisms underlying the observed putative causal relationships between maternal traits and birth outcomes.
Discussion
Despite the inherent challenges in establishing and maintaining birth cohorts with long-term follow-up, genomic research using these resources is a critical approach for unravelling the genetic and environmental influences of early life on later-life health1. The phase I BIGCS genome study presented here provides initial insights into the genetic knowledge and utility derived from genetic investigations within an Asian birth cohort. We identified a total of 18.3 million novel genetic variants and developed a reference panel that enhances genotype imputation accuracy for individuals of Chinese descent. Among the southern Chinese participants, we discerned a fine-scale local genetic structure, primarily linked to linguistic affiliations and characterized by an ancient northern–southern East Asian admixture. We identified 13 genetic associations for 18 traits, including two previously undiscovered East Asian-specific genetic associations with TBA and GWG, and three suggestive genetic associations with infant cord blood traits that likely influence early-life lipid levels. In particular, suggestive age-specific genetic associations for five loci were identified, demonstrating distinct genetic effects on the same lipid traits between adults and infants. In our exploratory intergenerational Mendelian randomization analysis, several putative causal relationships were confirmed, including positive maternal causal effects of maternal height and fasting plasma glucose, along with a negative fetal genetic effect of blood pressure on birth weight. We also identified previously unreported effects, including a negative maternal causal effect of TBA on birth length, a negative fetal genetic effect of total cholesterol on cord blood LDL and triglyceride levels, as well as contrasting fetal genetic effects of blood pressure and TBA on cord blood total cholesterol. These findings provide new insights into the genetic diversity within the southern Chinese population and highlight potential genetic influences on maternal and early-life traits in East Asia.
The methods and genetic discoveries in this study provide a proof-of-concept methodological framework for future medical and genetic studies of the human population. We have recruited and deeply phenotyped over 50,000 Chinese maternal-infant pairs or trios in the BIGCS cohort, with longitudinal follow-up from early gestation to 18 years of age. We are organizing the logistics to sequence additional samples in the phase II study, with a specific focus on perinatal outcomes, and ultimately sequencing all participants in the future. As we continue to develop and release genomic data, future efforts will be prioritized to update the variation dataset, reference panel, genotype-phenotype associations and to quantify the complex interplay of a comprehensive spectrum of environmental and genetic factors during early life and their roles in shaping not only birth outcomes but also childhood and adult health.
Methods
Cohort description
The BIGCS cohort, established in 2012 in Guangzhou, China, is one of the largest-scale prospective birth cohorts in Asia, designed to track a wide range of physical and biochemical measurements of the participants from the prenatal period to adulthood2. One primary goal of BIGCS is to investigate the impact of early-life traits on developmental health. To achieve this objective, pregnant women residing in Guangzhou who attend their first routine antenatal examinations at GWCMC are recruited along with their husbands and offspring. To ensure the feasibility of follow-up for the offspring, only pregnant women who were living in Guangzhou and planned to continue residing and raising their children there were eligible for participation. Eligible participants are identified and invited to participate in the study by trained personnel, and all participants in the BIGCS cohort have provided informed consent. It should be noted that while Guangzhou is a city with extensive international trade and work opportunities, the population makeup of the BIGCS cohort is predominantly indigenous Chinese. Further details of the sampling frame are available in Supplementary Notes.
By the end of 2021, BIGCS had recruited more than 50,000 trio or duo families. From all the BIGCS participants, we randomly selected 1,981 families, including 426 trios and 1,555 duos, that met the following criteria: (1) mothers who delivered babies at GWCMC between October 2013 and March 2017; (2) mothers pregnant with singletons; (3) mothers without major pre-pregnancy diseases, including diabetes, thyroid disorders, hypertension, viral hepatitis and kidney disease, or low prevalence conditions (prevalence rate <0.01 in BIGCS), such as Down syndrome; (4) availability of maternal peripheral blood during gestational weeks 14–28 and cord blood samples at birth; (5) completion of an OGTT during mid-pregnancy. Approximately 4.2% of the samples lacked high-quality sequencing libraries when sent for sequencing and were excluded, resulting in 353 trios, 1,493 duos (8 father–offspring and 1,485 mother–offspring), and 170 unrelated single individuals (53 infants, 100 adult females and 17 adult males) (n = 4,215). We did not use statistical methods to predetermine sample size. The study was approved by the Ethics Committee of Guangzhou Women and Children’s Medical Center (no. 2012[015] and 2017102302). The participants consented to the publication of the research results.
Whole-genome sequencing of the 4,215 participants
Paired-end 100 bp WGS with an average insert size of 214 bp was performed on the BGISEQ-500 platform with Magnetic Beads Blood Genomic DNA Extraction Kit. The average sequencing depth was ~6.63x (Supplementary Fig. 3a,b). The duplication rate was minimal, averaging 1.7% (Supplementary Fig. 3c). To enhance data quality, we employed SOAPnuke (v1.5.6)50 to filter out adapter sequences and eliminate poor-quality bases from the raw sequencing data. A read was excluded if its sequence closely matched adapter sequences, allowing for less than two mismatches. Additionally, reads were excluded if they exhibited over 50% low-quality bases (base quality <12) or contained more than 10% N bases.
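The two content-based read filters described above (low-quality base fraction and N fraction) can be sketched as follows. This is an illustrative re-implementation of the stated thresholds, not SOAPnuke's own code; adapter matching is omitted, and the function and constant names are hypothetical.

```python
from typing import Sequence

LOW_QUALITY_CUTOFF = 12          # bases with quality below this count as low quality
MAX_LOW_QUALITY_FRACTION = 0.5   # exclude reads with more than 50% low-quality bases
MAX_N_FRACTION = 0.1             # exclude reads with more than 10% N bases


def keep_read(sequence: str, qualities: Sequence[int]) -> bool:
    """Return True if a read passes both content filters stated in the text."""
    if not sequence or len(sequence) != len(qualities):
        return False  # malformed record (assumption: treat as failing)
    length = len(sequence)
    n_fraction = sequence.upper().count("N") / length
    low_quality_fraction = sum(q < LOW_QUALITY_CUTOFF for q in qualities) / length
    return (low_quality_fraction <= MAX_LOW_QUALITY_FRACTION
            and n_fraction <= MAX_N_FRACTION)


# A 10-base read with six low-quality bases (60%) is excluded
print(keep_read("ACGTACGTAC", [5] * 6 + [30] * 4))
```

In this sketch the thresholds are exclusive ("over 50%", "more than 10%"), so a read sitting exactly on a threshold is kept.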
SNP array genotyping of 240 participants
We selected 240 adult female participants from the total 4,215 participants for SNP array genotyping using the Illumina GSA-24 (v1.0) BeadChip SNP array (https://www.illumina.com/products/by-type/microarray-kits/infinium-global-screening.html). The genotypic data generated from this subset served as a gold-standard SNP dataset for benchmarking the quality control procedures employed for variant calling in this study.
High-coverage whole-genome sequencing of 50 participants
For an additional 50 healthy Chinese participants from the BIGCS cohort, not included in the previous group, we conducted high-coverage whole-genome paired-end sequencing utilizing the Illumina HiSeq X10 platform with 140-bp reads, achieving an average coverage of 40x. We aligned the clean reads to GRCh38/hg38 reference genome using BWA-MEM (v0.7.17)51. We applied GATK (v4.1.8.1) best practice joint calling protocol to detect and genotype variants in these participants. After variant quality score recalibration (VQSR) and the removal of multi-allelic variants, we obtained a set of 11,174,603 high-quality genotyped biallelic variants. This included 9,816,793 SNPs and 1,357,810 indels. We further removed SNPs that met one of the following three criteria: (1) SNPs that were located within the low-complexity regions of GRCh38; (2) SNPs absent in any one of the five reference panels; (3) SNPs classified as singletons in these 50 participants, with an allele frequency less than 0.01 in the BIGCS reference panel. The resulting 8,303,052 SNP variants were used to benchmark genotype imputation accuracy with BIGCS and other reference panels.
Variant detection, genotype calling and haplotype phasing
To facilitate efficient variant calling for thousands of samples, we devised the ilus (v1.1.1) variant calling pipeline for BIGCS (see Code availability), which was based on the GATK21 multi-sample joint calling framework and incorporates diverse quality control processes and data statistic functions (Extended Data Fig. 1). The ilus pipeline employed BWA-MEM (v0.7.17)51 for sequencing read alignment against GRCh38/hg38, VerifyBamID2 (v1.0.6) for detecting highly contaminated samples52, samtools53 for sorting and merging alignment reads from sequencing lanes and libraries, GATK (v4.1.8.1) MarkDuplicates for identifying PCR duplicate reads, GATK BaseRecalibrator for recalibrating base quality, GATK HaplotypeCaller in GVCF mode, GATK GenotypeGVCFs for joint variant calling, and GATK VariantRecalibrator for variant quality score recalibration. The subsequent steps of haplotype phasing and genotype refinement were performed using BEAGLE (v4.0)54. Variant annotation was performed by Variant Effect Predictor55 (VEP) (v95 GRCh38) with default parameters. Python and R scripts were employed for data statistics. A detailed description of each analytical step is provided in Supplementary Notes.
BIGCS reference panel construction and evaluation
To create the BIGCS reference panel, we extracted all 2,245 unrelated samples from the phased data mentioned above, which consisted of 43,055,086 high-quality SNPs and 4,184,387 indels. The phased data were converted to m3vcf format for each chromosome, a reference format compatible with Minimac3 (v2.0.1)56. To evaluate the performance of the BIGCS panel, we used variants obtained from the high-coverage WGS data (~40x on average described above) of 50 additional unrelated healthy Chinese individuals who had no familial relationships with any individuals in the BIGCS panel. Subsequently, we extracted approximately 0.93 million SNPs present in the Affymetrix Genome-Wide Human SNP Array 6.0 from these 50 samples and performed genotype imputation for the remaining variants using Minimac3. We evaluated imputation accuracy by calculating Pearson’s R2 between the true genotype from high-coverage WGS and the imputed genotype dosage generated by the BIGCS reference panel. For variants present in the true set but not in the reference panel, an R2 value of zero was assigned. The average imputation accuracy is shown for variants in each non-reference allele frequency bin ranging from 0 to 100% in Fig. 1f. Allele frequencies on the x axis were determined based on the allele frequency estimates from the BIGCS dataset. In cases where a variant was unavailable in BIGCS, its allele frequency was estimated from the 50 high-coverage WGS samples. Meta-imputation was conducted using MetaMinimac257. To enhance the utility of the BIGCS reference panel, we have established an imputation server on the BIGCS genome database website (http://gdbig.bigcs.com.cn/).
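The accuracy metric described above can be illustrated with a minimal sketch: the squared Pearson correlation between true genotypes (0, 1 or 2) and imputed dosages for one variant, with zero assigned when the variant is absent from the panel. The function names are hypothetical, and returning zero for a variant with no variance in either vector is an assumption made here, not a rule stated in the text.

```python
from typing import Optional, Sequence, List, Tuple


def squared_pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation; 0.0 if either vector has no variance (assumption)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def imputation_r2(true_genotypes: Sequence[int],
                  imputed_dosages: Optional[Sequence[float]]) -> float:
    """R2 for one variant; None means the variant is missing from the reference panel."""
    if imputed_dosages is None:
        return 0.0
    return squared_pearson(true_genotypes, imputed_dosages)


def mean_r2(variants: List[Tuple[Sequence[int], Optional[Sequence[float]]]]) -> float:
    """Average R2 over the variants falling in one allele-frequency bin."""
    return sum(imputation_r2(t, d) for t, d in variants) / len(variants)


# Two variants in one bin: one imputed perfectly, one absent from the panel
print(mean_r2([([0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0]), ([0, 1, 2, 1], None)]))
```

Counting absent variants as zero, as the text specifies, means a panel is penalized for missing variants as well as for inaccurate dosages.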
PCA and admixture
The genetic structure and diversity of the BIGCS cohort were analysed using PCA and admixture58 with unrelated parental samples alone and by merging a dataset of autosomal biallelic SNPs of BIGCS-unrelated parental samples and the representative Asian individuals from the AADR dataset30. We used PLINK2 (v2.00a3LM)59 to select SNPs with MAF ≥ 1%, genotype missing rate <5%, and HWE P value > 1.0 × 10−6. Moreover, we performed linkage disequilibrium pruning using “--indep-pairwise 50 10 0.1” in PLINK2, yielding 598,219 biallelic SNP sites for PCA among BIGCS-unrelated samples. For the merged dataset of BIGCS-unrelated samples and AADR samples, we applied slightly relaxed filtering criteria due to the smaller size of array sites in AADR. The filtering parameters included “--maf 0.01 --geno 0.3 --hwe 1e-6 --vcf-half-call m --indep-pairwise 1000 100 0.9”. PCA was carried out using smartpca command from EIGENSOFT program60. Admixture analysis was conducted from K = 2 to K = 14 with default parameters. The K value with the smallest cross-validation error rate was selected as the best model, which was K = 4 for BIGCS and the 33 Chinese groups from AADR (Supplementary Fig. 11), K = 11 for BIGCS and 56 ancient and present-day Asian groups from AADR (Supplementary Fig. 12), and K = 3 for BIGCS alone (Extended Data Fig. 4).
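The two simpler site filters used for the BIGCS-only PCA (MAF ≥ 1% and genotype missing rate < 5%) can be sketched in plain Python. This is not PLINK2's implementation; the HWE test and linkage disequilibrium pruning are omitted, and the function name and the encoding of missing calls as `None` are assumptions made for illustration.

```python
from typing import Optional, Sequence


def passes_site_filters(genotypes: Sequence[Optional[int]],
                        min_maf: float = 0.01,
                        max_missing: float = 0.05) -> bool:
    """Apply MAF and missingness filters to one biallelic SNP.

    genotypes: alternate-allele counts (0, 1 or 2) per sample, None if missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    alt_frequency = sum(called) / (2 * len(called))
    minor_allele_frequency = min(alt_frequency, 1.0 - alt_frequency)
    return minor_allele_frequency >= min_maf and missing_rate < max_missing


# 10 heterozygotes among 100 fully called samples: MAF = 0.05, kept
print(passes_site_filters([0] * 90 + [1] * 10))
```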
Ancient admixture
To investigate patterns of ancient admixture within the BIGCS linguistic groups, we compared each of the ten linguistic groups to both Neolithic and pre-Neolithic southern East Asians (sEA) and northern East Asians (nEA) (Supplementary Table 7a), as well as deep Asian lineages (Supplementary Table 7b) using f4 statistics. Calculation of the f4 statistics was performed using qpDstat in AdmixTools61 (v7.0.2). To mitigate potential genotype bias resulting from sequencing depth or platform discrepancies, we realigned the BIGCS fastq sequencing data to the same human genome reference used by AADR (hg19). Subsequently, we randomly selected one read from the alignment BAM files and generated pseudo-haploid genotypes using PileupCaller (part of sequenceTools version 1.4.0.5) with the parameter “--randomHaploid”, adhering to the same 1240 K panel SNP list (v54.1.p1_1240K_public.snp) used for the ancient and present-day Asian individuals from AADR with the parameter “-f”. Within qpDstat, we specified “f4mode: YES” and employed frequency data for groups with more than one individual, and a 0/1 count for groups with only one individual. The f4 statistics were structured as f4(Y, Z; X, Mbuti), with the present-day Central African Mbuti population serving as the outgroup. We further utilized ancient samples as sources to model ancestry proportions for the BIGCS linguistic groups through qpAdm in AdmixTools and explored their admixture history using ChromoPainter (v2) and the fastGLOBETROTTER software package62,63. More detailed methods and interpretations are provided in the Supplementary Note.
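For orientation, the point estimate of an f4 statistic of the form f4(Y, Z; X, Mbuti) is the average over SNPs of the product of two allele-frequency differences. The sketch below shows only that point estimate; it is not qpDstat, it omits the block jackknife that AdmixTools uses for standard errors and Z scores, and the function name is hypothetical. For a group with a single pseudo-haploid individual, the frequencies are simply 0 or 1.

```python
from typing import Sequence


def f4(freq_y: Sequence[float], freq_z: Sequence[float],
       freq_x: Sequence[float], freq_outgroup: Sequence[float]) -> float:
    """Point estimate of f4(Y, Z; X, Outgroup) from per-SNP allele frequencies.

    Each argument lists the frequency of the same reference allele at the
    same SNPs, in the same order.
    """
    lengths = {len(freq_y), len(freq_z), len(freq_x), len(freq_outgroup)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("frequency vectors must be non-empty and equally long")
    products = [(y - z) * (x - o)
                for y, z, x, o in zip(freq_y, freq_z, freq_x, freq_outgroup)]
    return sum(products) / len(products)


# Toy frequencies at two SNPs (illustrative values, not study data)
print(f4([0.5, 0.25], [0.25, 0.25], [1.0, 0.5], [0.0, 0.0]))
```

A positive value indicates that X shares more alleles with Y than with Z relative to the outgroup; swapping Y and Z flips the sign.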
Genome-wide association analysis
We conducted GWAS using SAIGE64, employing a linear regression model for genotype-phenotype association tests with default parameters. We analysed 18 relevant quantitative traits, including three parental traits (height, weight and BMI), nine maternal traits during pregnancy, and six traits of infants (Extended Data Table 2). For all traits, separate GWAS analyses were performed for Han Chinese adults and infants. For the four lipid traits (total cholesterol, HDL, LDL and triglyceride) that were assayed in both adult peripheral blood and infant cord blood with the same experimental protocol, we also conducted family-based GWAS. Before performing GWAS, we filtered out individuals and variants according to the following criteria for each trait:
For individuals:
(1) Relative samples having genetic relatedness up to third-degree (lcMLkin PI_HAT > 0.05);
(2) Samples whose traits were missing.
For variants (including SNPs and indels):
(1) The variants with MAF < 0.01;
(2) The variants with HWE P value < 1.0 × 10−6;
(3) The variants with genotype missing call rate > 0.01 (since linkage disequilibrium-based refinement was applied there was no missing call).
After performing these filtering steps, sample sizes and variant counts for each trait are presented in Extended Data Table 2. Covariates used for each trait were as follows: for the GWAS of parental height, pre-pregnancy weight, and BMI at recruitment, sex, age, and the first ten principal components from the PCA were used as covariates; for the GWAS of nine maternal quantitative traits, age, pre-pregnancy BMI, gravidity, parity and the first ten principal components from PCA were used as covariates; for the GWAS of infant cord blood lipid traits, including HDL, LDL, total cholesterol and triglyceride, as well as birth weight and birth length, the first ten principal components from PCA, gestational duration, and infant sex were used as covariates.
In the genome-wide analysis of all traits, we observed no statistical inflation with the genomic control lambda value (λGC) ranging from 0.89 to 1.01 (Extended Data Table 2).
For all 18 GWAS we conducted, we defined statistical significance as study-wide genome-wide significance after Bonferroni correction (P < 2.78 × 10−9) for primary discoveries. We also reported variants meeting the genome-wide significant statistical threshold (P < 5 × 10−8) as supplementary information. The lead variant at an independent locus was identified as the variant with the smallest P value and not in linkage disequilibrium (r2 < 0.1) with any other variants within a 1.0-Mb window. The nearest gene symbol of the lead SNP was annotated using the VEP55. LocusZoom65 was used to visualize loci. Novel loci were defined by the absence of any associated variants within a 1-megabase window centred around the lead SNP, with references from the GWAS catalogue35, lack of any genetic variants associated with the same traits (r2 < 0.8) with references from PhenoScanner V236 (P < 10−5), and no reported associations with the same trait in PubMed. eQTL were identified through GTEx66.
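The two significance tiers described above follow from simple arithmetic: the conventional genome-wide threshold divided by the number of GWAS gives the study-wide threshold. The short sketch below reproduces that calculation; the function name `significance_tier` and the tier labels are hypothetical.

```python
GENOME_WIDE_THRESHOLD = 5e-8   # conventional genome-wide significance
NUMBER_OF_GWAS = 18            # traits analysed in the study

# Bonferroni correction across the 18 GWAS
STUDY_WIDE_THRESHOLD = GENOME_WIDE_THRESHOLD / NUMBER_OF_GWAS


def significance_tier(p_value: float) -> str:
    """Classify an association P value into the two reporting tiers."""
    if p_value < STUDY_WIDE_THRESHOLD:
        return "study-wide"        # primary discovery
    if p_value < GENOME_WIDE_THRESHOLD:
        return "genome-wide"       # reported as supplementary information
    return "not significant"


print(f"{STUDY_WIDE_THRESHOLD:.3g}")   # about 2.78e-09
print(significance_tier(1e-8))
```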
We also conducted GWAS for all variants in the participants, including those with an MAF of less than 0.01. However, no low-frequency or rare variants reached the genome-wide significance threshold. Summary statistics for all variants across the genome are publicly available on our website, which will enable future meta-analyses to explore and utilize low-frequency and rare variants.
Family-based GWAS analyses were performed using SAIGE, with principal components obtained via PCA in related samples (PC-AiR) and the full genetic relationship matrix (GRM) instead of the sparse GRM. An online interface for conducting meta-analysis of the GWAS traits is provided on our project website (http://gdbig.bigcs.com.cn/gwas_meta_analysis/jobs.html).
Replication of GWAS loci
For replication purposes, we compared variants meeting the genome-wide significance threshold (P < 5 × 10−8) with an independent study encompassing six maternal traits (OGTT0H, OGTT1H, OGTT2H, total cholesterol, triglyceride and TBA) involving 21,022 to 26,103 Chinese pregnancies with NIPT sequencing data (average sequencing depth ~0.1x per sample)22. Genotypes in this NIPT study were imputed using STITCH (version 1.2.7), achieving an average imputation accuracy of 0.89 for a total of 8.16 million variants with info score ≥ 0.4 and MAF ≥ 0.01. We conducted genome-wide association analysis in this replication cohort using PLINK2.0, with gestational week, maternal age, BMI, and the top five principal components as covariates to account for population stratification. The pre-defined criteria for replication required a P value below 0.0083 (0.05 divided by the 6 loci tested) in the replication cohort22, along with a concordant direction of effect.
Haplotype genetic score and intergenerational Mendelian randomization analysis
We obtained SNPs related to adult height, BMI, FPG, and blood pressure from a prior study48, totalling 2,130 SNPs for height, 628 SNPs for BMI, 22 SNPs for FPG, and 831 SNPs for blood pressure. We converted these SNP coordinates from human genome GRCh37 (hg19) to human genome GRCh38 assembly using the CrossMap software67. After filtering out variants failing the conversion and those not present in the BIGCS variant dataset, we retained 2,007, 603, 19 and 759 SNPs for height, BMI, FPG and blood pressure, respectively.
Additionally, we leveraged GWAS results from the abovementioned NIPT study22 for maternal height, BMI, FPG, TBA, triglyceride and total cholesterol. For loci for maternal height, BMI and FPG that were also present in the NIPT Chinese GWAS, we used the beta values, standard errors, P values and allele frequencies from the NIPT GWAS. The NIPT data also provided additional instrumental variables for maternal height, BMI and FPG. In total, we obtained 2,033, 617, 30, 759, 6, 45 and 28 SNPs for maternal height, BMI, FPG, blood pressure, TBA, triglyceride and total cholesterol, respectively (Supplementary Tables 14–20).
We performed an intergenerational Mendelian randomization analysis using all instrumental variables, with primary results presented in Supplementary Table 12 under sheet ‘MR_ratio-estimate_raw’, Extended Data Table 3 and Extended Data Fig. 6. We also conducted a secondary analysis incorporating restrictions on the SNPs used to construct the haplotype genetic score and presented the results in Supplementary Table 12 under sheet ‘MR_ratio-estimate_restrictIV’. A comprehensive rationale for and interpretation of the intergenerational Mendelian randomization analysis, including the primary and secondary analyses, are detailed in the Supplementary Notes.
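To make the idea of haplotype genetic scores concrete, the sketch below builds a weighted score for each phased haplotype (h1, maternal transmitted; h2, maternal non-transmitted; h3, paternal transmitted) and the two linear combinations that correspond to the maternal and fetal genotype scores. The exact estimator used in the study is described in its Supplementary Notes and is not reproduced here; the function names, the 0/1 allele encoding and the use of a simple regression slope are assumptions for illustration. Because h2 is not inherited by the fetus, regressing a birth outcome on the h2 score is one commonly used way to probe a maternal effect that does not act through fetal genotype.

```python
from typing import Dict, Sequence


def haplotype_score(alleles: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted count of effect alleles (0 or 1 per SNP) on one phased haplotype."""
    if len(alleles) != len(weights):
        raise ValueError("alleles and weights must have the same length")
    return sum(a * w for a, w in zip(alleles, weights))


def family_scores(h1: Sequence[int], h2: Sequence[int], h3: Sequence[int],
                  weights: Sequence[float]) -> Dict[str, float]:
    """Haplotype scores and their linear combinations for one mother-infant pair."""
    s1, s2, s3 = (haplotype_score(h, weights) for h in (h1, h2, h3))
    return {
        "h1": s1, "h2": s2, "h3": s3,
        "maternal_genotype": s1 + s2,  # both maternal haplotypes
        "fetal_genotype": s1 + s3,     # both transmitted haplotypes
    }


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


# Toy example with two SNPs and illustrative weights (not study data)
print(family_scores([1, 0], [0, 1], [1, 1], [0.5, 1.0]))
```

If the weights are per-allele effects on the exposure in its natural units, a slope of the outcome on such a score is already expressed per unit of genetically predicted exposure, which is the general logic behind a ratio-type estimate.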
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The release of the raw sequencing data by this work is approved by The Ministry of Science and Technology of the People’s Republic of China (permission number 2022BAT2230) at the National Genomics Data Center (https://ngdc.cncb.ac.cn) (accession number HRA002496). Data can be accessed by application, following the GSA guide (https://ngdc.cncb.ac.cn/gsa-human/document). Access is granted for academic research use only. Previously published genotype data for ancient individuals were reported by the Reich laboratory in the Allen Ancient DNA Resource (https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data, version 54.1). Researchers who are interested in collaborating with the BIGCS group are welcome to contact X.Q. or data.bigcs@bigcs.org.
Code availability
The ilus (code for BIGCS data variant calling) and GDBIGtools are available on GitHub at the following links: ilus: https://github.com/ShujiaHuang/ilus. GDBIGtools: https://github.com/BIGCS-Lab/GDBIGtools. Script for distinguishing the parental haplotype alleles from infant genotype and calculation of the genotype/haplotype-based PRS: https://github.com/ShujiaHuang/genotools/blob/master/scripts/mr.py. Script for detecting age-specific genetic effects on lipid levels among mothers and infants: https://github.com/ShujiaHuang/genotools/blob/master/scripts/twosamplettest.py. Other software and databases used in this study are publicly available, and the URLs are listed below: SOAPnuke (v1.5.6): https://github.com/BGI-flexlab/SOAPnuke. BWA-MEM (v0.7.17): https://github.com/lh3/bwa. verifyBamID2 (v1.0.6): https://github.com/Griffan/VerifyBamID. GATK (v4.1.8.1): https://github.com/broadgsa/gatk/. SAMtools (v1.9): http://samtools.github.io/. BCFtools (v1.9): https://samtools.github.io/bcftools/bcftools.html. bedtools (v2.27.1-65-gc2af1e7-dirty): https://github.com/arq5x/bedtools2/. Variant Effect Predictor (release 95): https://github.com/Ensembl/ensembl-vep. Beagle (v4.0): https://faculty.washington.edu/browning/beagle/beagle.r1399.jar. Minimac3 (v2.0.1): http://genome.sph.umich.edu/wiki/Minimac3. AdmixTools (v7.0.2): https://github.com/DReichLab/AdmixTools. MSMC2 (v2.1.1): https://github.com/stschiff/msmc2. CrossMap (version 0.2.2): http://crossmap.sourceforge.net/. dbSNP Build 154: http://www.ncbi.nlm.nih.gov/SNP/. GATK bundle (hg38): https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0. Human genome reference (GRCh38/hg38): ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz. The low-complexity regions of GRCh38: https://github.com/lh3/varcmp/blob/master/scripts/LCR-hs38.bed.gz.
The 1000 Genome Project: https://www.internationalgenome.org/. The GWAS Catalogue: https://www.ebi.ac.uk/gwas/. The Human Protein Atlas: https://www.proteinatlas.org/. The public GWAS SNPs used in constructing genotype-based PRS and haplotype-based PRS: https://doi.org/10.1371/journal.pmed.1003305.s003. We used Python (version 3.7.6) and R (version 4.1.1) extensively to analyse data and create plots. The Venn and admixture plots were created by using a Python library: https://github.com/ShujiaHuang/geneview. Supplementary Figs. 17a and 20 were created using: https://gtexportal.org/. Fig. 4d was created using: https://popgen.uchicago.edu/ggv/.
References
Manolio, T. A., Bailey-Wilson, J. E. & Collins, F. S. Genes, environment and the value of prospective cohort studies. Nat. Rev. Genet. 7, 812–820 (2006).
Qiu, X. et al. The Born in Guangzhou Cohort Study (BIGCS). Eur. J. Epidemiol. 32, 337–346 (2017).
Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Denny, J. C. et al. The ‘all of us’ research program. N. Engl. J. Med. 381, 668–676 (2019).
Barker, D. J. P. The fetal and infant origins of adult disease. Br. Med. J. 301, 1111 (1990).
Gaillard, R. & Jaddoe, V. W. V. Maternal cardiovascular disorders before and during pregnancy and offspring cardiovascular risk across the life course. Nat. Rev. Cardiol. 20, 617–630 (2023).
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).
Fraser, A. et al. Cohort profile: the Avon Longitudinal Study of Parents and Children: ALSPAC mothers cohort. Int. J. Epidemiol. 42, 97–110 (2013).
Magnus, P. et al. Cohort profile update: the Norwegian Mother and Child Cohort Study (MoBa). Int. J. Epidemiol. 45, 382–388 (2016).
Ernst, A. et al. Cohort profile: the puberty cohort in the Danish National Birth Cohort (DNBC). Int. J. Epidemiol. 49, 373–374 (2020).
Kooijman, M. N. et al. The Generation R Study: design and cohort update 2017. Eur. J. Epidemiol. 31, 1243–1264 (2016).
Middeldorp, C. M., Felix, J. F., Mahajan, A. & McCarthy, M. I. The Early Growth Genetics (EGG) and EArly Genetics and Lifecourse Epidemiology (EAGLE) consortia: design, results and future prospects. Eur. J. Epidemiol. 34, 279–300 (2019).
Metzger, B. E. et al. Hyperglycemia and adverse pregnancy outcomes. N. Engl. J. Med. 358, 1991–2002 (2008).
Kishi, R. et al. Birth Cohort Consortium of Asia: current and future perspectives. Epidemiology 28, S19–S34 (2017).
Tao, F. B. et al. Cohort profile: the China–Anhui Birth Cohort Study. Int. J. Epidemiol. 42, 709–721 (2013).
Hu, Z. B. et al. Profile of China National Birth Cohort. Chinese J. Epidemiol. 42, 569–574 (2021).
Yue, W. et al. The China Birth Cohort Study (CBCS). Eur. J. Epidemiol. 37, 295–304 (2022).
Li, Y., Sidore, C., Kang, H. M., Boehnke, M. & Abecasis, G. R. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 21, 940–951 (2011).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Liu, S. et al. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell 175, 347–359.e14 (2018).
Cao, Y. et al. The ChinaMAP analytics of deep whole genome sequences in 10,588 individuals. Cell Res. 30, 717–731 (2020).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Wall, J. D. et al. The GenomeAsia 100 K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Zhang, P. et al. NyuWa Genome resource: a deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep. 37, 110017 (2021).
Cong, P. K. et al. Genomic analyses of 10,376 individuals in the Westlake BioBank for Chinese (WBBC) pilot project. Nat. Commun. 13, 2939–15 (2022).
Mallick, S. et al. The Allen Ancient DNA Resource (AADR): A curated compendium of ancient human genomes. Preprint at bioRxiv https://doi.org/10.1101/2023.04.06.535797 (2023).
Mao, X. et al. The deep population history of northern East Asia from the Late Pleistocene to the Holocene. Cell 184, 3256–3266.e13 (2021).
Yang, M. A. et al. Ancient DNA indicates human population shifts and admixture in northern and southern China. Science 369, 282–288 (2020).
Ning, C. et al. Ancient genomes from northern China suggest links between subsistence changes and human migration. Nat. Commun. 11, 2700 (2020).
Wang, T. et al. Human population history at the crossroads of East and Southeast Asia since 11,000 years ago. Cell 184, 3829–3841.e21 (2021).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Kamat, M. A. et al. PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics 35, 4851–4853 (2019).
Hayes, M. G. et al. Identification of HKDC1 and BACE2 as genes influencing glycemic traits during pregnancy through genome-wide association studies. Diabetes 62, 3282–3291 (2013).
Peng, L. et al. The p.Ser267Phe variant in SLC10A1 is associated with resistance to chronic hepatitis B. Hepatology 61, 1251–1260 (2015).
Ovadia, C. et al. Association of adverse perinatal outcomes of intrahepatic cholestasis of pregnancy with biochemical markers: results of aggregate and individual patient data meta-analyses. Lancet 393, 899–909 (2019).
Warrington, N. M. et al. Maternal and fetal genetic contribution to gestational weight gain. Int. J. Obes. 42, 775–784 (2018).
Safran, M. et al. GeneCards version 3: the human gene integrator. Database 2010, baq020 (2010).
Smith, J. R. et al. The Year of the Rat: the Rat Genome Database at 20: a multi-species knowledgebase and analysis platform. Nucleic Acids Res. 48, D731–D742 (2020).
Marissal-Arvy, N. et al. QTLs influencing carbohydrate and fat choice in a LOU/CxFischer 344 F2 rat population. Obesity 22, 565–575 (2014).
Juliusdottir, T. et al. Distinction between the effects of parental and fetal genomes on fetal growth. Nat. Genet. 53, 1135–1142 (2021).
Han, Z., Lutsiv, O., Mulla, S. & McDonald, S. D. Maternal height and the risk of preterm birth and low birth weight: a systematic review and meta-analyses. J. Obstet. Gynaecol. Canada 34, 721–746 (2012).
Voigt, M. et al. Individualized birth length and head circumference percentile charts based on maternal body weight and height. J. Perinat. Med. 48, 656–664 (2020).
Teng, H. et al. Gestational systolic blood pressure trajectories and risk of adverse maternal and perinatal outcomes in Chinese women. BMC Pregnancy Childbirth 21, 155 (2021).
Chen, J. et al. Dissecting maternal and fetal genetic effects underlying the associations between maternal phenotypes, birth outcomes, and adult phenotypes: a Mendelian-randomization and haplotype-based genetic score analysis in 10,734 mother–infant pairs. PLoS Med. 17, e1003305 (2020).
Baker, H. D. R. Language atlas of China. Bull. Sch. Orient. Afr. Stud. 56, 398–399 (1993).
Chen, Y. et al. SOAPnuke: A MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience 7, gix120 (2018).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Zhang, F. et al. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res. 30, 185–194 (2020).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Browning, B. L. & Yu, Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet. 85, 847–861 (2009).
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Yu, K. et al. Meta-imputation: an efficient method to combine genotype data after imputation with multiple reference panels. Am. J. Hum. Genet. 109, 1007–1015 (2022).
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, https://doi.org/10.1186/s13742-015-0047-8 (2015).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Patterson, N. et al. Ancient admixture in human history. Genetics 192, 1065–1093 (2012).
Wangkumhang, P., Greenfield, M. & Hellenthal, G. An efficient method to identify, date, and describe admixture events using haplotype information. Genome Res. 32, 1553–1564 (2022).
Hellenthal, G. et al. A genetic atlas of human admixture history. Science 343, 747–751 (2014).
Zhou, W. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat. Genet. 52, 634–639 (2020).
Pruim, R. J. et al. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337 (2010).
Lonsdale, J. et al. The Genotype–Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).
Acknowledgements
This study was supported by the Ministry of Science and Technology of the People’s Republic of China (2022YFC2702903, 2022YFC2704601, 2021ZD0200536), the National Natural Science Foundation of China (81673181, 82173525, 31900487, 82003471, 82273642), the Department of Science and Technology of Guangdong Province (2020B1111170001, 2018B030335001, 2019B030301004, 2022B1212010004), Guangdong Basic and Applied Basic Research Foundation (2022B1515120080, 2020A1515110859), Science and Technology Planning Project of Guangdong Province (2019B020227001, 2019B030316014), the Guangzhou Municipal Science and Technology Bureau (202201020656, 202102010254, 202007030002), the Guangzhou Municipal Health Commission (2023A031001), and Shenzhen Basic Research Foundation (20220818100717002). We are grateful to all the participants in the BIGCS project. We thank C. Gao, P. Huang, X. Liu, Y. Hu and all colleagues at GWCMC who have provided invaluable assistance to the BIGCS project; G. Zhang for useful discussions on Mendelian randomization analysis; and W. Lai, L. Wei and S. Liu for professional technical support in setting up the GDBIG website. We thank the Tianhe-2 Supercomputer Center in Guangzhou for computational and storage resources. We also acknowledge the Genotype-Tissue Expression (GTEx) Project for providing the data used in Fig. 4b and Supplementary Figs. 17 and 20. The data used for the analyses described in this manuscript were obtained from the GTEx portal on 7 July 2022.
Author information
Contributions
Conceptualization: X.Q., S.H. and S. Liu. Sample collection and data curation: Y.K., J.L. and X.X. Investigation: S.H., M.H., S. Liu and C.W. Methodology: S.H., S. Liu and Q.F. Formal analysis: S.H., M.H., C.W., S. Liu, T.W., Q.F. and X.F. Visualization: S.H., M.H., C.W. and S. Liu. Software: S.H., M.H. and C.W. Validation: S. Liu, Y. Gu, M.H., S.H., J.L., X.X., Y.K. and J.H. Writing, original draft: S. Liu and S.H. Writing, review and editing: S. Liu, S.H., Q.F., T.W., X.Q., J.H., S. Lin and W.Z. Project administration: X.Q., S.H. and J.H. Supervision: X.Q. and H.X. Resources: X.Q. and H.X.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature thanks Rachel Freathy, Sarah Gagliano Taliun, Chuan-Chao Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Read alignment, variant calling, filtering, and genotype refinement.
A comprehensive description of the quality control and bioinformatics analysis is available in the Methods and Supplementary Notes.
Extended Data Fig. 2 Assessment of imputation accuracy and coverage of imputed variants compared to true set variants from WGS.
For variants present in the true set but absent from the reference panel, an R2 value of zero was assigned. The X-axis represents allele frequency, estimated based on the BIGCS dataset. In cases where a variant was unavailable in BIGCS, its allele frequency was estimated using data from the 50 WGS samples.
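The zero-filling rule described in this caption can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `mean_r2_by_af_bin`, the bin edges and the toy inputs are assumptions made for the example.

```python
from bisect import bisect_right


def mean_r2_by_af_bin(truth_af, imputed_r2, edges):
    """Average imputation R2 per allele-frequency bin.

    truth_af:   dict mapping variant ID -> allele frequency, one entry per
                variant in the true set
    imputed_r2: dict mapping variant ID -> R2, only for variants that are
                present in the reference panel
    edges:      ascending bin edges; bin i covers [edges[i], edges[i + 1]),
                and the last bin also includes its upper edge
    """
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for variant, af in truth_af.items():
        if not edges[0] <= af <= edges[-1]:
            continue  # outside the plotted frequency range
        i = min(bisect_right(edges, af) - 1, n_bins - 1)
        # A variant missing from the reference panel contributes R2 = 0,
        # so accuracy and coverage are summarized in a single number.
        sums[i] += imputed_r2.get(variant, 0.0)
        counts[i] += 1
    # Bins without any true-set variant are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy inputs (hypothetical): "v2" is in the true set but not in the panel.
toy_af = {"v1": 0.005, "v2": 0.008, "v3": 0.2}
toy_r2 = {"v1": 0.8, "v3": 0.9}
toy_result = mean_r2_by_af_bin(toy_af, toy_r2, [0.0, 0.01, 0.05, 0.5])
```

Because absent variants are scored as zero rather than dropped, a panel that misses many rare variants shows a lower mean R2 in the low-frequency bins even if the variants it does contain are imputed well.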
Extended Data Fig. 3 Principal component analysis (PCA) comparing linguistic groups in BIGCS and nine present-day Asian linguistic families in AADR.
(a-b) Geographic distribution of the 4,053 BIGCS participants (a) and all the 836 Chinese samples from the AADR dataset (b). (c-d) PCA was conducted on a merged sample comprising 2,245 present-day unrelated Chinese individuals from the BIGCS dataset, 402 present-day Chinese individuals from the AADR dataset, and 202 present-day Asian groups from the AADR dataset. Each data point on the PCA plot represents one participant, with colors and shapes denoting their linguistic or ethnic groups. In plot (c), nine different shapes were used to represent nine linguistic families. In (d), the shapes remained consistent, with additional colors assigned to each linguistic family to represent linguistic groups. The analysis utilized 258,552 biallelic sites and applied the following pruning and filtration parameters: “--maf 0.01 --geno 0.3 --hwe 1e-6 --vcf-half-call m --indep-pairwise 1000 100 0.9”. The map in panels a and b was sourced from an approved standard map service (http://bzdt.ch.mnr.gov.cn) endorsed by the Ministry of Natural Resources of the People’s Republic of China (GS YUE(2023)1422).
Extended Data Fig. 4 Genetic structure and admixture of participants in the BIGCS study.
Ancestral components were determined for each participant using ADMIXTURE, considering K values ranging from 2 to 5. This analysis included the five linguistic groups with a sample size greater than 50. The optimal number of ancestral components, determined by the smallest cross-validation error, was found to be K = 3. These three ancestral components were visually represented by the colors green, orange, and blue, which corresponded to ancestral components enriched with Cantonese, Min, and Mandarin respectively. Mandarin speakers were further subdivided into seven groups based on their birthplace: Mandarin_SC (South China), Mandarin_SWC (Southwest China), Mandarin_CC (Central China), Mandarin_EC (East China), Mandarin_NWC (Northwest China), Mandarin_NEC (Northeast China), Mandarin_NC (North China).
Extended Data Fig. 5 LocusZoom plots of the remaining 12 loci reaching study-wide association significance (P < 2.78 × 10−9) besides the SLC10A1 locus.
Detailed information about the lead variants is provided in Supplementary Table 9. LD r2 calculations were performed using East Asian populations from the 1KGP dataset, except for the TTC28 and SOAT2 loci. As the lead SNPs for these two loci were absent in the 1KGP dataset, LD r2 was computed using the BIGCS reference panel through the pairwise LD research tool available on the BIGCS website (http://gdbig.bigcs.com.cn/ld/cal.html). The LocusZoom plot illustrating the SLC10A1 locus association is presented in Fig. 4a.
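The study-wide threshold quoted for Extended Data Fig. 5 (P < 2.78 × 10−9) is numerically consistent with a Bonferroni-style division of the conventional genome-wide threshold (5 × 10−8). The divisor of 18 used below is inferred from that arithmetic alone and is an assumption; this section does not state how the threshold was derived.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold
N_TESTS = 18          # ASSUMED divisor, inferred from 5e-8 / 2.78e-9


def study_wide_threshold(genome_wide_p=GENOME_WIDE_P, n_tests=N_TESTS):
    """Bonferroni-style threshold: genome-wide P divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return genome_wide_p / n_tests


print(f"{study_wide_threshold():.2e}")  # prints 2.78e-09
```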
Extended Data Fig. 6 Observed phenotypic associations, estimated effects of parentally transmitted alleles, maternal non-transmitted alleles, maternal causal effect, and fetal genetic effect per one-unit change in maternal phenotypes on birth outcomes.
Measurement units: 1 cm (height), 1 kg/m2 (BMI), 1 mmHg (BP), 1 mmol/L (FPG, TC and TG) and 1 μmol/L (TBA).
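Per-unit effects of this kind are reported as ratio estimates (Supplementary Table 12). A generic ratio (Wald) estimator divides a genetic score's effect on the outcome by its effect on the exposure. The sketch below uses a first-order delta-method standard error and assumes the two effect estimates are uncorrelated; the function name and inputs are illustrative, and this is not necessarily the exact procedure the authors used.

```python
import math


def wald_ratio(beta_outcome, se_outcome, beta_exposure, se_exposure):
    """Ratio estimate of the effect of an exposure on an outcome.

    beta_outcome / se_outcome:   effect of the genetic score on the outcome
                                 (for example, birth weight) and its SE
    beta_exposure / se_exposure: effect of the same score on the exposure
                                 (for example, maternal height) and its SE
    Returns (ratio, standard_error).
    """
    if beta_exposure == 0:
        raise ValueError("the score must be associated with the exposure")
    ratio = beta_outcome / beta_exposure
    # First-order delta method, assuming zero covariance between the two
    # estimates: var(by/bx) ~ se_y^2 / bx^2 + by^2 * se_x^2 / bx^4
    variance = (
        se_outcome ** 2 / beta_exposure ** 2
        + beta_outcome ** 2 * se_exposure ** 2 / beta_exposure ** 4
    )
    return ratio, math.sqrt(variance)
```

The ratio is expressed in outcome units per one unit of the exposure, which is why the caption lists the unit used for each maternal phenotype.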
Supplementary information
Supplementary Information
This document offers comprehensive details on variant calling, population genetic analysis, the assessment of age-specific genetic effects, and intergenerational Mendelian randomization conducted on the BIGCS dataset. It includes Supplementary Notes, Supplementary Figs. 1–21, and a reference guide for Supplementary Tables 1–20.
Supplementary Table 1
Geographic distribution of BIGCS cohort samples investigated in this study. Related to Fig. 1.
Supplementary Table 2
Ethnicity distribution of the BIGCS cohort samples investigated in this study. Related to Fig. 1.
Supplementary Table 3
Summary statistics of sequencing data and variants detected in each BIGCS individual.
Supplementary Table 4
Number of variants and average imputation accuracy for a range of Minimac3-estimated R-squared thresholds and reference panels.
Supplementary Table 5
Evaluation of the mean imputation accuracy of the BIGCS reference panel in comparison with four commonly used reference panels.
Supplementary Table 6
Supplementary Table 6a: Geographic distribution of the Chinese samples from AADR investigated in this study. Related to Extended Data Fig. 3b. Supplementary Table 6b: Information on 56 present-day and ancient Asian groups from AADR used in admixture analysis.
Supplementary Table 7
Supplementary Table 7a: Symmetric f4 test comparing each BIGCS linguistic group with Neolithic and pre-Neolithic northern and southern Asian groups. Supplementary Table 7b: Symmetric f4 test comparing each BIGCS linguistic group with six deep Asian lineages and two present-day Chinese groups. Supplementary Table 7c: Successful qpAdm results for the BIGCS linguistic groups assuming one, two and three sources.
Supplementary Table 8
Summary of Chromopainter and Globetrotter analysis for the five major linguistic groups in BIGCS.
Supplementary Table 9
Supplementary Table 9a: Genome-wide association signals reaching the study-wide significance threshold (P < 2.78 × 10−9). Related to Figs. 3–5. Supplementary Table 9b: Genome-wide association signals reaching the genome-wide significance threshold (P < 5 × 10−8). Related to Figs. 3–5.
Supplementary Table 10
Nearby variants with linkage disequilibrium r2 > 0.2 surrounding the 4-bp deletion (rs3840091) associated with GWG and comparison with a GWG GWAS in the European population. Related to Fig. 3.
Supplementary Table 11
Genome-wide association analysis of the four lipid traits using adult and infant samples jointly with SAIGE.
Supplementary Table 12
Supplementary Table 12a: Ratio estimates for intergenerational Mendelian randomization analysis without restriction of instrumental variables (primary analysis). Supplementary Table 12b: Ratio estimates for intergenerational Mendelian randomization analysis excluding specific instrumental variables (secondary analysis).
Supplementary Table 13
Supplementary Table 13a: Intergenerational Mendelian randomization analysis for normalized maternal trait measurements and normalized fetal growth measurements without restriction of instrumental variables. Supplementary Table 13b: Intergenerational Mendelian randomization analysis for normalized maternal trait measurements and normalized fetal growth measurements excluding specific instrumental variables.
Supplementary Table 14
GWAS SNPs used to calculate the genetic scores for maternal height (POS: GRCh38 coordinate; A: effect allele).
Supplementary Table 15
GWAS SNPs used to calculate the genetic scores for maternal pre-pregnancy BMI (POS: GRCh38 coordinate; A: effect allele).
Supplementary Table 16
GWAS SNPs used to calculate the genetic scores for fasting plasma glucose (FPG) (POS: GRCh38 coordinate; A: effect allele).
Supplementary Table 17
GWAS SNPs used to calculate the genetic scores for blood pressure (POS: GRCh38 coordinate; A: effect allele).
Supplementary Table 18
GWAS SNPs used to calculate the genetic scores for TBA (POS: GRCh38 coordinate; A: effect allele).
Supplementary Table 19
GWAS SNPs used to calculate the genetic scores for triglyceride (POS: GRCh38 coordinate; A: effect allele).
Supplementary Table 20
GWAS SNPs used to calculate the genetic scores for total cholesterol (POS: GRCh38 coordinate; A: effect allele).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Huang, S., Liu, S., Huang, M. et al. The Born in Guangzhou Cohort Study enables generational genetic discoveries. Nature 626, 565–573 (2024). https://doi.org/10.1038/s41586-023-06988-4