Abstract
To provide novel insight regarding the inter-population diversity of loci associated with complex traits, we integrated genome-wide data from UK Biobank (UKB) and 1,000 Genomes Project (1KG) data representative of the genetic diversity among worldwide populations. We investigated genome-wide data of 4,359 traits from 361,194 UKB participants of European descent. Using 1KG data, we explored the allele frequency differences and linkage disequilibrium (LD) structure of UKB genome-wide significant (GWS) loci across worldwide populations. Functional annotation data were used to identify regulatory elements and evaluate the tagging properties of GWS variants. No significant difference was observed in allele frequency between UKB and 1KG GBR (British in England and Scotland). Considering other population groups, we identified genome-wide significant alleles with frequencies different from those expected by chance: UKB vs. 1KG Europeans without GBR (rs74945666; allele=T [0.908 vs. 0.03], standing height pGWAS=1.48×10-17), UKB vs. 1KG African (rs556562; allele=A [0.942 vs. 0.083], platelet count pGWAS=4.84×10-15), UKB vs. 1KG Admixed Americans (rs1812378; allele=T [0.931 vs. 0.089], standing height pGWAS=4.23×10-12), UKB vs. 1KG East Asian (rs55881864; allele=T [0.911 vs. 0.001], monocyte count pGWAS=7.29×10-13), and UKB vs. 1KG South Asian (rs74945666; allele=T [0.908 vs. 0.061], standing height pGWAS=1.48×10-17). LD-structure analysis and computational prediction showed differences in how these alleles tag functional elements across human populations. In conclusion, the human diversity of certain GWS loci appears to be affected by local adaptation, while in other cases the associations may be biased by residual population stratification.
Introduction
Genome-wide association studies (GWAS) are a powerful tool to identify genetic variants associated with human traits and diseases (Visscher et al. 2017). Since the first GWAS conducted in 2005 (Klein et al. 2005), 4,671 GWAS reporting >19,813 associations have been listed in the GWAS Catalog (Buniello et al. 2019) as of August 13, 2020. This unprecedented amount of information has revolutionized our understanding of the predisposition to complex phenotypes, demonstrating that a large portion of the heritability of complex traits resides in common genetic variation (i.e., polymorphisms in the human genome that show a minor allele frequency (MAF) greater than 1%) (Visscher et al. 2017). In recent years, investigations of massive cohorts ranging from 100,000 to more than 1,000,000 participants have become possible because of large collaborative projects combining numerous studies (Colodro-Conde et al. 2017; Kim et al. 2017; Sullivan et al. 2018; Thompson et al. 2014), the availability of biobanks enrolling an unprecedented number of participants (Fan et al. 2008; Kubo and Guest 2017; Sudlow et al. 2015), and collaboration with direct-to-consumer genetic testing companies (Check Hayden 2017). These large-scale GWAS, identifying ever-greater numbers of risk loci with ever-smaller individual effects, demonstrated that the genetic architecture of common diseases is highly polygenic and that their heritability is likely due to the contribution of several thousand (or even more) risk loci across the human genome (Evangelou et al. 2018; Karlsson Linner et al. 2019; Lee et al. 2018; Timmers et al. 2019). One of the main promises of GWAS is that the knowledge gained can be used to develop genetic instruments useful to predict disease risk, treatment response, and disease prognosis. Leveraging data generated by large-scale GWAS, a growing number of studies are developing approaches to test the utility of polygenic information with respect to the human phenotypic spectrum (Inouye et al. 2018; Khera et al. 2019; Sparano et al. 2019; Weigl et al. 2018).
Although these successful experiments strongly support the movement towards the application of GWAS data to develop new strategies to prevent and treat human diseases, important challenges remain. Among them, one of the most pressing is the limited ancestral and ethnic diversity of large-scale GWAS, which has created a large gap between the genetic data available for populations of European descent and those available for non-European human groups (Sirugo et al. 2019). Applying GWAS data generated from European-ancestry cohorts to non-European individuals raises serious issues, including much lower predictive power than that observed in comparisons between like populations (Martin et al. 2019; Mostafavi et al. 2019) and possible biases (e.g., reflecting unaccounted population stratification rather than the phenotype of interest) due to the genetic diversity among human populations (Duncan et al. 2019; Martin et al. 2017). The most reliable solution to this problem is to conduct large-scale GWAS in populations with non-European ancestry. Ongoing efforts such as the Million Veteran Program (Gaziano et al. 2016) and the All of Us Research Program (Sankar and Parker 2017) are investigating multiple ancestry groups representative of the US population to reduce this gap. Although these kinds of projects are expected to eliminate the population disparities in human genetic research, this is likely to be a long-term outcome. In the meantime, to contribute to a more comprehensive understanding of human genetic diversity, we can leverage the data available, combining large-scale genome-wide association datasets generated from cohorts including mainly participants of European descent with reference panels representative of the genetic diversity among worldwide populations (Daub et al. 2013; Hofer et al. 2009; Iorio et al. 2017; Polimanti et al. 2015).
In the present study, we focused our attention on the UK Biobank (UKB). This large cohort includes more than 500,000 participants, approximately 90% of whom are British individuals of European descent (Bycroft et al. 2018). Based on UKB participants of European descent, GWAS have been conducted with respect to the human phenome spectrum, identifying a large number of risk loci surviving the genome-wide significance threshold (p<5×10-8). Using 1,000 Genomes Project (1KG) data, we explored the diversity of these loci, comparing allele frequency differences across worldwide populations. The results showed that allele frequency differences in certain risk loci are significantly different from those expected from randomly selected variants with similar genomic characteristics (i.e., minor allele frequency (MAF), gene density, distance to nearest gene, and linkage disequilibrium (LD) proxies). In some cases, these population differences appear to be due to evolutionary events related to local adaptation (i.e., adaptation in response to selective pressure related to the local environment), while other cases may be related to the residual effect of population stratification in UKB GWAS.
Materials and Methods
UK Biobank
The present study was conducted leveraging UKB genome-wide association data. UKB is a large population-based prospective study designed to explore different life-threatening disorders using information about environment and genes in order to improve diagnosis and treatment (Sudlow et al. 2015). A wide variety of phenotypic information, including socio-demographic and lifestyle factors, electronic health records data, and physiological conditions, has been collected for more than 500,000 UKB participants (Bycroft et al. 2018). The genotypes of the whole cohort were defined by applying a bespoke genome-wide DNA microarray that contains about 850,000 genetic variants (including rare, intermediate, and common variants) (Allen et al. 2014). Genetic data were then used to generate genome-wide association datasets that can be employed to explore the genetics of complex traits. The genome-wide datasets used in the present study were derived from the analysis of 361,194 unrelated British participants of European descent. Genome-wide association analyses for over 4,000 phenotypes were conducted using appropriate regression models available in Hail (available at https://github.com/hail-is/hail), including the first 20 ancestry principal components, sex, age, age², sex × age, and sex × age² as covariates. The principal components included in the regression model were generated by the UKB investigators using the fastPCA algorithm (Galinsky et al. 2016) and considering unrelated subjects and genetic markers pruned for linkage disequilibrium (Bycroft et al. 2018). Details regarding QC criteria, GWAS methods, and the original data are available at https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas.
1000 Genomes Project Phase 3
To dissect the genetic differences of UKB participants with respect to other European samples and other worldwide populations, we used data derived from 1KG Phase 3. The 1KG project aims to provide information about common and rare human genetic variation by applying whole-genome sequencing to a large cohort of individuals derived from different populations (Genomes Project et al. 2010; Genomes Project et al. 2012; Genomes Project et al. 2015). Phase 3 of the project includes data on 2,504 individuals sampled from 26 populations representative of Africa (AFR), East Asia (EAS), Europe (EUR), South Asia (SAS), and the Americas (admixed; AMR) (Genomes Project et al. 2015). Details regarding alignment, mapping algorithm, SNP (single nucleotide polymorphism) calling, and the data of the project are available at https://www.internationalgenome.org/analysis.
Variant filtering and clumping
We considered genetic association results generated from 361,194 UKB participants of European descent tested with respect to 4,359 phenotypic outcomes including physiological, health, and lifestyle conditions (Supplementary File 1). We focused our attention on variants with a GWAS P ≤ 5×10-8 and MAF ≥ 5%. Furthermore, to control the potential inflation in the test statistics, as suggested by the investigators that generated the data (details at http://www.nealelab.is/blog/2017/9/11/details-and-considerations-of-the-uk-biobank-gwas), we selected high-confidence association results generated from variants with at least 25 minor alleles in the smaller of the case and control groups. To find independent association signals among the variants selected, we conducted P-value-informed clumping with an LD cut-off of R2 = 0.1 within a 1000 kb window.
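The filtering and clumping steps described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the `Variant` record, the `filter_variants` and `clump` functions, and the `r2` lookup callable are hypothetical names introduced here, and the sketch assumes that the 1000 kb window is measured as the distance from each index variant.

```python
from typing import Callable, List, NamedTuple


class Variant(NamedTuple):
    rsid: str
    chrom: str
    pos: int     # base-pair position
    pval: float  # GWAS p value
    maf: float   # minor allele frequency in the discovery sample


def filter_variants(variants: List[Variant], p_max: float = 5e-8,
                    maf_min: float = 0.05) -> List[Variant]:
    """Keep genome-wide significant, common variants."""
    return [v for v in variants if v.pval <= p_max and v.maf >= maf_min]


def clump(variants: List[Variant], r2: Callable[[str, str], float],
          r2_max: float = 0.1, window_bp: int = 1_000_000) -> List[Variant]:
    """Greedy p-value-informed clumping (illustrative only).

    Variants are visited from the smallest to the largest p value. A variant
    becomes a new index variant unless it lies within `window_bp` of an
    existing index variant on the same chromosome and is in LD with it
    (r2 above `r2_max`).
    """
    index_variants: List[Variant] = []
    for v in sorted(variants, key=lambda x: x.pval):
        tagged = any(
            v.chrom == i.chrom
            and abs(v.pos - i.pos) <= window_bp
            and r2(v.rsid, i.rsid) > r2_max
            for i in index_variants
        )
        if not tagged:
            index_variants.append(v)
    return index_variants
```

In practice the `r2` callable would be backed by LD estimates from a reference panel; here it is left abstract so that the ordering logic stays visible.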
Allele frequency differences among human populations
We calculated the allele frequency of the index variants identified from the LD-clumping in AFR, EAS, EUR, SAS, and the AMR 1KG superpopulations. Specifically, we tested the following comparisons: i) UKB vs. 1KG GBR (British in England and Scotland) reference sample; ii) UKB vs. 1KG EUR reference panel (excluding GBR sample); iii) UKB vs. each of the non-European 1KG superpopulations (AFR, AMR, EAS, and SAS). For the subsequent analyses, we considered the loci showing allele frequency differences in the top 1% of all index variants investigated with respect to each comparison conducted.
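As a rough illustration of the top 1% selection, the sketch below ranks index variants by the absolute difference between two allele frequency tables and keeps the most differentiated fraction. The use of the absolute difference as the ranking metric, the function name `top_fraction_by_difference`, and the dictionary inputs are assumptions made for this example, since the text does not specify the exact metric.

```python
from typing import Dict


def top_fraction_by_difference(freq_a: Dict[str, float],
                               freq_b: Dict[str, float],
                               fraction: float = 0.01) -> Dict[str, float]:
    """Return the variants with the largest absolute frequency difference.

    `freq_a` and `freq_b` map rsIDs to the frequency of the same allele in
    two samples (e.g., UKB and one 1KG panel). Only shared variants are used.
    """
    diffs = {rsid: abs(freq_a[rsid] - freq_b[rsid])
             for rsid in freq_a if rsid in freq_b}
    if not diffs:
        return {}
    n_keep = max(1, round(fraction * len(diffs)))
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    return {rsid: diffs[rsid] for rsid in ranked[:n_keep]}
```

The function would be applied once per comparison (UKB vs. GBR, UKB vs. EUR without GBR, and UKB vs. each non-European superpopulation), so that each comparison yields its own top 1% set.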
Comparisons with respect to randomly-selected variants matched by genomic characteristics
To verify whether the allele frequency difference of each variant identified was different from that expected by chance, we generated a control set of matched variants using the SNPsnap tool (Pers et al. 2015). This permitted us to identify sets of randomly selected SNPs matched to the index variants on the basis of four genomic characteristics: i) MAF, ii) LD proxies, iii) distance to nearest gene, and iv) gene density. Thus, variants identified in the first percentile were used as inputs considering the following parameters: 1KG EUR population (which is the closest reference panel among those available in SNPsnap); LD distance cut-off of R2=0.5; MAF deviation of ±5 percentage points; ±50% relative deviation of gene density; ±50% relative deviation of the distance to nearest gene; ±50% relative deviation of LD proxies. For each index variant identified in the initial screening described in the section above, we extracted up to 10,000 matched SNPs, excluding the HLA region due to its complex LD structure. Based on the corresponding randomly selected, genomically matched sets, we calculated empirical p values for each index variant tested and considered a type I error rate of 1% as the significance threshold. Finally, we checked whether the significant index variants showed allele frequency mismatches and mismapping using previously generated data available at http://kunertgraf.com/data/biobank.html (Kunert-Graf et al. 2020).
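One common way to turn a matched control set into an empirical p value is sketched below. The exact estimator used in the study is not stated, so the "(r + 1) / (n + 1)" form, the function name `empirical_p`, and the choice of comparing allele frequency differences are assumptions of this example.

```python
from typing import Sequence


def empirical_p(observed_diff: float, matched_diffs: Sequence[float]) -> float:
    """Empirical p value of an index variant against its matched variants.

    Counts how many matched variants show an allele frequency difference at
    least as large as the index variant, with a +1 correction so that the
    estimate is never exactly zero.
    """
    if len(matched_diffs) == 0:
        raise ValueError("at least one matched variant is required")
    n_extreme = sum(1 for d in matched_diffs if d >= observed_diff)
    return (n_extreme + 1) / (len(matched_diffs) + 1)


def is_significant(p_value: float, alpha: float = 0.01) -> bool:
    """Apply the 1% type I error threshold described in the text."""
    return p_value < alpha
```

Under this estimator, a set of 10,000 matched SNPs bounds the smallest attainable p value at roughly 1×10-4, which is compatible with a 1% threshold.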
Cross-Ancestry LD comparison and Functional Annotation
For the index variants with empirical p values surviving statistical significance, we conducted computational analyses to explore their functional consequences. Using LDlink (Machiela and Chanock 2015, 2018), we tested the effect of the LD structure variability across human populations on the ability of differentiated index variants to tag (measured as LD R2) functional variants in the surrounding regions (±500Kb). RegulomeDB (Boyle et al. 2012) was used to score the regulatory effect of the tagged variants on the basis of high-throughput, experimental data sets as well as computational predictions and manual annotations. LD R2>0.50 and RegulomeDB score = 1a-f (Supplementary File 2) were used as criteria to identify functional tag SNPs.
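To make the tagging criterion concrete, the sketch below computes LD R2 between two biallelic variants from phased haplotypes and applies the two thresholds given above (R2>0.50 and a RegulomeDB score between 1a and 1f). The study obtained LD values from LDlink; the functions `ld_r2` and `is_functional_tag` and the 0/1 haplotype encoding are hypothetical and serve only to show the calculation.

```python
from typing import Sequence

# RegulomeDB categories treated as functional in the text (1a to 1f).
FUNCTIONAL_SCORES = {"1a", "1b", "1c", "1d", "1e", "1f"}


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation between two variants from phased haplotypes.

    Each sequence holds one 0/1 allele code per haplotype, in the same
    haplotype order for both variants.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one variant is monomorphic in this sample
    d = p_ab - p_a * p_b  # LD coefficient D
    return d * d / denom


def is_functional_tag(r2: float, regulomedb_score: str,
                      r2_min: float = 0.5) -> bool:
    """True when an index variant tags a variant scored as functional."""
    return r2 > r2_min and regulomedb_score in FUNCTIONAL_SCORES
```

Running `ld_r2` on haplotypes from two different populations for the same variant pair shows how one index variant can pass the criterion in one group and fail it in another.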
Enrichment analysis for significant phenotypic traits
To test whether traits related to differentiated loci were overrepresented with respect to certain phenotypic domains, we performed a χ2 test comparing whether the proportions of the phenotypic distribution observed with respect to the identified loci are significantly different from those of the overall distribution observed across the 4,000+ UKB phenotypes analyzed.
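The enrichment test can be read as a goodness-of-fit comparison between the category counts of the differentiated loci and the proportions seen across all phenotypes. The sketch below follows that reading, which is an assumption; the function name `chi_square_gof` and the category labels used with it are hypothetical.

```python
from typing import Dict, Tuple


def chi_square_gof(observed_counts: Dict[str, int],
                   background_counts: Dict[str, int]) -> Tuple[float, int]:
    """Chi-square goodness-of-fit statistic and degrees of freedom.

    Expected counts are obtained by applying the background category
    proportions to the total number of observed traits.
    """
    categories = list(background_counts)
    n_obs = sum(observed_counts.get(c, 0) for c in categories)
    n_bg = sum(background_counts.values())
    statistic = 0.0
    for c in categories:
        expected = n_obs * background_counts[c] / n_bg
        if expected <= 0:
            raise ValueError(f"category {c!r} has no expected count")
        statistic += (observed_counts.get(c, 0) - expected) ** 2 / expected
    return statistic, len(categories) - 1
```

The statistic and degrees of freedom can then be converted to a p value with any chi-square survival function, for example `scipy.stats.chi2.sf(statistic, df)` if SciPy is available.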
Pan-UK Biobank data
To investigate the loci identified in non-European ancestral groups, we used the newly released Pan-UKB genome-wide association statistics related to 7,221 phenotypes, generated from 6,636 AFR individuals, 980 AMR individuals, 8,876 individuals of Central/South Asian ancestry (CSA), and 2,709 EAS individuals. A detailed description of the methods used to generate these data is available at https://pan.ukbb.broadinstitute.org/. Using these data, we investigated whether the EUR associations of the index variants were concordant in AFR, AMR, CSA, and EAS. Pan-UKB data are available at https://pan.ukbb.broadinstitute.org/downloads.
Results
Based on genome-wide significant associations (P ≤ 5×10-8) across the UKB phenotypic spectrum assessed (4,359 traits), we identified a total of 15,327 LD-independent risk alleles. Among these, we identified 154 index variants showing allele frequency differences in the top 1% with respect to the three comparisons conducted: i) UKB vs. 1KG GBR; ii) UKB vs. 1KG EUR (excluding GBR sample); iii) UKB vs. each of the non-European 1KG superpopulations (AFR, AMR, EAS, and SAS) (Figure 1; Supplementary File 3). To test whether the allele frequency differences were significantly different from those expected by chance, we generated a control set of up to 10,000 variants matched by genomic characteristics (i.e., MAF, gene density, distance to the nearest gene, and the number of LD proxies) for each of the index variants (Supplementary File 4). For all significant index variants, we reported their phenotypic associations and those related to the variants in LD with them in Supplementary File 5. In line with the fact that both samples are representative of the genetic variability of British populations, no significant difference was observed in the allele frequency of index variants between the UKB cohort and the 1KG GBR panel (Supplementary File 4). Conversely, when comparing UKB with other population groups, allele frequency differences were observed in loci associated with several traits. The differentiated loci appear to be associated mainly with anthropometric traits and hematologic parameters. Across multiple population comparisons, we observed that the phenotypic enrichments were significantly different from those expected by chance (2.75×10-79<p<5.39×10-7; Figure 2).
UKB British participants vs. 1,000 Genomes Project non-British Europeans
Considering UKB vs. 1KG EUR reference sample (excluding GBR sample), we identified several significantly differentiated risk loci associated with different traits (Table 1; Figure 2). Among them, we observed some traits related to anthropometric measurements and to white blood cell and platelet parameters: standing height (rs74945666, allele=T [0.909 vs. 0.030], p=1.48×10-17); heel broadband ultrasound attenuation, direct entry (rs200033476, allele=C [0.917 vs. 0.018], p=4.01×10-9); immature reticulocyte fraction (rs34690548, allele=CA [0.904 vs. 0.0303], p=2.85×10-320); eosinophil percentage (rs200725444, allele=A [0.927 vs. 0.047], p=8.60×10-14); platelet count (rs201088941, allele=TA [0.922 vs. 0.029], p=6.72×10-10). Because of the similar LD structure, we did not observe differences between UKB and EURnoGBR populations with respect to the ability of the index variants to tag functional elements (Supplementary File 6).
UKB British participants vs. 1,000 Genomes Project Africans
Comparing UKB with the 1KG AFR superpopulation, we identified loci associated with several conditions (Table 1, Figure 2, Supplementary File 5). Particularly, these were related to anthropometric traits: leg impedance (rs3749748, allele=T [0.944 vs. 0.017], p=1.75×10-123); standing height (rs157573, allele=A [0.876 vs. 0.015], p=8.06×10-33; rs35497246, allele=C [0.058 vs. 0.893], p=2.53×10-13; rs1812378, allele=T [0.931 vs. 0.031], p=4.23×10-12; rs42525, allele=C [0.877 vs. 0.021], p=1.90×10-10; rs625670, allele=A [0.998 vs. 0.002], p=4.29×10-8); arm impedance (rs1881131, allele=A [0.942 vs. 0.031], p=3.34×10-16); whole body water mass (rs475591, allele=T [0.998 vs. 0.110], p=1.56×10-12); heel broadband ultrasound attenuation, direct entry (rs200033476, allele=C [0.917 vs. 0.046], p=4.01×10-9). Additionally, we observed several associations with hematologic parameters: lymphocyte count (rs451367, allele=T [0.950 vs. 0.014], p=1.08×10-27; rs3748022, allele=T, p=9.19×10-13); immature reticulocyte fraction (rs603620, allele=A [0.948 vs. 0.019], p=1.55×10-27); monocyte percentage (rs456798, allele=T [0.059 vs. 0.918], p=3.70×10-22; rs625465, allele=G [0.932 vs. 0.082], p=2.27×10-8); mean platelet (thrombocyte) volume (rs171042; p=3.97×10-18); red blood cell (erythrocyte) distribution width (rs374361, allele=T [0.900 vs. 0.012], p=3.03×10-15; rs55959450, allele=C [0.947 vs. 0.013], p=1.84×10-9); platelet count (Phesant:30080_irnt; rs556562, allele=A [0.943 vs. 0.084], p=4.84×10-15); eosinophil percentage (rs200725444, allele=A [0.927 vs. 0.007], p=8.60×10-14); monocyte count (rs55881864, allele=T [0.912 vs. 0.006], p=7.29×10-13); reticulocyte percentage (rs57236847, allele=G [0.891 vs. 0.037], p=2.53×10-8). We also observed other traits that are related to anthropometric and hematologic phenotypes: palmar fascial fibromatosis (rs651985, allele=G [0.903 vs. 0.044], p=2.29×10-42); systolic blood pressure, automated reading (rs604723, allele=T [0.899 vs. 0.008], p=6.73×10-40; rs55815739, allele=A, p=2.23×10-8); 6mm asymmetry index (rs55971426, allele=G [0.887 vs. 0.007], p=1.60×10-10).
Regarding the cross-ancestry LD analysis, we observed that several index variants showed different tagging properties with respect to functional elements. Indeed, while only rs3749748 tags functional elements in both populations (Supplementary File 7, Supplementary File 8-Figure S8.1), several index variants (i.e., rs157573, rs451367, rs475591, rs625465) are in LD with functional loci in EUR populations but not in AFR populations (Supplementary File 7, Supplementary File 8-Figure S8.2-5).
UKB British participants vs. 1,000 Genomes Project Admixed Americans
The allele frequency differences between UKB and 1KG AMR are related to loci mainly associated with hematologic traits (Figure 2; Table 1): immature reticulocyte fraction (rs34690548, allele=CA [0.904 vs. 0.052], p=2.85×10-320); reticulocyte percentage (rs321600, allele=A [0.997 vs. 0.123], p=1.54×10-30); lymphocyte count (rs451367, allele=T [0.950 vs. 0.091], p=1.08×10-27); neutrophil count (rs571497, allele=A [0.943 vs. 0.109], p=2.58×10-23; rs4544340, allele=T [0.949 vs. 0.087], p=1.86×10-10); mean platelet (thrombocyte) volume (rs171042, allele=T [0.951 vs. 0.127], p=3.97×10-18); eosinophil percentage (rs200725444, allele=A [0.927 vs. 0.055], p=8.60×10-14); platelet distribution width (rs1875103, allele=T [0.998 vs. 0.075], p=4.43×10-12); red blood cell (erythrocyte) distribution width (rs55959450, allele=C [0.947 vs. 0.074], p=1.84×10-9); standing height (rs1812378, allele=T [0.931 vs. 0.089], p=4.23×10-12); heel broadband ultrasound attenuation, direct entry (rs200033476, allele=C [0.917 vs. 0.012], p=4.01×10-9); systolic blood pressure, automated reading (rs55815739, allele=A [0.928 vs. 0.052], p=2.23×10-8). Comparing the LD structure of UKB and AMR populations, we observed two variants (rs571497, rs4544340) tagging functional elements in both populations (Supplementary File 9, Supplementary File 10-Figure S10.1-2). Conversely, rs451367, associated with lymphocyte count, is in LD (R2=0.61) with a functional SNP (rs4808485; RegulomeDB=1a) in British individuals but not in AMR populations (Supplementary File 9; Supplementary File 10-Figure S10.3).
UKB British participants vs. 1,000 Genomes Project East Asians
We observed allele frequency differences between UKB and 1KG EAS in loci associated with hematologic parameters (Table 1, Figure 2): monocyte count (rs3732378, allele=A [0.941 vs. 0.029], p=1.64×10-67; rs55881864, allele=T [0.912 vs. 0.001], p=7.29×10-13); immature reticulocyte fraction (rs6014986, allele=A [0.911 vs. 0.016], p=3.05×10-43; rs603620, allele=A [0.948 vs. 0.003], p=1.55×10-27); eosinophil percentage (rs34495, allele=T [0.916 vs. 0.043], p=1.14×10-15; rs200725444, allele=A [0.927 vs. 0.024], p=8.60×10-14); reticulocyte percentage (rs321600, allele=A [0.997 vs. 0.133], p=1.54×10-30); neutrophil count (rs571497, allele=A [0.943 vs. 0.001], p=2.58×10-23); platelet distribution width (rs1875103, allele=T [0.999 vs. 0], p=4.43×10-12); platelet crit (rs9932254, allele=C [0.925 vs. 0.025], p=8.55×10-11); red blood cell (erythrocyte) distribution width (rs55959450, allele=C [0.947 vs. 0.003], p=1.84×10-9). Similarly to the other ancestry comparisons, several UKB-EAS differentiated loci are associated with anthropometric traits: arm impedance (rs1881131, allele=A [0.943 vs. 0.002], p=3.34×10-16); whole body water mass (allele=T [0.998 vs. 0.086], p=1.56×10-12); standing height (rs2861745, allele=G [0.876 vs. 0], p=8.19×10-11; rs42525, allele=C [0.877 vs. 0.024], p=1.90×10-10; rs625670, allele=A [0.998 vs. 0], p=4.29×10-8); heel broadband ultrasound attenuation (rs200033476, allele=C [0.917 vs. 0.002], p=4.01×10-9). Finally, additional variants showing allele frequency differences were related to: palmar fascial fibromatosis (rs651985, allele=G [0.903 vs. 0], p=2.29×10-42; rs55971426, allele=G [0.887 vs. 0.018], p=1.60×10-10); spherical power (rs56207218, allele=C [0.896 vs. 0.009], p=1.23×10-9); systolic blood pressure, automated reading (rs55815739, allele=A [0.923 vs. 0.003], p=2.23×10-8).
Comparing UKB and EAS LD structures, we observed that certain index variants tag different functional SNPs depending on the population considered (Supplementary File 11; Supplementary File 12-Figure S12.1-3). Conversely, rs571497 and rs56207218, associated with neutrophil count and spherical power, respectively, are in LD (R2>0.5) with functional elements in both populations (Supplementary File 11; Supplementary File 12-Figure S12.4-5).
UK Biobank British participants vs. 1KG South Asians
Similarly to what was observed in the other ancestry comparisons, allele frequency differences between UKB and SAS were observed in variants associated with anthropometric traits and hematologic parameters (Figure 2; Table 1). These included immature reticulocyte fraction (rs34690548, allele=CA [0.904 vs. 0.025], p=2.85×10-320); standing height (rs74945666, allele=T [0.909 vs. 0.061], p=1.48×10-17); eosinophil percentage (rs200725444, allele=A [0.927 vs. 0.030], p=8.60×10-14); heel broadband ultrasound attenuation (rs200033476, allele=C [0.917 vs. 0.013], p=4.01×10-9). The UKB-SAS differentiated loci did not show evidence of regulatory function or tagging of regulatory SNPs in either of the two populations (Supplementary File 13).
Cross-ancestry association analysis in non-European UK Biobank participants
Considering Pan-UK Biobank data related to non-European populations, we tested whether the differentiated variants and their functional tagged SNPs were associated with their related phenotypic traits in AFR, AMR, EAS, and CSA participants from UKB. Likely because of the dramatic difference in sample size between UKB participants of European descent (N=361,194) and UKB participants of non-European descent (980<N<8,876), only two variants differentiated between UKB and AFR were nominally replicated in UKB-AFR participants with respect to their related conditions (rs171042, mean platelet volume; rs374361, red blood cell distribution width) (Supplementary File 14).
Discussion
To provide a more comprehensive understanding of the genetics of complex traits across worldwide populations, we assessed loci that are associated with complex traits in UKB participants of European descent and that present allele frequency differences in other human groups, leveraging 1KG reference data (Rees et al. 2020). As expected, there was no significant difference in the allele frequency of index variants between the UKB cohort and the 1KG GBR population, confirming that both samples are representative of the genetic structure of the British population. Conversely, certain loci associated with complex traits in UKB participants of European descent showed allele frequency differences significantly different from those expected by chance when compared with non-British European populations (EURnoGBR) and with AFR, AMR, EAS, and SAS ancestries. Comparing the LD structure across these human groups, we observed that differentiated loci can tag regulatory elements differently, changing the functional meaning of genome-wide significant variants observed in UKB participants of European descent when analyzed in the context of other ancestral groups.
Considering the traits related to differentiated loci, we observed significant overrepresentation of anthropometric traits and hematologic parameters across multiple ancestry comparisons (2.75×10-79<p<5.39×10-7; Figure 2). These phenotypic categories are well known to be differentiated across human populations due to evolutionary pressures and human demographic history (Guo et al. 2018).
Several studies investigated the underlying mechanisms that shaped the genetic architecture of anthropometric measures among human populations (Berg et al. 2019; Guo et al. 2018; Park et al. 2016; Polimanti et al. 2016; Turchin et al. 2012; Wood et al. 2014). In particular, gradients were observed in polygenic height scores within European populations (north to south) and across Eurasia (east to west) (Berg et al. 2019; Turchin et al. 2012). Several hypotheses have been proposed regarding the presence of evolutionary pressures shaping the genetic architecture of height and other anthropometric traits (Guo et al. 2018). However, Sohail et al. (2019) demonstrated that the signature of polygenic adaptation on height is overestimated due to uncorrected stratification in GWAS. Comparing results obtained from UKB and the GIANT (Genetic Investigation of ANthropometric Traits) consortium, population-level differences in genetic height showed robust evidence only at highly significant SNPs, while variants with less significant P values were affected by residual population stratification. The findings provided by Sohail et al. (2019) indicate that previous analyses cannot distinguish the proportion of the population differences in genetic height due to evolutionary pressures from that due to population stratification biases. In our analyses, we considered genome-wide significant variants (p<5×10-8) identified from UKB participants of European descent. In line with the study of Sohail et al. (2019), we expect that the variants investigated in the present study are less affected by population stratification. Accordingly, the observation that loci differentiated between UKB and 1KG reference populations (data independent from UKB) are enriched for anthropometric traits may support the involvement of evolutionary pressure and population demographic history in shaping the genetic architecture of anthropometric traits.
The second strong enrichment for loci differentiated between UKB and worldwide populations is related to hematologic parameters, including traits related to red blood cells (RBC), white blood cells (WBC), and platelets. Similarly to anthropometric traits, several studies assessed the genetic variation of hematologic traits across human populations, observing strong differences in their geographic distribution (Beutler and West 2005; Chambers et al. 2009; Chen et al. 2020; Eicher et al. 2016; Ganesh et al. 2009; Hodonsky et al. 2017; Kamatani et al. 2010; Rappoport et al. 2019; Schick et al. 2016). This inter-population genetic variability is probably linked to the evolutionary pressures of infectious diseases (Astle et al. 2016; Dominguez-Andres and Netea 2019; Polimanti et al. 2016; Raffield et al. 2018). Risk alleles associated with RBC traits show high frequencies in African malaria-prone regions where there is a high prevalence of anemia and microcytosis (Barrera-Reyes and Tejero 2019; Dominguez-Andres and Netea 2019; Raffield et al. 2018). WBC- and platelet-associated loci also appear differentiated across human populations (Chen et al. 2020; Eicher et al. 2016; Rappoport et al. 2019; Schick et al. 2016). Examples of this are i) the Duffy/DARC null variant in AFR individuals, which is associated with low WBC and neutrophil counts and confers a selective advantage against malaria (Rappoport et al. 2019); ii) GATA2 genetic variation that reflects differences in eosinophil and basophil counts in the Japanese population and in monocyte and basophil counts in Europeans (Okada and Kamatani 2012); and iii) the presence of population-specific risk variants that could partially account for the differences in platelet counts observed in Hispanics/Latinos with respect to other human populations (Schick et al. 2016).
Finally, differentiated loci that showed genome-wide significant associations in UKB participants of European descent were not replicated in non-European UKB participants (i.e., AFR, AMR, EAS, and CSA), independently of their tagging of functional elements across populations. Given the dramatic change in sample size (N=361,194 vs. 980<N<8,876), this lack of replication is likely due to the strong reduction of statistical power in the non-European association analyses. Unfortunately, this is in line with the well-known issue related to the lack of non-European genome-wide data (Sirugo et al. 2019). As mentioned previously, many factors influence how causal variants are captured by tagging SNPs identified in a single population (Rees et al. 2020; Sirugo et al. 2019). We showed that loci associated with complex traits and differentiated across human populations can show different cross-ancestry LD tagging properties that can affect the functional meaning of the variant tested in the context of the ancestry group investigated. Thus, a large amount of genetic data from diverse populations is needed to provide a more comprehensive understanding of the molecular mechanisms at the basis of complex diseases.
In conclusion, this study provided novel evidence regarding the predisposition to complex traits in the context of human genetic variation. We observed that differentiated loci are enriched for traits that may be shaped by human evolutionary history (i.e., anthropometric traits and hematologic parameters). Additionally, we showed how the LD structure of human populations can affect the functional meaning of loci known to be associated with complex traits in a specific ancestry group. Finally, although our data contribute to increasing our knowledge regarding cross-ancestry genetic predisposition to complex traits, they also clearly indicate that there is an urgent need for greater population diversity in genome-wide studies.
Conflict of interest
The authors reported no biomedical financial interests or potential conflicts of interest.
Ethical approval
This study was conducted using summary association data generated by previous studies. Owing to the use of previously collected, deidentified, aggregated data, this study did not require institutional review board approval.
Data Availability
Data supporting the findings of this study are available within this article and its additional files. UK Biobank GWAS summary association data are available at https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas
Acknowledgments
We thank the participants and investigators of the UK Biobank, the Neale lab for generating the genome-wide data used in the present study, and Dr. Riccardo Pennacchi for the computational support. R.P. acknowledges support from the National Institutes of Health via R21 DA047527 and R21 DC018098. The sources of funding had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.