  • Progress
New approaches to population stratification in genome-wide association studies

Abstract

Genome-wide association (GWA) studies are an effective approach for identifying genetic variants associated with disease risk. GWA studies can be confounded by population stratification — systematic ancestry differences between cases and controls — which has previously been addressed by methods that infer genetic ancestry. Those methods perform well in data sets in which population structure is the only kind of structure present but are inadequate in data sets that also contain family structure or cryptic relatedness. Here, we review recent progress on methods that correct for stratification while accounting for these additional complexities.

Figure 1: P–P plots for the visualization of stratification or other confounders.
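
As a rough illustration of how such a plot can be built, the sketch below pairs each observed association P value with the value expected at the same rank under the null hypothesis. The function name, the −log10 scale and the k/(n + 1) plotting position are assumptions made for this example; they are not taken from the figure itself.

```python
import math

def pp_plot_points(p_values):
    """Return (expected, observed) pairs of -log10(P), most significant first.

    Assumption: under the null, the k-th smallest of n P values is
    expected to be k / (n + 1).
    """
    n = len(p_values)
    points = []
    for k, p in enumerate(sorted(p_values), start=1):
        expected = k / (n + 1)
        points.append((-math.log10(expected), -math.log10(p)))
    return points

# Points lying on the diagonal indicate no inflation; points that rise
# above it across the whole range suggest stratification or another confounder.
example_points = pp_plot_points([0.5, 0.25, 0.75])
```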

Author information

Corresponding author

Correspondence to Alkes L. Price.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

1000 genomes Project

ADMIXTURE software

EIGENSTRAT, implemented in the EIGENSOFT software

EMMAX software

FBAT software

Nature Reviews Genetics article series on Genome-wide association studies

NHGRI Catalog of Published Genome-Wide Association Studies

PLINK software

QTDT software

ROADTRIPS software

STRUCTURE and STRAT software

TASSEL software

Glossary

Ancestry-informative markers

Genetic markers ascertained for large differences in allele frequency between subpopulations that are genotyped to infer genetic ancestry in new samples.

Armitage trend test

A standard χ² (1 degree of freedom) association test computed as the number of samples times the squared correlation between genotype and phenotype.
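
The definition above translates directly into a few lines of code. The following is a minimal sketch, assuming genotypes coded as 0, 1 or 2 copies of an allele and phenotypes coded as 0 (control) or 1 (case); the function name and the coding are choices made for this example.

```python
def armitage_trend_chi2(genotypes, phenotypes):
    """Number of samples times the squared genotype-phenotype correlation."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_y = sum((y - mean_y) ** 2 for y in phenotypes)
    if var_g == 0 or var_y == 0:
        raise ValueError("genotype and phenotype must both vary")
    r_squared = cov * cov / (var_g * var_y)
    return n * r_squared

# Eight hypothetical samples: allele counts and case (1) / control (0) status.
stat = armitage_trend_chi2([0, 0, 1, 1, 2, 2, 0, 1], [0, 0, 0, 1, 1, 1, 0, 1])
```

The result is compared against a χ² distribution with 1 degree of freedom.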

Cryptic relatedness

Sample structure due to distant relatedness among samples with no known family relationships.

Differential bias

Spurious differences in allele frequencies between cases and controls due to differences in sample collection, sample preparation and/or genotyping assay procedures.

Exome resequencing

A study design in which exon capture technologies are used to obtain resequencing data covering all exonic regions for each individual in the study.

Family-based association tests

A class of association tests that uses families with one or more affected children as the subjects rather than unrelated cases or controls. The analysis treats the alleles that are transmitted to (one or more) affected children from each parent as 'cases' and the untransmitted alleles as 'controls' to avoid the effects of population structure.

Family structure

Sample structure due to familial relatedness among samples.

FST

A measure of the genetic distance between two populations that describes the proportion of overall genetic variation that is due to differences between populations.
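
One simple way to make this definition concrete, for a single biallelic SNP and two equally weighted populations, is to compare expected heterozygosity in the pooled sample with the average within populations. This is only a sketch of one textbook-style estimator, chosen for brevity; the function name is hypothetical and other estimators are in common use.

```python
def fst_two_populations(p1, p2):
    """FST for one biallelic SNP from allele frequencies p1 and p2.

    Assumption: both populations are weighted equally and sampling
    error in the allele frequencies is ignored.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        raise ValueError("SNP is monomorphic in the pooled sample")
    return (h_total - h_within) / h_total

# Allele frequencies of 0.2 and 0.4 give an FST of roughly 0.048.
fst = fst_two_populations(0.2, 0.4)
```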

Genetic drift

Random fluctuations in allele frequencies over time due to sampling effects, particularly in small populations.

Genetic heritability

The proportion of the total phenotypic variation in a given characteristic that can be attributed to additive genetic effects. In the broad sense, heritability involves all additive and non-additive genetic variance, whereas in the narrow sense, it involves only additive genetic variance.

Genetic matching

A method of association testing in which cases and controls are matched for genetic ancestry, as inferred by principal components analysis or other methods.

Genomic control

A method for detecting (or detecting and correcting for) stratification based on the genome-wide inflation of association statistics.
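
A minimal sketch of the idea follows, assuming 1-degree-of-freedom χ² statistics and the common convention of estimating the inflation factor (often written λ) from the median statistic. The function name, the rounded median constant and the choice never to inflate statistics when λ is below 1 are assumptions of this example.

```python
from statistics import median

# Approximate median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(chi2_stats):
    """Return (inflation factor, corrected statistics)."""
    inflation = median(chi2_stats) / CHI2_1DF_MEDIAN
    # Assumed convention: only deflate statistics, never inflate them.
    divisor = max(inflation, 1.0)
    return inflation, [s / divisor for s in chi2_stats]

# Hypothetical statistics whose median is twice the null median.
inflation, corrected = genomic_control([0.2, 2 * CHI2_1DF_MEDIAN, 3.0])
```

An inflation factor close to 1 suggests little stratification; in this sketch, larger values lead to every statistic being divided by the same factor.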

Mixed models

A class of models in which phenotypes are modelled using both fixed effects (candidate SNPs and fixed covariates) and random effects (the phenotypic covariance matrix).

Multidimensional scaling

A dimensionality reduction technique, similar to principal components analysis, in which points in a high-dimensional space are projected into a lower-dimensional space while approximately preserving the distance between points.

Population structure

Sample structure due to differences in genetic ancestry among samples.

Principal components analysis

A dimensionality reduction technique used to infer continuous axes of variation in genetic data, often representing genetic ancestry.

Rank statistic

A statistic describing the rank, across markers, of association of each marker. Rank statistics can be transformed into quantiles of a standard normal distribution that can be combined with other statistics.
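
The transformation described above can be sketched as follows. The (rank − 0.5)/M plotting position and the function name are assumptions made for this example, and ties between markers are not handled specially.

```python
from statistics import NormalDist

def rank_to_normal_quantiles(stats_by_marker):
    """Map each marker's rank (1 = smallest statistic) to a standard normal quantile."""
    m = len(stats_by_marker)
    order = sorted(range(m), key=lambda i: stats_by_marker[i])
    standard_normal = NormalDist()
    quantiles = [0.0] * m
    for rank, marker_index in enumerate(order, start=1):
        # Assumed plotting position: (rank - 0.5) / M keeps values inside (0, 1).
        quantiles[marker_index] = standard_normal.inv_cdf((rank - 0.5) / m)
    return quantiles

# Three hypothetical markers: the middle-ranked one maps to a quantile of 0.
z_scores = rank_to_normal_quantiles([1.0, 5.0, 3.0])
```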

SNP loadings

The correlations of each SNP to a given principal component in principal components analysis. The principal component coordinates of each sample are proportional to the sum of normalized genotypes weighted by SNP loadings.
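
The relationship between principal component coordinates and SNP loadings can be checked numerically. The sketch below is a toy, pure-Python illustration and not the EIGENSOFT implementation: the genotype normalization by sqrt(p(1 − p)), the power-iteration solver, the function names and the six-sample data set are all assumptions made for this example. Loadings are computed as unscaled inner products, which track the correlations only up to scaling.

```python
import math

def normalize_genotypes(genotypes):
    """Center each SNP and scale by sqrt(p * (1 - p)), p = allele frequency (assumed)."""
    n, m = len(genotypes), len(genotypes[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        mean = sum(row[j] for row in genotypes) / n
        p = mean / 2
        scale = math.sqrt(p * (1 - p))  # assumes no monomorphic SNPs
        for i in range(n):
            out[i][j] = (genotypes[i][j] - mean) / scale
    return out

def top_principal_component(x, iterations=200):
    """Leading eigenvector of the sample-by-sample covariance, by power iteration."""
    n, m = len(x), len(x[0])
    cov = [[sum(x[a][j] * x[b][j] for j in range(m)) / m for b in range(n)]
           for a in range(n)]
    # Start away from the all-ones vector, which centering maps to zero.
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def snp_loadings(x, pc):
    """Unscaled loading of each SNP: inner product of the SNP with the PC."""
    return [sum(pc[i] * x[i][j] for i in range(len(x))) for j in range(len(x[0]))]

def project(x, loadings):
    """Sum of normalized genotypes weighted by SNP loadings, per sample."""
    return [sum(row[j] * loadings[j] for j in range(len(loadings))) for row in x]

# Toy data: three samples from each of two hypothetical subpopulations.
genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 1], [1, 2, 1, 0],
    [0, 0, 2, 2], [0, 1, 2, 1], [1, 0, 1, 2],
]
x = normalize_genotypes(genotypes)
pc1 = top_principal_component(x)
coords = project(x, snp_loadings(x, pc1))
```

In this toy data set the first three samples and the last three fall on opposite sides of the top principal component, and `coords` is a constant multiple of `pc1`, as the definition states.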

Structured association

A method for correcting for stratification in which samples are assigned to subpopulation clusters and evidence of association is stratified by cluster.

Transmission disequilibrium test

A family-based association test involving case–parent trios in which alleles transmitted from parents to children are compared with untransmitted alleles.
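
The comparison of transmitted and untransmitted alleles is commonly summarized by the statistic (b − c)²/(b + c), where b and c count how often heterozygous parents did or did not transmit the allele of interest. The sketch below assumes that form; the function name and the input layout, one (transmitted, untransmitted) pair per parent, are choices made for this example.

```python
def tdt_statistic(parent_alleles, allele="A"):
    """TDT chi-squared statistic (1 degree of freedom) for one allele.

    parent_alleles: (transmitted, untransmitted) pairs, one per parent.
    Homozygous parents are uninformative and are skipped.
    """
    b = sum(1 for t, u in parent_alleles if t == allele and u != allele)
    c = sum(1 for t, u in parent_alleles if u == allele and t != allele)
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)

# Hypothetical trios: 30 transmissions, 10 non-transmissions, 5 homozygous parents.
pairs = [("A", "a")] * 30 + [("a", "A")] * 10 + [("A", "A")] * 5
stat = tdt_statistic(pairs)
```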

About this article

Cite this article

Price, A., Zaitlen, N., Reich, D. et al. New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11, 459–463 (2010). https://doi.org/10.1038/nrg2813

  • DOI: https://doi.org/10.1038/nrg2813
