The origin, evolution, and functional impact of short insertion–deletion variants identified in 179 human genomes

  1. Gerton Lunter (13,16)
  1. Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211, Switzerland;
  2. Department of Pathology,
  3. Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA;
  4. Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRIA, UMR5558, Villeurbanne, France;
  5. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, CB10 1HH, Cambridge, United Kingdom;
  6. Department of Human Genetics, Radboud University Medical Centre, 6500 HB Nijmegen, The Netherlands;
  7. Department of Genetics, Albert Einstein College of Medicine, Bronx, New York 10461, USA;
  8. Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA;
  9. Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802-5301, USA;
  10. Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA;
  11. Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA;
  12. Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom;
  13. Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom
    • 14 These authors contributed equally to this work.

    • 15 Present address: Peter MacCallum Cancer Centre, East Melbourne, Victoria, 3002 Australia.

    Abstract

    Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%–48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variants underlying some of these associations may be indels.

    Footnotes

    • 16 Corresponding authors

      E-mail gerton.lunter{at}well.ox.ac.uk

      E-mail smontgom{at}stanford.edu

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.148718.112.

      Freely available online through the Genome Research Open Access option.

    • Received September 3, 2012.
    • Accepted February 27, 2013.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 3.0 Unported License), as described at http://creativecommons.org/licenses/by-nc/3.0/.
