Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C Chang; Carson C Chow; Laurent Cam Tellier; Shashaank Vattikuti; Shaun M Purcell; James J Lee

doi:10.1186/s13742-015-0047-8

Gigascience. 2015 Feb 25:4:7. doi: 10.1186/s13742-015-0047-8. eCollection 2015.

Authors

Christopher C Chang¹, Carson C Chow², Laurent Cam Tellier³, Shashaank Vattikuti², Shaun M Purcell⁴, James J Lee⁵

Affiliations

¹ Complete Genomics, 2071 Stierlin Court, Mountain View, 94043 CA USA; BGI Cognitive Genomics Lab, Building No. 11, Bei Shan Industrial Zone, Yantian District, Shenzhen, 518083 China.
² Mathematical Biology Section, NIDDK/LBM, National Institutes of Health, Bethesda, 20892 MD USA.
³ BGI Cognitive Genomics Lab, Building No. 11, Bei Shan Industrial Zone, Yantian District, Shenzhen, 518083 China; Bioinformatics Centre, University of Copenhagen, Copenhagen, 2200 Denmark.
⁴ Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142 MA USA; Division of Psychiatric Genomics, Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, 10029 NY USA; Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, 10029 NY USA; Analytic and Translational Genetics Unit, Psychiatric and Neurodevelopmental Genetics Unit, Massachusetts General Hospital, Boston, 02114 MA USA.
⁵ Mathematical Biology Section, NIDDK/LBM, National Institutes of Health, Bethesda, 20892 MD USA; Department of Psychology, University of Minnesota Twin Cities, Minneapolis, 55455 MN USA.

Abstract

Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.

Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format that adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles. This extension is the basis of our planned second release (PLINK 2.0).
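
As an illustration of what bit-level parallelism can mean for genotype data, the following is a minimal Python sketch, not PLINK's actual implementation. It assumes a simplified, hypothetical 2-bit code per genotype (0, 1, or 2 copies of the alternate allele, with 3 meaning missing), which is not claimed to match PLINK's on-disk format. The function names `pack_genotypes` and `alt_allele_count` are invented for this example. Thirty-two genotypes are packed into one 64-bit word, and allele counts are obtained with a few masks and population counts instead of a per-genotype loop.

```python
MASK_LO = 0x5555555555555555  # selects the low bit of every 2-bit field in a 64-bit word


def pack_genotypes(codes):
    """Pack up to 32 two-bit genotype codes into one non-negative integer.

    Assumed (illustrative) encoding: 0, 1, 2 = alternate-allele count, 3 = missing.
    """
    if len(codes) > 32:
        raise ValueError("at most 32 genotypes fit in one 64-bit word")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 3) << (2 * i)
    return word


def alt_allele_count(word):
    """Return (alternate-allele count, missing-genotype count) for one packed word."""
    lo = word & MASK_LO          # low bit of each field, aligned to even positions
    hi = (word >> 1) & MASK_LO   # high bit of each field, shifted to even positions
    het = lo & ~hi               # fields equal to 1 (one alternate allele)
    hom_alt = hi & ~lo           # fields equal to 2 (two alternate alleles)
    missing = lo & hi            # fields equal to 3 (missing)
    popcount = lambda x: bin(x).count("1")
    return popcount(het) + 2 * popcount(hom_alt), popcount(missing)


if __name__ == "__main__":
    genotypes = [0, 1, 2, 3, 1, 2, 2]
    print(alt_allele_count(pack_genotypes(genotypes)))
```

In a compiled language the same masks would operate on machine words and a hardware population-count instruction, so all 32 genotypes in a word are processed in a handful of operations.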

Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Keywords: Computational statistics; GWAS; High-density SNP genotyping; Population genetics; Whole-genome sequencing.

Publication types

Research Support, N.I.H., Intramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Computational Biology*
Datasets as Topic*
Genetics, Population
Genome-Wide Association Study
Genotyping Techniques
Likelihood Functions
Linkage Disequilibrium
Logistic Models
Polymorphism, Single Nucleotide
Software*

Grants and funding

Intramural NIH HHS/United States