Abstract
The human genome contains tens of thousands of large tandem repeats and hundreds of genes that show common and highly variable copy number changes. Due to their large size and repetitive nature, these Variable Number Tandem Repeats (VNTRs) and multicopy genes are generally recalcitrant to standard genotyping approaches, and as a result this class of variation is poorly characterized. However, several recent studies have demonstrated that copy number variation of VNTRs can modify local gene expression, epigenetics and human traits, indicating that many have a functional role. Here, using read depth from whole genome sequencing to profile copy number, we report results of a phenome-wide association study (PheWAS) of VNTRs and multicopy genes in a discovery cohort of ∼35,000 samples, identifying 32 traits associated with copy number of 38 VNTRs and multicopy genes at 1% FDR. We replicated many of these signals in an independent cohort, and observed that VNTRs showing trait associations were significantly enriched for expression QTLs with nearby genes, providing strong support for our results. Fine-mapping studies indicated that in the majority (∼90%) of cases, the VNTRs and multicopy genes we identified represent the causal variants underlying the observed associations. Furthermore, several lie in regions where prior SNV-based GWAS have failed to identify any significant associations with these traits. Our study indicates that copy number of VNTRs and multicopy genes contributes to diverse human traits, and suggests that complex structural variants potentially explain some of the so-called “missing heritability” of SNV-based GWAS.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by NIH grants NS105781 and NS120241 to AJS, NIH predoctoral fellowship NS108797 to OR, and NHLBI BioData Catalyst Fellowship 5120339 to AMT. Research reported in this paper was supported by the Office of Research Infrastructure of the National Institutes of Health under award number S10OD018522. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai. A full list of acknowledgements for all datasets used in this study is provided in the Supplemental Data file.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study was approved by, and the procedures followed were in accordance with the ethical standards of the Institutional Review Board of the Icahn School of Medicine under HS# 19-01376.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
All data produced in the present study are available upon reasonable request to the authors. Datasets used in this study are available as follows:
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed: Women's Health Initiative (WHI), https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001237.v2.p1
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed - NHGRI CCDG: Atherosclerosis Risk in Communities (ARIC), https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001211.v4.p3
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed: MESA and MESA Family AA-CAC, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001416.v2.p1
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed: The Jackson Heart Study (JHS), https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000964.v5.p1
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed: Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing Project: Cardiovascular Health Study, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001368.v3.p2
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed: Genetic Epidemiology of COPD (COPDGene), https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000951.v5.p5
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed: Genomic Activities such as Whole Genome Sequencing and Related Phenotypes in the Framingham Heart Study, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000974.v4.p3
Database of Genotypes and Phenotypes (dbGaP), NHLBI TOPMed - NHGRI CCDG: The BioMe Biobank at Mount Sinai, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001644.v2.p2
Database of Genotypes and Phenotypes (dbGaP), GTEx data, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v7.p2
GTEx portal, https://www.gtexportal.org/
Human Genome Diversity Panel, https://www.internationalgenome.org/data-portal/data-collection/hgdp
Parkinson's Progression Markers Initiative (PPMI), https://www.ppmi-info.org/