Abstract
Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify new genetic signals. Here we train a deep convolutional neural network on noisy self-reported and ICD-based labels to predict COPD case/control status from high-dimensional raw spirograms and use the model predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls (AUROC = 0.82 ± 0.01) and COPD-related hospitalization (AUROC = 0.89 ± 0.01) without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival (hazard ratio = 1.22 ± 0.01; P ≤ 2 × 10⁻¹⁶) and exacerbation events (R² = 0.10 ± 0.01; P ≤ 4 × 10⁻¹⁰¹). A genome-wide association study on the ML-based liability score not only replicates existing COPD and lung function loci but also identifies 67 new loci. Thirty-eight of these have supportive evidence in independent datasets, including a locus near LTBR. We demonstrate the biological plausibility of the novel variants through enrichment analyses, phenome-wide association studies, and generalizability of COPD prediction in multiple datasets. These results provide an example of the potential to improve genetic discovery of disease-relevant variants by training deep neural networks to predict noisy labels from high-dimensional raw data.
Competing Interest Statement
J.C., B.B., B.A., Z.R.M., A.W.C., C.Y.M., and F.H. are employees of Google LLC and own Alphabet stock. This study was funded by Google LLC. B.D.H. receives grant support from Bayer. M.H.C. has received grant support from GSK and Bayer, and consulting or speaking fees from Genentech, AstraZeneca, and Illumina.
Funding Statement
This study was funded by Google LLC. B.D.H. is supported by NIH K08 HL136928, U01 HL089856, R01 HL155749, and a Research Grant from the Alpha-1 Foundation. M.H.C. is supported by R01 HL153248, R01 HL149861, R01 HL147148, and R01 HL089856. D.H. was supported by NIH 2T32HL007427-41.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Advarra IRB (Columbia, MD) waived ethical approval for this work involving de-identified medical imagery and metadata under 45 CFR 46. Work related to genomics data was additionally reviewed by the respective data sources: UK Biobank, COPDGene, and ICGC. This research has been conducted using the UK Biobank Resource under Application Number 65275.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Genotypes and phenotypes are available for approved projects through the UK Biobank study. This research has been conducted under Application Number 65275. We utilized the GWAS Catalog for replication analysis. This research used data generated by the COPDGene study (dbGaP accession phs000179.v6.p2), which was supported by NIH grants U01 HL089856 and U01 HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion. ICGC (International COPD Genetics Consortium) genome-wide association summary statistics were obtained from dbGaP under accession phs000179.v5.p2. SpiroMeta summary statistics were obtained from LDHub.