medRxiv

Leveraging cancer mutation data to predict the pathogenicity of germline missense variants

Bushra Haque, David Cheerie, Amy Pan, Meredith Curtis, Thomas Nalpathamkalam, Jimmy Nguyen, Celine Salhab, Bhooma Thiruvahindrapura, Jade Zhang, Madeline Couse, Taila Hartley, Michelle M. Morrow, E Magda Price, Susan Walker, David Malkin, Frederick P. Roth, Gregory Costain
doi: https://doi.org/10.1101/2024.03.11.24304106
Bushra Haque (1, 2)
David Cheerie (1, 2)
Amy Pan (2)
Meredith Curtis (1, 2)
Thomas Nalpathamkalam (3)
Jimmy Nguyen (1, 2)
Celine Salhab (1, 2)
Bhooma Thiruvahindrapura (3)
Jade Zhang (4)
Madeline Couse (5)
Taila Hartley (6)
Michelle M. Morrow (7)
E Magda Price (6)
Susan Walker (8)
David Malkin (9)
Frederick P. Roth (2, 10, 11, 12, 13)
Gregory Costain (1, 2, 3, 14)

1 Program in Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, Canada
2 Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
3 The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, ON, Canada
4 Human Biology Program, University of Toronto, Toronto, Ontario, Canada
5 Centre for Computational Medicine, Hospital for Sick Children, Toronto, ON, Canada
6 Children’s Hospital of Eastern Ontario Research Institute, University of Ottawa, Ottawa, Ontario, Canada
7 GeneDx, Gaithersburg, Maryland, USA
8 Genomics England, London, United Kingdom
9 Division of Haematology/Oncology, The Hospital for Sick Children, Department of Pediatrics, University of Toronto, Toronto, Ontario, Canada
10 Donnelly Centre for Cellular and Biomolecular Research (CCBR), University of Toronto, Toronto, Ontario, Canada
11 Lunenfeld-Tanenbaum Research Institute (LTRI), Sinai Health System, Toronto, Ontario, Canada
12 Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, USA
13 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
14 Division of Clinical and Metabolic Genetics, The Hospital for Sick Children, and Department of Paediatrics, University of Toronto, Toronto, ON, Canada

For correspondence: gregory.costain{at}sickkids.ca

ABSTRACT

Innovative and easy-to-implement strategies are needed to improve the pathogenicity assessment of rare germline missense variants. Somatic cancer driver mutations identified through large-scale tumor sequencing studies often affect genes that are also associated with rare Mendelian disorders. Whether cancer mutation data can aid the interpretation of germline missense variants, regardless of whether the gene is associated with a hereditary cancer predisposition syndrome or a non-cancer-related developmental disorder, has not been systematically assessed. We extracted putative cancer driver missense mutations from the Cancer Hotspots database and annotated them as germline variants, including their presence or absence and classification in ClinVar. We trained two supervised learning models (logistic regression and random forest) to predict the ClinVar classifications of germline missense variants using Cancer Hotspots data (training dataset). The performance of each model was evaluated with an independent test dataset generated in part by searching public and private genome-wide sequencing datasets from ∼1.5 million individuals. Of the 2,447 cancer mutations, 691 corresponding germline variants had been previously classified in ClinVar: 426 (61.6%) as likely pathogenic/pathogenic, 261 (37.8%) as uncertain significance, and 4 (0.6%) as likely benign/benign. The odds ratio for a likely pathogenic/pathogenic classification in ClinVar was 28.3 (95% confidence interval: 24.2-33.1, p < 0.001), compared with all other germline missense variants in the same 216 genes. Both supervised learning models showed high correlation with pathogenicity assessments in the training dataset. When applied to the test dataset, the logistic regression and random forest models achieved high areas under the precision-recall curve of 0.847 and 0.829, respectively.
Using cancer and germline datasets together with supervised learning techniques, our study shows that cancer mutation data can be leveraged to improve the interpretation of germline missense variation potentially causing rare Mendelian disorders.
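To make the odds ratio reported above concrete, the following is a minimal sketch of how an odds ratio and a Wald (Woolf) 95% confidence interval can be computed from a 2x2 table. The abstract does not state which interval method the authors used, so the Woolf method is an assumption, as is the choice to contrast likely pathogenic/pathogenic against all other classifications. The counts `a` and `b` follow from the abstract (426 of 691); `c` and `d` are hypothetical placeholders for the comparison group, so the printed value is not expected to reproduce 28.3.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald (Woolf) confidence interval for a 2x2 table.

    a: hotspot-matching germline variants classified P/LP
    b: hotspot-matching germline variants with any other classification
    c: other missense variants in the same genes classified P/LP
    d: other missense variants in the same genes with any other classification
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# a and b are taken from the abstract: 426 P/LP out of 691 classified variants.
a, b = 426, 691 - 426
# c and d are HYPOTHETICAL placeholders; the abstract does not report them.
c, d = 1000, 20000

estimate, lower, upper = odds_ratio_ci(a, b, c, d)
print(f"OR = {estimate:.1f} (95% CI: {lower:.1f}-{upper:.1f})")
```

The interval is symmetric on the log scale, which is why the lower and upper bounds are obtained by exponentiating log(OR) plus or minus z standard errors.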

AUTHOR SUMMARY

Our study introduces an approach to improve the interpretation of rare genetic variation, specifically missense variants that can alter proteins and cause disease. We found that published evidence from somatic cancer sequencing studies may be relevant to understanding the impact of the same variant in the context of rare inherited (Mendelian) disorders. Using widely available datasets, we noted that many cancer driver mutations have also been observed as rare germline variants associated with inherited disorders. This overlap led us to use machine learning techniques to assess how well cancer mutation data can predict the pathogenicity of germline variants. We trained machine learning models and tested them on a separate dataset curated by searching public and private genome-wide sequencing data from over a million participants. Our models successfully identified pathogenic genetic changes, demonstrating strong performance in predicting disease-causing variants. This study highlights that cancer mutation data can enhance the interpretation of rare missense variants, aiding in the diagnosis and understanding of rare diseases. Integrating this approach into current genetic classification frameworks could be beneficial, and it opens new avenues for leveraging existing cancer research to benefit broader genetic research and diagnostics for rare genetic conditions.
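The test-set evaluation described above is summarized in the abstract by the area under the precision-recall curve. As an illustration only, the sketch below computes average precision, one common estimator of that area, from binary labels and model scores. Whether the authors used this estimator is not stated in the text, and the labels and scores shown are invented toy values, not study data.

```python
def average_precision(labels, scores):
    """Average precision: the mean of precision values at each true positive.

    labels: sequence of 0/1 values (1 = pathogenic in this toy setting)
    scores: sequence of model scores, higher meaning more likely pathogenic
    Ties in scores are broken by input order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    # Rank variants from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `rank` variants.
            precision_sum += hits / rank
    return precision_sum / n_positive


# HYPOTHETICAL toy data for illustration; not taken from the study.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(f"average precision = {average_precision(toy_labels, toy_scores):.3f}")
```

Unlike the area under the ROC curve, this metric depends on the proportion of pathogenic variants in the test set, so values are only comparable between models evaluated on the same dataset.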

Competing Interest Statement

SW is an employee of Genomics England Limited. MMM is an employee of GeneDx, LLC. The remaining authors have no potential conflicts of interest to declare.

Funding Statement

The study was funded by SickKids Research Institute, Canadian Institutes of Health Research, and the University of Toronto McLaughlin Centre. The funders had no role in the design and conduct of the study.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Research Ethics Board at the Hospital for Sick Children gave ethical approval for this secondary-use data study. The Institutional Review Board of GeneDx gave ethical approval for the use of de-identified data from GeneDx, which was assessed in accordance with an IRB-approved protocol (WIRB #20171030).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • Revised background with added definitions; expanded Results section to include additional analyses on loss-of-function versus gain-of-function, and performance analysis with VEST4 and MutPred2; Results now include both ROC and precision-recall curves; Discussion section updated; New supplemental figures and tables.

AVAILABILITY OF DATA AND MATERIALS

The cancer mutation data from Cancer Hotspots that support the findings of this study are available through a public database at the following URL: https://www.cancerhotspots.org/ (DOI: 10.1038/nbt.3391). Germline variants and their classifications are available in the ClinVar public archive: https://www.ncbi.nlm.nih.gov/clinvar/ (DOI: https://doi.org/10.1093/nar/gkx1153). The Python script used for the Cancer Hotspots cancer mutation data transformation is openly available in a GitHub repository: https://github.com/haqueb2/Cancer-Hotspots-Reformat. The training dataset used to train the supervised learning models, the LRM and RFM pathogenicity scores assigned to training and test dataset variants, and the prediction scores generated by other in silico tools for the test dataset are all available in Supplemental Table 6. All variants used in the test and training datasets are included in Supplemental Table 6. R scripts used to train the supervised learning models can be found in Supplemental Appendix 1 and 2. Datasets from Genomics England (DOI: https://doi.org/10.6084/m9.figshare.4530893.v7), MSSNG (DOI: 10.1016/j.cell.2022.10.009), Care4Rare (DOI: 10.1016/j.ajhg.2022.10.002), and GeneDx are not openly available due to controlled access requirements. Access to these datasets can be requested from the respective organizations.

  • LIST OF ABBREVIATIONS

  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
    Posted October 28, 2024.

    Subject Area

    • Genetic and Genomic Medicine