medRxiv

Genomic Machine Learning Meta-regression: Insights on Associations of Study Features with Reported Model Performance

Eric Barnett, Daniel Onete, Asif Salekin, Stephen V Faraone
doi: https://doi.org/10.1101/2022.01.10.22268751
Eric Barnett
1Department of Neuroscience and Physiology, SUNY Upstate Medical University, Syracuse, New York, USA
2College of Medicine, MD Program, SUNY Upstate Medical University, Syracuse, New York, USA
Daniel Onete
2College of Medicine, MD Program, SUNY Upstate Medical University, Syracuse, New York, USA
Asif Salekin
3Syracuse University
Stephen V Faraone
1Department of Neuroscience and Physiology, SUNY Upstate Medical University, Syracuse, New York, USA
4Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Syracuse, New York, USA
  • For correspondence: svfaraone@upstate.edu

Abstract

Background Many studies have sought to predict the diagnostic status of a disorder by combining genetic data with machine learning. The methods of these studies often differ drastically, which makes it hard to judge which components of a study led to better results and whether better reported results represent a true improvement or an uncorrected bias inflating performance.

Methods In this systematic review, we extracted information about the methods used and other differentiating features in genomic machine learning models. We used the extracted features in mixed-effects linear regression models predicting model performance. We tested for univariate and multivariate associations as well as interactions between features.
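A mixed-effects linear regression of this kind could be written as follows. This is only an illustrative sketch: the abstract does not state the random-effects structure, so the random intercept per study, the single interaction term, and all symbol names are assumptions rather than the authors' specification.

```latex
% Assumed notation (not from the source):
%   y_{ij}  reported performance of model i in study j
%   H_{ij}  number of hyperparameter optimizations reported
%   L_{ij}  indicator of data leakage due to feature selection
%   N_{ij}  training size
%   u_j     random intercept for study j
\[
  y_{ij} = \beta_0 + \beta_1 H_{ij} + \beta_2 L_{ij} + \beta_3 N_{ij}
         + \beta_4 \,(H_{ij} \times N_{ij}) + u_j + \varepsilon_{ij},
  \qquad
  u_j \sim \mathcal{N}(0, \tau^2), \quad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

Under this assumed form, a univariate test would keep a single fixed effect, the multivariate model would keep several, and an interaction test would examine a product term such as \(\beta_4\).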

Results In univariate models, the number of hyperparameter optimizations reported and data leakage due to feature selection were significantly associated with an increase in reported model performance. In our multivariate model, the number of hyperparameter optimizations, data leakage due to feature selection, and training size were significantly associated with an increase in reported model performance. The interaction between number of hyperparameter optimizations and training size, as well as the interaction between data leakage due to optimization and training size, were significantly associated with reported model performance.

Conclusions Our results suggest that methods susceptible to data leakage are prevalent in genomic machine learning research, which may result in inflated reported performance. The interactions of these features with training size suggest that, if methods susceptible to data leakage continue to be used, modelling efforts using larger data sets may yield unexpectedly lower performance than those using smaller data sets. Best practice guidelines that promote the avoidance and recognition of data leakage may help the field advance and avoid biased results.
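To make the notion of data leakage due to feature selection concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' analysis: the data are pure noise, the nearest-centroid classifier is a stand-in chosen for brevity, and every function name and parameter value is hypothetical. Selecting features on the full data set before cross-validation lets the test samples influence the model, whereas selecting inside each training fold does not.

```python
import random


def make_noise_data(n_samples, n_features, rng):
    """Pure-noise features with balanced labels, so no real signal exists."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def top_k_features(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X, y, train_rows, test_row, features):
    """Assign test_row to the class whose training centroid is closer."""
    dists = {}
    for label in (0, 1):
        members = [i for i in train_rows if y[i] == label]
        d = 0.0
        for j in features:
            centroid = sum(X[i][j] for i in members) / len(members)
            d += (X[test_row][j] - centroid) ** 2
        dists[label] = d
    return 0 if dists[0] <= dists[1] else 1


def cross_validated_accuracy(X, y, k, n_folds, leaky):
    """k-fold accuracy; if `leaky`, features are chosen once on ALL samples."""
    all_rows = list(range(len(X)))
    leaked = top_k_features(X, y, all_rows, k) if leaky else None
    correct = 0
    for fold in range(n_folds):
        test_rows = [i for i in all_rows if i % n_folds == fold]
        train_rows = [i for i in all_rows if i % n_folds != fold]
        # Leak-free path: selection sees only the training fold.
        features = leaked if leaky else top_k_features(X, y, train_rows, k)
        for t in test_rows:
            pred = nearest_centroid_predict(X, y, train_rows, t, features)
            correct += int(pred == y[t])
    return correct / len(X)


if __name__ == "__main__":
    rng = random.Random(0)
    # An odd fold count keeps both classes in every fold with this labelling.
    X, y = make_noise_data(n_samples=40, n_features=2000, rng=rng)
    print("leaky selection:    ", cross_validated_accuracy(X, y, 10, 5, leaky=True))
    print("selection per fold: ", cross_validated_accuracy(X, y, 10, 5, leaky=False))
```

Because the labels carry no information about the features, an unbiased estimate should sit near chance; the leaky variant is expected to report a noticeably higher accuracy on the same data, which is the kind of inflation the conclusions warn about.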

Competing Interest Statement

In the past year, Dr. Faraone received income, potential income, travel expenses, continuing education support, and/or research support from Aardvark, Akili, Genomind, Ironshore, KemPharm/Corium, Noven, Ondosis, Otsuka, Rhodes, Supernus, Takeda, Tris, and Vallon. With his institution, he has US patent US20130217707 A1 for the use of sodium-hydrogen exchange inhibitors in the treatment of ADHD. In previous years, he received support from Alcobra, Arbor, Aveksham, CogCubed, Eli Lilly, Enzymotec, Impact, Janssen, Lundbeck/Takeda, McNeil, NeuroLifeSciences, Neurovance, Novartis, Pfizer, Shire, and Sunovion. He also receives royalties from books published by Guilford Press: Straight Talk about Your Child's Mental Health; Oxford University Press: Schizophrenia: The Facts; and Elsevier: ADHD: Non-Pharmacologic Interventions. He is also Program Director of www.adhdinadults.com. Dr. Faraone is supported by NIMH grants U01MH109536-01, U01AR076092-01A1, R0MH116037 and 5R01AG06495502; Oregon Health and Science University; Otsuka Pharmaceuticals; and Supernus Pharmaceutical Company. Dr. Salekin is currently supported by the NSF (grant #2124285), NIDCD (grant OSP Institution # SP-31861-2), and New York University (grant OSP Institution #SP-30255-2). Eric Barnett and Daniel Onete have no financial disclosures.

Funding Statement

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 667302 and No 965381.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • Correction to Author Order

Data Availability

All data availability information and data used in this study can be found in the referenced primary GWASs.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
Posted January 19, 2022.
Genomic Machine Learning Meta-regression: Insights on Associations of Study Features with Reported Model Performance
Eric Barnett, Daniel Onete, Asif Salekin, Stephen V Faraone
medRxiv 2022.01.10.22268751; doi: https://doi.org/10.1101/2022.01.10.22268751
Subject Area

  • Genetic and Genomic Medicine