medRxiv

Identifying subtypes of heart failure with machine learning: external, prognostic and genetic validation in three electronic health record sources with 320,863 individuals

Amitava Banerjee, Suliang Chen, Muhammad Dashtban, Laura Pasea, Johan H Thygesen, Ghazaleh Fatemifar, Benoit Tyl, Tomasz Dyszynski, Folkert W. Asselbergs, Lars H. Lund, Tom Lumbers, Spiros Denaxas, Harry Hemingway
doi: https://doi.org/10.1101/2022.06.27.22276961
Amitava Banerjee
1 Institute of Health Informatics, University College London, London, UK
2 Health Data Research UK, University College London, London, UK
3 Barts Health NHS Trust, London, UK
4 University College London Hospitals NHS Trust, London, UK
Roles: professor in clinical data science and honorary consultant cardiologist
  • For correspondence: ami.banerjee@ucl.ac.uk
Suliang Chen
1 Institute of Health Informatics, University College London, London, UK
Roles: post-doctoral data scientist
Muhammad Dashtban
1 Institute of Health Informatics, University College London, London, UK
Roles: post-doctoral data scientist
Laura Pasea
1 Institute of Health Informatics, University College London, London, UK
Roles: post-doctoral statistician
Johan H Thygesen
Roles: PhD lecturer in health data science
Ghazaleh Fatemifar
1 Institute of Health Informatics, University College London, London, UK
Roles: post-doctoral data scientist
Benoit Tyl
5 Bayer HealthCare SAS, Medical Affairs, Pharmaceuticals, BP 103 10 Place de Belgique, F-92254 La Garenne Colombes Cedex, France
Roles: IEG Medical Advisor, Integrated Care, Cardiovascular
Tomasz Dyszynski
6 Bayer AG, Medical Affairs & Pharmacovigilance, Pharmaceuticals TG Cardio, Thrombosis & Hemophilia Building M084, 112 13353 Berlin, Germany
Roles: global safety leader
Folkert W. Asselbergs
1 Institute of Health Informatics, University College London, London, UK
2 Health Data Research UK, University College London, London, UK
7 Department of Cardiology, Division Heart & Lungs, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
8 Institute of Cardiovascular Science, Faculty of Population Health Sciences, University College London, London, United Kingdom
Roles: professor of precision medicine and consultant cardiologist
Lars H. Lund
9 Division of Cardiology, Department of Medicine, Karolinska Institutet, Stockholm, Sweden; Heart and Vascular Theme, Karolinska University Hospital, Stockholm, Sweden
Roles: professor of cardiology
Tom Lumbers
1 Institute of Health Informatics, University College London, London, UK
2 Health Data Research UK, University College London, London, UK
3 Barts Health NHS Trust, London, UK
4 University College London Hospitals NHS Trust, London, UK
Roles: UKRI Rutherford Fellow and honorary consultant cardiologist
Spiros Denaxas
1 Institute of Health Informatics, University College London, London, UK
2 Health Data Research UK, University College London, London, UK
Roles: professor of biomedical informatics
Harry Hemingway
1 Institute of Health Informatics, University College London, London, UK
2 Health Data Research UK, University College London, London, UK
10 University College London Hospitals National Institute for Health Research (NIHR) Biomedical Research Centre
Roles: professor of clinical epidemiology and honorary consultant in public

Abstract

Background Reliable identification of heart failure (HF) subtypes might allow targeted management. Machine learning (ML) has been used to explore HF subtypes, but neither across large, independent, population-based datasets, nor across the full spectrum of causes and presentations, nor with clinical and non-clinical validation by different ML methods. Using our published framework, we identified and validated HF subtypes to address these gaps.

Methods We analysed individuals ≥30 years with incident HF from two population-based electronic health record resources (1998-2018; Clinical Practice Research Datalink, CPRD: n=188,799 HF cases; The Health Improvement Network, THIN: n=124,263 HF cases). Pre- and post-HF factors (n=645) included demography, history, examination, blood laboratory values and medications. We identified subtypes using four unsupervised ML methods (K-means, hierarchical, K-Medoids and mixture model clustering) with 87 (from 645) factors in each dataset. We evaluated subtypes for: (i) external validity (across independent datasets); (ii) prognostic validity (predictive accuracy for 1-year mortality); and (iii) uniquely, genetic validity (in UK Biobank; n=9,573 cases): association with polygenic risk score (PRS) for 11 HF-related traits, and direct association with 12 reported HF single nucleotide polymorphisms (SNPs).
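
To make the clustering step concrete, the following is a minimal sketch of K-means (Lloyd's algorithm), one of the four methods named above, written in plain Python. It is illustrative only and is not the authors' pipeline: the two-feature toy points, the function name `kmeans`, and the naive "first k points" initialisation are all assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm. Returns (centroids, labels).

    Assumption for reproducibility: centroids start at the first k points.
    Real analyses usually use random restarts or k-means++ seeding.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        converged = new_labels == labels and new_centroids == centroids
        labels, centroids = new_labels, new_centroids
        if converged:
            break
    return centroids, labels


# Hypothetical, already-standardised toy data: two well-separated groups.
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(toy_points, k=2)
print(labels)
```

On this toy input the first three points and the last three points end up in different clusters. In practice the number of clusters and the stability of the solution would be checked across methods and datasets, as the study design describes.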

Findings After identifying five clusters, we labelled HF subtypes: (1) Early-onset, (2) Late-onset, (3) AF-related, (4) Metabolic, and (5) Cardiometabolic. External validity: Subtypes were similar across datasets (c-statistic: 0.94, 0.80, 0.79, 0.83, 0.92 for the THIN model in CPRD and 0.79, 0.92, 0.90, 0.89, 0.92 for the CPRD model in THIN for subtypes 1-5, respectively). Prognostic validity: One-year all-cause mortality, risk of non-fatal cardiovascular diseases and all-cause hospitalisation (before and after HF diagnosis) differed across subtypes in CPRD and THIN data. Genetic validity: The AF-related subtype showed associations with PRS for related traits. Late-onset and Cardiometabolic subtypes were most comparable and strongly associated with PRS for Hypertension, Myocardial Infarction and Obesity (p-value < 9.09 × 10⁻⁴). We developed a prototype for clinical use, which could enable evaluation of effectiveness and cost-effectiveness.
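
Two quantities reported in the findings can be illustrated with a short Python sketch. The c-statistic is the probability that a randomly chosen member of a subtype receives a higher model score than a randomly chosen non-member; the abstract does not say how it was computed per subtype, so the one-versus-rest framing here is an assumption. The p-value threshold of 9.09 × 10⁻⁴ is numerically consistent with a Bonferroni correction of 0.05 over 55 tests (for example 11 traits × 5 subtypes), but the abstract does not state this, so that reading is also an assumption. The scores and labels below are invented for illustration.

```python
def c_statistic(scores, labels):
    """Concordance (equivalent to ROC AUC) for binary labels.

    Counts, over all (member, non-member) pairs, how often the member
    has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return concordant / (len(pos) * len(neg))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical scores for "belongs to subtype X" (1) versus "does not" (0).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(round(c_statistic(scores, labels), 3))   # 8 of 9 pairs are concordant

# Assumed test count: 11 traits x 5 subtypes = 55 (not stated in the abstract).
print(bonferroni_threshold(0.05, 11 * 5))
```

The pairwise form is quadratic in sample size, which is fine for a sketch; at the scale of the datasets described here a rank-based computation would be the usual choice.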

Interpretation Across four methods and three datasets, and including genetic data, in the largest HF study to date, ML algorithms identified five subtypes in individuals with incident HF. These subtypes may inform aetiologic research, clinical risk prediction and the design of HF trials.

Funding European Union Innovative Medicines Initiative.

Evidence before this study In a systematic review covering studies up to December 2019, we showed that studies of machine learning in subtyping and risk prediction in cardiovascular diseases are limited by small population size, relatively few factors and poor generalisability of findings due to lack of external validation. We further searched PubMed, medRxiv, bioRxiv and arXiv for relevant peer-reviewed articles and preprints, focusing on machine learning studies in heart failure. Studies remain focused on single diseases and limited risk factors, often use a single machine learning method, rarely use subtyping and risk prediction together, and have not been externally validated across datasets. For heart failure, all subtype discovery studies have identified subtypes based on clustering, but so far with no application to clinical practice.

Added value of this study Across two independent, population-based datasets, we used four machine learning methods for subtyping and risk prediction with 89 aetiologic factors as well as 556 further factors for heart failure. We identified and validated five subtypes in incident heart failure, which differentially predicted outcomes. In addition, we externally validated clinical cluster differences by exploring corresponding genetic differences in a large-scale genetic cohort. Our methods and results highlight the potential value of electronic health records and machine learning in understanding disease subtypes. Moreover, our approach to external, prognostic, and genetic validity provides a framework for validating machine learning approaches to disease subtype discovery.

Implications of all the available evidence Our analyses support coordinated use of large-scale, linked electronic health records to identify and validate disease subtypes with relevance for clinical risk prediction, patient selection for trials, and future genetic research.

Competing Interest Statement

AB is supported by research funding from the National Institute for Health Research (NIHR), British Medical Association, AstraZeneca, and UK Research and Innovation. BT and TD are employees of Bayer. All other authors declare no competing interests.

Funding Statement

European Union Innovative Medicines Initiative.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Approvals were by: (i) MHRA Independent Scientific Advisory Committee [18_217R]: Section 251 (NHS Social Care Act 2006), (ii) Scientific Review Committee [17THIN038-A1] and (iii) UKB 15422: Patient informed consent was not required or provided.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • * joint second author

  • slc5{at}hotmail.com

  • a.dashtban{at}ucl.ac.uk

  • l.pasea{at}ucl.ac.uk

  • j.thygesen{at}ucl.ac.uk

  • ghazalehfatemifar{at}gmail.com

  • benoit.tyl{at}bayer.com

  • tomasz.dyszynski{at}bayer.com

  • f.asselbergs{at}ucl.ac.uk

  • lars.lund{at}alumni.duke.edu

  • t.lumbers{at}ucl.ac.uk

  • s.denaxas{at}ucl.ac.uk

  • h.hemingway{at}ucl.ac.uk

Data Availability

All data produced in the present work are contained in the manuscript.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted June 28, 2022.

Subject Area

  • Cardiovascular Medicine