medRxiv

Age-dependent topic modelling of comorbidities in UK Biobank identifies disease subtypes with differential genetic risk

Xilin Jiang, Martin Jinye Zhang, Yidong Zhang, Michael Inouye, Chris Holmes, Alkes L. Price, Gil McVean
doi: https://doi.org/10.1101/2022.10.23.22281420
Xilin Jiang
1Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
2Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
3Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
4Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
5British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge UK
6Heart and Lung Research Institute, University of Cambridge, Cambridge UK
  • For correspondence: xilinjiang@hsph.harvard.edu aprice@hsph.harvard.edu gil.mcvean@bdi.ox.ac.uk
Martin Jinye Zhang
4Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
7Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Yidong Zhang
1Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
8CAMS China Oxford Institute, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7BN, UK
9Department of Radiation Oncology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Michael Inouye
5British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge UK
6Heart and Lung Research Institute, University of Cambridge, Cambridge UK
10Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
11Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
12British Heart Foundation Cambridge Centre of Research Excellence, Department of Clinical Medicine, University of Cambridge, Cambridge, UK
13Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia
Chris Holmes
1Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
2Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
14The Alan Turing Institute, London NW1 2DB, UK
Alkes L. Price
4Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
7Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
15Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Gil McVean
1Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK

Abstract

Longitudinal data from electronic health records (EHR) has immense potential to improve clinical diagnoses and personalised medicine, motivating efforts to identify disease subtypes from age-dependent patient comorbidity information. We introduce an age-dependent topic modelling (ATM) method that provides a low-rank representation of longitudinal records of hundreds of distinct diseases in large EHR data sets. The model learns, and assigns to each individual, topic weights for several disease topics, each of which reflects a set of diseases that tend to co-occur as a function of age. Simulations show that ATM attains high accuracy in distinguishing distinct age-dependent comorbidity profiles. We applied ATM to 282,957 UK Biobank samples, analysing 1,726,144 disease diagnoses spanning 348 diseases with ≥1,000 incidences. We inferred 10 disease topics optimising model fit. We identified 52 diseases with heterogeneous comorbidity profiles (≥500 incidences assigned to each of ≥2 topics), including breast cancer, type 2 diabetes (T2D), hypertension, and hypercholesterolemia; for most of these diseases, topic assignments were highly age-dependent, suggesting differences in disease aetiology for early-onset vs. late-onset disease. We defined subtypes of the 52 heterogeneous diseases based on the topic assignments, and compared genetic risk across subtypes using polygenic risk scores (PRS). We identified 18 disease subtypes whose PRS differed significantly from other subtypes of the same disease, including a subtype of T2D characterised by cardiovascular comorbidities and a subtype of asthma characterised by dermatological comorbidities. We further identified specific SNPs underlying these differences. For example, the T2D-associated SNP rs1063192 in the CDKN2B locus has a higher odds ratio in the top quartile of cardiovascular topic weight (1.19±0.02) than in the bottom quartile (1.08±0.02) (P=4×10⁻⁵ for difference).
In conclusion, ATM identifies disease subtypes with differential genome-wide and locus-specific genetic risk profiles.
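
The heterogeneity criterion stated in the abstract (≥500 incidences assigned to each of ≥2 topics) can be illustrated with a minimal sketch. This is not the authors' code: the function name, the input format (one hard `(disease, topic)` pair per diagnosis), the topic labels, and the toy counts are all assumptions made for illustration, and the toy example lowers the threshold so that it fits in a few lines.

```python
from collections import Counter


def heterogeneous_diseases(assignments, min_incidences=500, min_topics=2):
    """Return diseases with >= min_incidences diagnoses in each of >= min_topics topics.

    `assignments` is an iterable of (disease, topic) pairs, one per diagnosis.
    Hard topic assignment per diagnosis is an assumption of this sketch.
    """
    # Count diagnoses for every (disease, topic) combination.
    counts = Counter(assignments)

    # For each disease, count how many topics reach the incidence threshold.
    topics_passing = Counter()
    for (disease, _topic), n in counts.items():
        if n >= min_incidences:
            topics_passing[disease] += 1

    return sorted(d for d, k in topics_passing.items() if k >= min_topics)


# Made-up toy data; topic labels are hypothetical.
toy = (
    [("T2D", "cardiovascular")] * 3
    + [("T2D", "metabolic")] * 4
    + [("asthma", "respiratory")] * 5
    + [("asthma", "skin")] * 1
)

# Threshold lowered from 500 to 3 purely so the toy data can pass it.
result = heterogeneous_diseases(toy, min_incidences=3)
print(result)  # ['T2D']
```

With the default thresholds the same function applies the rule as worded in the abstract; how diagnoses are actually assigned to topics is determined by the ATM model and is not reproduced here.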

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

Funded by Wellcome (BST00080-H503.01 to XJ, 100956/Z/13/Z to GM, https://wellcome.org); the Li Ka Shing Foundation (to GM, https://www.lksf.org); NIH grants R01 HG006399, R01 MH101244, and R37 MH107649 (to ALP); The Alan Turing Institute (https://www.turing.ac.uk), Health Data Research UK (https://www.hdruk.ac.uk), the Medical Research Council UK (https://mrc.ukri.org), the Engineering and Physical Sciences Research Council (EPSRC, https://epsrc.ukri.org) through the Bayes4Health programme Grant EP/R018561/1, and AI for Science and Government UK Research and Innovation (UKRI, https://www.turing.ac.uk/research/asg) (to CH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study used ONLY openly available human data that were originally located at Wellcome Centre Human Genetics.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • * These authors jointly supervised the work

Data Availability

All data produced in the present work are contained in the manuscript.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted October 25, 2022.
Subject Area

  • Genetic and Genomic Medicine