medRxiv

Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare

Lin Lawrence Guo, Keith E. Morse, Catherine Aftandilian, Ethan Steinberg, Jason Fries, Jose Posada, Scott Lanyon Fleming, Joshua Lemmon, Karim Jessa, Nigam Shah, Lillian Sung
doi: https://doi.org/10.1101/2023.03.14.23287202
Lin Lawrence Guo
1Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON
Keith E. Morse
2Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA
Catherine Aftandilian
3Division of Hematology/Oncology, Department of Pediatrics, Stanford University, Palo Alto, CA
Ethan Steinberg
4Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA
Jason Fries
4Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA
Jose Posada
5Universidad del Norte, Colombia
Scott Lanyon Fleming
4Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA
Joshua Lemmon
1Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON
Karim Jessa
6Information Services, The Hospital for Sick Children, Toronto, ON
Nigam Shah
4Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA
Lillian Sung
1Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON
7Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, ON
  • For correspondence: Lillian.sung@sickkids.ca

ABSTRACT

Importance Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored.

Objective The primary objective was to describe lab- and diagnosis-based labels for seven selected outcomes at three institutions. Secondary objectives were to describe the agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels.

Methods This study included three cohorts: SickKidsPeds from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate and severe) based on test results and one diagnosis-based label. The proportion of admissions with a positive label was presented for each outcome, stratified by cohort. Using lab-based labels as the gold standard, we calculated agreement (Cohen’s Kappa), sensitivity and specificity of the diagnosis-based label at each lab-based severity level.
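As an illustration of the comparison described in the Methods, the following is a minimal sketch of how Cohen’s Kappa, sensitivity and specificity could be computed for one outcome at one severity level, treating the lab-based label as the reference. The function name `agreement_metrics` and the toy label vectors are hypothetical and are not taken from the study; this is not the authors' code.

```python
from typing import Dict, Sequence


def agreement_metrics(lab: Sequence[int], dx: Sequence[int]) -> Dict[str, float]:
    """Score diagnosis-based labels (dx) against lab-based reference labels (lab).

    Both inputs are binary (0/1) labels, one entry per admission.
    """
    if len(lab) != len(dx):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(lab, dx))
    tp = sum(1 for l, d in pairs if l == 1 and d == 1)  # lab positive, coded
    fn = sum(1 for l, d in pairs if l == 1 and d == 0)  # lab positive, not coded
    fp = sum(1 for l, d in pairs if l == 0 and d == 1)  # lab negative, coded
    tn = sum(1 for l, d in pairs if l == 0 and d == 0)  # lab negative, not coded
    n = tp + fn + fp + tn

    def ratio(num: float, den: float) -> float:
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_observed = ratio(tp + tn, n)
    p_expected = ratio((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn), n * n)
    kappa = ratio(p_observed - p_expected, 1 - p_expected)

    return {"sensitivity": sensitivity, "specificity": specificity, "kappa": kappa}


# Hypothetical toy data: 10 admissions, NOT values from the study.
lab_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
dx_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(agreement_metrics(lab_labels, dx_labels))
```

Under this setup, low sensitivity with high specificity would indicate that many admissions with abnormal test results carry no corresponding diagnosis code.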

Results The cohorts included the following numbers of admissions: SickKidsPeds (n=59,298), StanfordPeds (n=24,639) and StanfordAdults (n=159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds than for SickKidsPeds across all outcomes, with odds ratios (99.9% confidence interval) for the abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar between institutions. When using lab-based labels as the gold standard, Cohen’s Kappa and sensitivity were lower at SickKidsPeds than at StanfordPeds for all severity levels.
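For readers unfamiliar with the odds ratios reported above, the sketch below shows one common way to compute an odds ratio with a 99.9% confidence interval from a 2x2 table of label counts. It assumes a Wald interval on the log odds ratio; the abstract does not state which interval method the authors used, and the function name `odds_ratio_ci` and the counts are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int,
                  level: float = 0.999) -> Tuple[float, float, float]:
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a, b: admissions with / without a positive label in cohort 1
    c, d: admissions with / without a positive label in cohort 2
    All four counts must be positive for the Wald interval to be defined.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")

    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation (an assumption here).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Two-sided critical value, about 3.29 for a 99.9% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts, NOT taken from the study.
or_, lo, hi = odds_ratio_ci(a=40, b=960, c=20, d=980)
print(f"OR = {or_:.2f} (99.9% CI {lo:.2f}-{hi:.2f})")
```

An interval that excludes 1 at this level corresponds to a difference in label proportions that is significant at the 0.001 level under the same approximation.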

Conclusions Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The use of data from The Hospital for Sick Children was approved as a quality improvement project at SickKids; thus, the requirements for Research Ethics Board approval and informed consent were waived by The Hospital for Sick Children. The data from Stanford Medicine were de-identified, with protected health information redacted. Because of de-identification, the requirements for Institutional Review Board approval and informed consent were waived by Stanford Medicine.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The data are available from the corresponding author upon reasonable request.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted March 17, 2023.
Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare
Lin Lawrence Guo, Keith E. Morse, Catherine Aftandilian, Ethan Steinberg, Jason Fries, Jose Posada, Scott Lanyon Fleming, Joshua Lemmon, Karim Jessa, Nigam Shah, Lillian Sung
medRxiv 2023.03.14.23287202; doi: https://doi.org/10.1101/2023.03.14.23287202

Subject Area

  • Health Informatics