medRxiv

Clinician-Informed Feature Engineering Improves Machine Learning Assignment of Molecular Endotypes in the Intensive Care Unit

Benjamin J Sines, Robert S Hagan, Xi Jiang, Ella Pavlechko, Scott McClain, Xin Hunt, Julia Florou-Moreno, Jake Acquadro, Gabriel Risa, Varun Valsaraj, Jonathan C Schisler, Matthew C Wolfgang
doi: https://doi.org/10.64898/2026.04.06.26350248
Benjamin J Sines, MD MSCR
1 Division of Pulmonary Critical Care Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
  • For correspondence: benjamin_sines{at}med.unc.edu
Robert S Hagan, MD PhD
1 Division of Pulmonary Critical Care Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
Xi Jiang, PhD
2 SAS Institute Inc, Cary, NC, United States
Ella Pavlechko, PhD
2 SAS Institute Inc, Cary, NC, United States
Scott McClain, PhD
2 SAS Institute Inc, Cary, NC, United States
Xin Hunt, PhD
2 SAS Institute Inc, Cary, NC, United States
Julia Florou-Moreno, PhD
2 SAS Institute Inc, Cary, NC, United States
Jake Acquadro
2 SAS Institute Inc, Cary, NC, United States
Gabriel Risa
2 SAS Institute Inc, Cary, NC, United States
Varun Valsaraj, MS
2 SAS Institute Inc, Cary, NC, United States
Jonathan C Schisler, MS PhD
3 McAllister Heart Institute and the Department of Pharmacology, The University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
Matthew C Wolfgang, PhD
4 Marsico Lung Institute and Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States

Abstract

Objective: To develop a workflow that transforms electronic health record (EHR) data into machine learning-ready features for molecular endotype assignment and to evaluate whether clinician-informed feature engineering improves model performance and interpretability.

Materials and Methods: We developed parallel clinician-informed and clinician-agnostic feature engineering pipelines to prepare raw EHR data from mechanically ventilated patients with respiratory failure. Molecular endotype labels derived from paired deep lung and blood profiling of subjects with acute lung injury were used to train candidate machine learning classifiers. Champion models from each pipeline were compared on predefined performance metrics.

Results: Bayesian network classifiers were the top-performing models in both pipelines. The clinician-informed pipeline generated fewer features than the clinician-agnostic pipeline (645 vs 1,127) and produced a lower misclassification rate in the final Bayesian network model (0.047 vs 0.14). In an independent cohort of subjects with acute lung injury, the clinician-informed model better distinguished corticosteroid-responsive from non-responsive subgroups.
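The comparison above can be illustrated with a minimal Python sketch. It assumes the standard definition of misclassification rate (the fraction of subjects whose predicted label differs from the reference label); the abstract does not state the exact definition used. The function names and the toy endotype labels "A" and "B" are hypothetical and are not study data. Only the feature counts (645 vs 1,127) and misclassification rates (0.047 vs 0.14) are taken from the abstract.

```python
from typing import Sequence


def misclassification_rate(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of subjects whose predicted label differs from the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative decrease from baseline to improved, as a fraction of baseline."""
    return (baseline - improved) / baseline


# Toy labels for illustration only (hypothetical, NOT study data).
reference = ["A", "A", "B", "B", "A"]
predicted = ["A", "B", "B", "B", "A"]
toy_rate = misclassification_rate(reference, predicted)  # one error in five subjects

# Figures reported in the abstract: clinician-agnostic vs clinician-informed.
feature_reduction = relative_reduction(1127, 645)
error_reduction = relative_reduction(0.14, 0.047)

print(f"toy misclassification rate: {toy_rate:.3f}")
print(f"relative reduction in feature count: {feature_reduction:.1%}")
print(f"relative reduction in misclassification rate: {error_reduction:.1%}")
```

Under these assumptions, the relative reductions simply restate the reported figures on a common scale; they are not additional results from the study.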

Discussion: Clinical context improved feature engineering efficiency, model interpretability, and classification performance. These findings support the integration of domain expertise into machine learning workflows intended for critical care implementation.

Conclusions: Clinician-informed feature engineering can simplify machine learning models while improving performance and preserving clinical relevance. AI tools developed for healthcare should incorporate subject matter expertise early in the feature engineering and analytic workflow.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This project was supported by the North Carolina Collaboratory at The University of North Carolina at Chapel Hill with funding appropriated by the North Carolina General Assembly (V.V. and M.C.W.) and by the Rapidly Emerging Antiviral Drug Development Initiative at the University of North Carolina at Chapel Hill with funding from the North Carolina Coronavirus State and Local Fiscal Recovery Funds program, appropriated by the North Carolina General Assembly (J.C.S., R.S.H., and M.C.W.).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study was approved by the institutional review board at the University of North Carolina at Chapel Hill with a waiver of informed consent (IRB 22-3196, January 11, 2023).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • Conflict of Interest: The authors have declared that no conflict of interest exists.
  • Key Words: Machine Learning, Clinical Prediction, ARDS, Critical Care

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted April 07, 2026.
Subject Area

  • Intensive Care and Critical Care Medicine