Integrating a host transcriptomic biomarker with a large language model for diagnosis of lower respiratory tract infection

Hoang Van Phan; Natasha Spottiswoode; Emily C. Lydon; Victoria T. Chu; Adolfo Cuesta; Alexander D. Kazberouk; Natalie L. Richmond; Padmini Deosthale; Carolyn S. Calfee; Charles R. Langelier

doi:10.1101/2024.08.28.24312732

Abstract

BACKGROUND Lower respiratory tract infections (LRTIs) are a leading cause of mortality worldwide and can be difficult to diagnose in critically ill patients, as non-infectious causes of respiratory failure can present with similar clinical features.

METHODS We developed an LRTI diagnostic method combining the pulmonary transcriptomic biomarker FABP4 with electronic medical record (EMR) text assessment using the large language model Generative Pre-trained Transformer 4 (GPT-4). We evaluated this approach in a prospective cohort of critically ill adults with acute respiratory failure from whom tracheal aspirate FABP4 expression was measured by RNA sequencing. Patients with LRTI or non-infectious conditions were identified using retrospective, multi-physician clinical adjudication. We then confirmed our findings by applying this method to an independent validation cohort of 115 adults with acute respiratory failure.

RESULTS In the derivation cohort, a combined classifier incorporating FABP4 expression and GPT-4–assisted EMR analysis achieved an AUC of 0.93 (±0.08) and an accuracy of 84%, outperforming FABP4 expression alone (AUC 0.84 ± 0.11) and GPT-4–based analysis alone (AUC 0.83 ± 0.07). By comparison, the primary medical team’s admission diagnosis had an accuracy of 72%. In the validation cohort, the combined classifier yielded an AUC of 0.98 (±0.04) and an accuracy of 96%.
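For readers less familiar with the two metrics reported above, the following is a minimal sketch of how an AUC and an accuracy can be computed from classifier scores. It is illustrative only and is not the authors' code: the function names `roc_auc` and `accuracy`, the 0.5 decision threshold, and the toy labels and scores are all assumptions invented for this example, not values from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of cases classified correctly when scores at or above the
    threshold are called positive. The 0.5 threshold is an assumption."""
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


# Toy data invented for illustration (1 = LRTI, 0 = non-infectious).
# These are NOT values from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

if __name__ == "__main__":
    print(f"AUC      = {roc_auc(toy_labels, toy_scores):.3f}")
    print(f"accuracy = {accuracy(toy_labels, toy_scores):.3f}")
```

The AUC depends only on how the scores rank cases, whereas accuracy depends on the chosen threshold, which is why a classifier can have a high AUC and a more modest accuracy.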

CONCLUSIONS Integrating a host transcriptional biomarker with EMR text analysis using a large language model may offer a promising new approach to improving the diagnosis of LRTIs in critically ill adults.

Description We present the novel use of a host transcriptional biomarker combined with artificial intelligence analysis of electronic medical record data to diagnose lower respiratory tract infections in a derivation cohort of critically ill adults, then the validation of this approach in a second, fully independent, cohort. This approach demonstrated high diagnostic accuracy compared to a gold standard of post-hoc multi-physician adjudication.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

5R01HL155418 (CRL); Chan Zuckerberg Biohub San Francisco (CRL); R35HL140026 (CSC)

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The cohort was approved by the University of California Institutional Review Board (protocol #10-02701) and informed consent was obtained from patients or surrogate decision makers.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

Addition of a validation cohort to confirm results.

Data Sharing Statement

The gene count data, code, and required source data are available at https://github.com/infectiousdisease-langelier-lab/LRTI_FABP4_GPT4_classifier.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.