medRxiv

Automated Transformation of Unstructured Cardiovascular Diagnostic Reports into Structured Datasets Using Sequentially Deployed Large Language Models

Sumukh Vasisht Shankar, Lovedeep S Dhingra, Arya Aminorroaya, Philip Adejumo, Girish N Nadkarni, Hua Xu, Cynthia Brandt, Evangelos K Oikonomou, Aline F Pedroso, Rohan Khera
doi: https://doi.org/10.1101/2024.10.08.24315035
Sumukh Vasisht Shankar
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
MS
Lovedeep S Dhingra
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
MBBS
Arya Aminorroaya
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
MD, MPH
Philip Adejumo
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
BS
Girish N Nadkarni
2The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
MD, MPH
Hua Xu
5Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT
PhD
Cynthia Brandt
5Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT
MD, MPH
Evangelos K Oikonomou
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
MD, DPhil
Aline F Pedroso
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
PhD
Rohan Khera
1Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
3Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
4Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, USA
5Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT
MD, MS
  • For correspondence: rohan.khera{at}yale.edu

ABSTRACT

Background Rich data in cardiovascular diagnostic testing are often sequestered in unstructured reports, and the need for manual abstraction limits their use in real-time patient care and research applications.

Methods We developed a two-step process that sequentially deploys generative and interpretative large language models (LLMs; Llama2 70b and Llama2 13b). Using a Llama2 70b model, we generated varying formats of transthoracic echocardiogram (TTE) reports from 3,000 real-world echo reports with paired structured elements, leveraging temporal changes in reporting formats to define the variations. Subsequently, we fine-tuned Llama2 13b on sequentially larger batches of the generated echo reports to extract data from free-text narratives across 18 clinically relevant echocardiographic fields. This was set up as a prompt-based supervised training task. We evaluated the fine-tuned Llama2 13b model, HeartDx-LM, on three distinct echocardiographic datasets: (i) reports across the different time periods and formats at Yale New Haven Health System (YNHHS), (ii) the Medical Information Mart for Intensive Care (MIMIC)-III dataset, and (iii) the MIMIC-IV dataset. We used the accuracy of extracted fields and Cohen’s kappa as the metrics and have publicly released the HeartDx-LM model.
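To make the "prompt-based supervised training task" concrete, the sketch below shows one way a generated report and its paired structured labels could be turned into a prompt/completion record for fine-tuning. It is an illustration under stated assumptions, not the authors' pipeline: the field names in `FIELDS`, the prompt wording, the "not reported" placeholder, and the `build_training_example` helper are all hypothetical, since the abstract does not specify the 18 fields or the prompt template.

```python
import json

# Hypothetical field names for illustration only; the abstract does not
# enumerate the 18 echocardiographic fields used in the study.
FIELDS = ["lvef_percent", "lv_hypertrophy", "aortic_stenosis_severity"]


def build_training_example(report_text, structured):
    """Pair one free-text report with its structured labels.

    Returns a dict with a 'prompt' (instruction plus report) and a
    'completion' (the labels serialized as JSON), which is a common
    shape for supervised fine-tuning data.
    """
    prompt = (
        "Extract the following fields from the echocardiogram report "
        "and answer as JSON with exactly these keys: "
        + ", ".join(FIELDS)
        + ".\n\nReport:\n"
        + report_text.strip()
        + "\n\nAnswer:\n"
    )
    # Assumption: fields absent from the labels are written out explicitly
    # so the model is shown how to signal a missing value.
    target = {name: structured.get(name, "not reported") for name in FIELDS}
    return {"prompt": prompt, "completion": json.dumps(target, sort_keys=True)}


example = build_training_example(
    "LV systolic function is mildly reduced, EF 45%. No aortic stenosis.",
    {"lvef_percent": 45, "aortic_stenosis_severity": "none"},
)
print(example["prompt"])
print(example["completion"])
```

Serializing the target as JSON with a fixed key set keeps the model output machine-parseable, which is what allows extracted values to be compared field by field against reference labels.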

Results The HeartDx-LM model was trained on 2,000 randomly selected synthetic echo reports with varying formats and paired structured labels, covering a wide range of clinical findings. We identified a lower threshold of 500 annotated reports required for fine-tuning Llama2 13b to achieve stable and consistent performance. At YNHHS, in the test set of contemporary records where paired structured data were also available, HeartDx-LM accurately extracted 69,144 of 70,032 values (98.7%) across 18 clinical fields from unstructured reports. In older echo reports where only unstructured reports were available, the model achieved 87.1% accuracy against expert annotations for the same 18 fields in a random sample of 100 reports. Similarly, in external validation sets of 100 randomly chosen, expert-annotated echo reports each from MIMIC-IV and MIMIC-III, HeartDx-LM correctly extracted 201 of 220 available values (91.3%) and 615 of 707 available values (87.9%), respectively.
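The two reported metrics, accuracy of extracted values and Cohen's kappa, can be computed as in the minimal sketch below. The function names and the toy severity labels are assumptions made for illustration; the abstract does not describe the authors' evaluation code or how kappa was aggregated across fields. The only figures taken from the text are the YNHHS counts (69,144 of 70,032).

```python
from collections import Counter


def accuracy(pred, truth):
    """Fraction of extracted values that exactly match the reference."""
    if not truth or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and equal length")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def cohens_kappa(pred, truth):
    """Chance-corrected agreement between model output and reference."""
    n = len(truth)
    observed = accuracy(pred, truth)
    pred_counts, truth_counts = Counter(pred), Counter(truth)
    # Agreement expected by chance if the two label sets were independent.
    expected = sum(pred_counts[c] * truth_counts[c] for c in truth_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides use a single identical label
    return (observed - expected) / (1 - expected)


# Toy labels for one hypothetical categorical field (not study data).
truth = ["none", "mild", "mild", "severe", "none", "none"]
pred = ["none", "mild", "none", "severe", "none", "none"]
print(round(accuracy(pred, truth), 3))
print(round(cohens_kappa(pred, truth), 3))

# Pooled accuracy from the YNHHS counts reported above.
print(round(100 * 69144 / 70032, 1))
```

Kappa is reported alongside accuracy because many echocardiographic fields are dominated by a single value (for example, "none" for a valve lesion), so raw agreement alone can look high even for an extractor that rarely identifies the less common findings.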

Conclusion We developed a novel method using paired large and moderate-sized LLMs to automate the extraction of data from unstructured echocardiographic reports into tabular datasets. Our approach represents a scalable strategy that transforms unstructured reports into computable elements that can be leveraged to improve cardiovascular care quality and enable research.

Competing Interest Statement

Dr. Khera is an Associate Editor of JAMA and is a co-founder of Ensight-AI. Dr. Khera receives support from the National Heart, Lung, and Blood Institute of the National Institutes of Health (under awards R01HL167858 and K23HL153775) and the Doris Duke Charitable Foundation (under award 2022060). He receives support from the Blavatnik Foundation through the Blavatnik Fund for Innovation at Yale. He also receives research support, through Yale, from Bristol-Myers Squibb, BridgeBio, and Novo Nordisk. Dr. Khera is a co-inventor of U.S. Pending Patent Applications WO2023230345A1, US20220336048A1, 63/484,426, 63/508,315, 63/580,137, 63/619,241, 63/346,610, and 63/562,335. Dr. Khera, Dr. Oikonomou, and Mr. Vasisht Shankar are co-inventors of the U.S. patent application 63/606,203. Dr. Khera and Dr. Oikonomou are co-founders of Evidence2Health, a precision health platform to improve evidence-based cardiovascular care. Dr. Oikonomou receives support from the National Heart, Lung, and Blood Institute of the National Institutes of Health (under award F32HL170592). He is a co-inventor of the U.S. Patent Applications 18/813,882, 17/720,068, 63/619,241, 63/177,117, 63/580,137, 63/606,203, 63/562,335, US11948230B2, and US20210374951A1. He has been a consultant for Caristo Diagnostics Ltd and Ensight-AI Inc, and has received royalty fees from technology licensed through the University of Oxford. Mr. Vasisht Shankar works as a data scientist at Evidence2Health (outside the current work). Dr. Nadkarni is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, and Pensieve. He also has equity in Renalytix, Pensieve, and Verici.

Funding Statement

Dr. Khera was supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health (under awards R01AG089981, R01HL167858, and K23HL153775) and the Doris Duke Charitable Foundation (under award 2022060). Dr. Oikonomou was supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health (under award F32HL170592). The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study was reviewed by the Yale Institutional Review Board, which waived the need for informed consent, as it represents a secondary analysis of existing data.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted October 08, 2024.

Subject Area

  • Health Informatics