medRxiv

Identifying and Characterizing Gallstone Disease from Clinical Narratives with Zero-shot Learning and Automated Prompt Optimization

Sy Hwang, Agnes Wang, Sunil Thomas, Ashley Batugo, David E Kaplan, Daniel J Rader, Danielle Mowery, Junghyun Lim
doi: https://doi.org/10.64898/2026.01.29.26345132
Sy Hwang
1 Departments of Genomics & Computational Biology, University of Pennsylvania, Philadelphia, PA, USA
Agnes Wang
6 Department of Computation and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, USA
Sunil Thomas
5 Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Ashley Batugo
5 Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
David E Kaplan
2 Departments of Gastroenterology, University of Pennsylvania, Philadelphia, PA, USA
Daniel J Rader
3 Departments of Genetics, University of Pennsylvania, Philadelphia, PA, USA
Danielle Mowery
4 Departments of Biostatistics, Epidemiology, & Informatics, University of Pennsylvania, Philadelphia, PA, USA
For correspondence: dlmowery{at}pennmedicine.upenn.edu
Junghyun Lim
2 Departments of Gastroenterology, University of Pennsylvania, Philadelphia, PA, USA

Abstract

We built and evaluated a zero-shot LLM pipeline with automated, task-aware prompt optimization to extract radiology and symptom fields for gallstone phenotyping from de-identified EHR text. Across symptomatic, asymptomatic, and control cohorts, it performed reliably on high-signal binary fields and symptom flags but lagged on fine-grained stone burden and complications, establishing a practical baseline and motivating targeted refinements.

Introduction

Gallstone disease is highly prevalent and a cause of significant morbidity in the United States.1 Distinguishing true cases from true controls is often challenging because asymptomatic and symptomatic gallstone disease are inconsistently translated into International Classification of Diseases (ICD)-9/10 codes.2 In clinical practice, asymptomatic gallstones are often found incidentally on imaging (e.g., ultrasonography [US] and computed tomography [CT]) and are frequently overlooked or not entered as ICD-9/10 billing codes. Relying solely on ICD-9/10 codes may therefore misclassify asymptomatic gallstone cases as controls. Reducing these inaccuracies requires using both structured and unstructured data to identify gallstone cases. To our knowledge, no prior study has published a validation, in Epic Clarity, of a gallstone disease phenotype algorithm that uses a large language model (LLM).

Methods

Data

We developed a phenotype algorithm for gallstone disease that combines structured and unstructured data from Epic Clarity with an LLM. We constructed a gallstone disease data mart of putative cases. Inclusion criteria were adults (≥18 years old) with at least one abdominal imaging study (US, CT, or magnetic resonance imaging [MRI]) from 1/1/2024 to 1/1/2025. Using ICD and CPT codes, we categorized these individuals into a symptomatic gallstone disease group (SGD), an asymptomatic gallstone disease group (AGD), or a no gallstone disease group (NGD) according to the following criteria. For SGD, we included those with ICD codes for cholelithiasis, choledocholithiasis, cholecystitis, and gallstone pancreatitis, and CPT codes for cholecystectomy and endoscopic retrograde cholangiopancreatography (ERCP), entered after the date of imaging. For AGD, we included those with ICD codes for cholelithiasis and choledocholithiasis entered after the date of imaging, and excluded those with ICD codes for cholecystitis or gallstone pancreatitis (which reflect SGD) and CPT codes for cholecystectomy or ERCP. For NGD, we excluded those with ICD codes for cholelithiasis, choledocholithiasis, cholecystitis, and gallstone pancreatitis, and CPT codes for cholecystectomy and ERCP. Across all groups, we excluded individuals with ICD codes for acalculous cholecystitis, gallbladder polyps, biliary malignancy, pancreatic malignancy, neoplasm of the ampulla of Vater, and sickle cell disease. We additionally excluded individuals with CPT codes for Whipple or liver transplant surgery and for ERCP with non-gallstone-related procedures.
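The code-based grouping above can be sketched as a small rule function. This is an illustration under stated assumptions, not the authors' implementation: the category labels are hypothetical stand-ins for the actual ICD/CPT code lists (which the text does not enumerate), the input is assumed to contain only codes entered after the imaging date, and SGD is read here as "any symptomatic diagnosis or gallstone procedure", which is one possible interpretation of the criteria.

```python
from typing import Optional, Set

# Hypothetical category labels; the real ICD/CPT code lists are not given in the text.
STONE_DX = {"cholelithiasis", "choledocholithiasis"}
SYMPTOMATIC_DX = {"cholecystitis", "gallstone_pancreatitis"}
PROCEDURES = {"cholecystectomy", "ercp"}
GLOBAL_EXCLUSIONS = {
    "acalculous_cholecystitis", "gallbladder_polyps", "biliary_malignancy",
    "pancreatic_malignancy", "ampulla_neoplasm", "sickle_cell_disease",
    "whipple", "liver_transplant", "ercp_non_gallstone",
}


def assign_group(categories: Set[str]) -> Optional[str]:
    """Map a patient's code categories to SGD, AGD, NGD, or None (excluded).

    Assumes `categories` holds only codes entered after the imaging date.
    """
    if categories & GLOBAL_EXCLUSIONS:
        return None  # excluded from every group
    if categories & (SYMPTOMATIC_DX | PROCEDURES):
        return "SGD"  # assumed reading: any symptomatic marker implies SGD
    if categories & STONE_DX:
        return "AGD"  # stone diagnosis without symptomatic markers
    return "NGD"      # no gallstone-related codes at all
```

For example, a patient carrying only a cholelithiasis code would fall into AGD under this reading, while adding a cholecystectomy code would move the same patient to SGD.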

After categorizing individuals into the three groups, we randomly sampled 100 patients from SGD, 50 from AGD, and 50 from NGD. We then collected structured data, including demographics (age, race, sex) and ICD and CPT codes. We also collected unstructured data, consisting of free text from de-identified clinical notes, radiology reports, and pathology reports. Specifically, we collected clinical notes (going back 1 year from the latest procedure date for SGD, and from the latest abdominal imaging for AGD and NGD), radiology reports of abdominal imaging from 1/1/2024 to 1/1/2025, and pathology reports from cholecystectomies that occurred after abdominal imaging. Among the randomly sampled patients, 100 SGD, 47 AGD, and 49 NGD patients had notes and reports available for collection. All extracted variables were stored and analyzed on our HIPAA-compliant, Penn Medicine-managed server and project workspace. For these patients, a clinician with training in gastroenterology performed manual chart review, extracted data on symptoms, radiological evidence, and pathology descriptions of gallstone disease, and verified the SGD, AGD, and NGD group assignments. After chart review, 118 patients were verified as SGD, 26 as AGD, and 52 as NGD.
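The per-group random sampling described above could look roughly like the following. It is a minimal sketch: the function name, the patient identifiers, and the fixed seed are assumptions for illustration, and the text does not say how the sampling was actually performed.

```python
import random
from typing import Dict, List


def sample_by_group(
    ids_by_group: Dict[str, List[str]],
    sizes: Dict[str, int],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Draw a simple random sample without replacement from each group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = {}
    for group, n in sizes.items():
        pool = sorted(ids_by_group[group])  # stable order before sampling
        sampled[group] = rng.sample(pool, min(n, len(pool)))
    return sampled


# Target sizes taken from the text; the patient IDs would come from the data mart.
TARGET_SIZES = {"SGD": 100, "AGD": 50, "NGD": 50}
```

Sorting each pool before sampling keeps the result independent of the order in which identifiers were retrieved.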

Prompting

We framed the study as a zero-shot instruction task. For each patient, the LLM was provided with the patient's de-identified notes within the study window, ordered chronologically. The prompt instructed the model to base all judgments only on the supplied corpus. For radiology, the model was tasked with capturing the presence or absence of gallstones, the gallbladder, and gallbladder sludge. The model was also instructed to extract potential complications, including cholecystitis and extrahepatic duct dilation, as well as stone burden. For symptoms, the model was tasked with recording an overall symptom flag along with structured pain attributes such as location, radiation, character, onset, frequency, and relation to food.
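To make the zero-shot setup concrete, the sketch below assembles a prompt from chronologically ordered notes and a schema of enumerated fields. The field names, the allowed values, the `Note` type, and the instruction wording are all hypothetical; the schema covers only a subset of the fields described above, and the study's actual prompt is not reproduced in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class Note:
    note_date: date
    note_type: str
    text: str


# Hypothetical field names and allowed values (binary fields only).
SCHEMA: Dict[str, List[str]] = {
    "gallbladder_present": ["Yes", "No", "NA"],
    "gallstones_present": ["Yes", "No", "NA"],
    "sludge_present": ["Yes", "No", "NA"],
    "cholecystitis": ["Yes", "No", "NA"],
    "extrahepatic_duct_dilation": ["Yes", "No", "NA"],
    "symptomatic": ["Yes", "No", "NA"],
}


def build_prompt(notes: List[Note]) -> str:
    """Build a zero-shot prompt: instructions, field list, then dated notes."""
    ordered = sorted(notes, key=lambda n: n.note_date)  # chronological order
    corpus = "\n\n".join(
        f"[{n.note_date.isoformat()}] {n.note_type}\n{n.text}" for n in ordered
    )
    fields = "\n".join(
        f"- {name}: one of {', '.join(values)}" for name, values in SCHEMA.items()
    )
    return (
        "Base every judgment only on the notes below.\n"
        "Return one value per field:\n"
        f"{fields}\n\n"
        f"NOTES:\n{corpus}"
    )
```

Listing the allowed values next to each field is one way to steer the model toward strictly enumerated labels.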

To improve instruction quality and task adherence, we performed task-aware automated prompt optimization,3 which involves two iterative processes. First, the system generates multiple candidate instruction prompts from carefully curated initial prompts, producing variations in style, specificity, and phrasing. Second, a critic phase scores each candidate prompt and provides targeted feedback based on classification performance. This process uses lightweight chain-of-thought reasoning steps to guide a consistent mapping from text spans to categories. Candidates are then re-scored according to a selected task metric, and the best instruction prompts are retained. The process iterates until performance plateaus or a maximum budget is reached, whichever comes first, and yields a final prompt that is faithful to the constraints and robust to variability in the notes.
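The generate, score, and retain loop described above can be illustrated with a generic skeleton. This is not the PromptWizard API and not the authors' code: `mutate` and `score` are hypothetical callables standing in for LLM-driven candidate generation (including critic feedback) and for evaluation against a task metric, and `rounds`, `keep`, and `patience` are assumed names for the budget, the number of retained prompts, and the plateau criterion.

```python
from typing import Callable, List, Tuple


def optimize_prompt(
    seeds: List[str],
    mutate: Callable[[str], List[str]],
    score: Callable[[str], float],
    rounds: int = 5,
    keep: int = 2,
    patience: int = 2,
) -> Tuple[str, float]:
    """Iteratively expand and prune a pool of prompts (requires >= 1 seed).

    Stops when the round budget is spent or when the best score has not
    improved for `patience` consecutive rounds.
    """
    pool = sorted(((score(p), p) for p in seeds), reverse=True)[:keep]
    best_score, best_prompt = pool[0]
    stale = 0
    for _ in range(rounds):
        # Generate variations of every retained prompt.
        candidates = [c for _, p in pool for c in mutate(p)]
        # Score the new candidates and keep the top `keep` overall.
        scored = pool + [(score(c), c) for c in candidates]
        pool = sorted(set(scored), reverse=True)[:keep]
        if pool[0][0] > best_score:
            best_score, best_prompt = pool[0]
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # performance has plateaued
    return best_prompt, best_score
```

In a real pipeline, `score` would run each candidate prompt on a labeled development set and return the chosen metric, which makes each round far more expensive than this skeleton suggests.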

For model selection, we evaluated several instruction-tuned models and chose OpenAI's GPT-4.1 because it delivered the best balance of latency, cost, and label fidelity for our classification tasks. It produced well-formed outputs that required fewer post-hoc normalization corrections, and its context window was large enough for the amount of input text we needed to provide. We set the sampling temperature to 0.1 to minimize stochastic variability, which proved critical because our outputs had to contain strictly enumerated label values. This configuration gave us adequate sensitivity while reducing spurious misclassification.
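Post-hoc normalization of enumerated labels, as mentioned above, might be sketched as follows. The text does not state the output format, so a JSON object is assumed here; the function name, the field name in the usage note, and the `"NA"` fallback are likewise illustrative assumptions.

```python
import json
from typing import Dict, List


def normalize_output(
    raw: str,
    allowed: Dict[str, List[str]],
    default: str = "NA",
) -> Dict[str, str]:
    """Coerce a model response into one allowed label per field.

    Assumes the model was asked for a JSON object; anything unparsable,
    missing, or outside the enumeration falls back to `default`.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    result = {}
    for field, values in allowed.items():
        lookup = {v.lower(): v for v in values}  # case-insensitive match
        value = str(parsed.get(field, default)).strip().lower()
        result[field] = lookup.get(value, default)
    return result
```

With `allowed = {"gallstones_present": ["Yes", "No", "NA"]}`, a response of `{"gallstones_present": " yes "}` is normalized to `Yes`, while an off-schema value such as `maybe` falls back to `NA`.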

Results

The model was reliable on high-signal binary fields (presence of gallbladder F1≈0.90–0.93; presence of gallstones F1≈0.70–0.77) and on symptom presence when clearly documented (F1≈0.62–0.84 across cohorts), but it underperformed on fine-grained attributes such as stone count, size, and location (F1≈0.44–0.59) and on complications such as cholecystitis, particularly in the asymptomatic cohort (F1≈0.40–0.62) (see Appendix). Error patterns reflect narrative heterogeneity and underspecification. Conservative defaults ("No/NA") likely improved precision at the expense of recall. These results provide a serviceable baseline and point to clear avenues for improvement.
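For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below computes it for a single binary field and shows how a conservative "NA" default can keep precision high while lowering recall. The labels are made up for illustration, and treating "Yes" as the positive class is an assumption; the text does not specify how per-field F1 was computed or averaged.

```python
from typing import List, Tuple


def precision_recall_f1(
    gold: List[str],
    pred: List[str],
    positive: str = "Yes",
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one field, with `positive` as the target class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: the model defaults to "NA" on two true positives.
gold = ["Yes", "Yes", "Yes", "No"]
pred = ["Yes", "NA", "NA", "No"]
precision, recall, f1 = precision_recall_f1(gold, pred)
# precision = 1.0, recall = 1/3, F1 = 0.5
```

Here the single positive prediction is correct, so precision is perfect, but two of three true positives are missed, which pulls F1 down to 0.5.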

Discussion and Conclusions

Zero-shot, schema-constrained prompting produced a workable baseline for gallstone phenotyping with minimal engineering. It is sufficient for coarse cohort characterization and triage but remains limited on fine-grained attributes and complications because of narrative variability and temporal ambiguity. Next steps include post-hoc deterministic checks, proxy labels, few-shot exemplars, retrieval augmentation, and fine-tuning.

Data Availability

The data produced by this study are not openly available due to the sensitivity of the dataset.

Appendix

[Table available in the original preprint; not reproduced here.]

References

  1. Shaffer EA. Epidemiology of gallbladder stone disease. Best Pract Res Clin Gastroenterol. 2006;20(6):981–996. doi:10.1016/j.bpg.2006.05.004
  2. Callahan A, Shah NH, Chen JH. Research and Reporting Considerations for Observational Studies Using Electronic Health Record Data. Ann Intern Med. 2020;172(11 Suppl):S79–S84. doi:10.7326/M19-0873
  3. Agarwal E, Singh J, Dani V, Magazine R, Ganu T, Nambi A. PromptWizard: task-aware prompt optimization framework. arXiv. 2024. https://arxiv.org/abs/2405.18369
Posted January 30, 2026.

Subject Area

  • Health Informatics