Structured Abstract
Introduction Large-language models can help extract information from clinical notes, making them potentially useful for research in ulcerative colitis. However, it remains unclear if these models will scale well in practice.
Methods We analyzed the performance and cost of programmatically using GPT-4 to abstract Mayo endoscopic subscores (MES) from 499 colonoscopy reports using different prompting strategies.
Results Zero-shot prompting, where GPT-4 is instructed without examples, was most accurate (83.55%) and cost-effective ($0.097/note).
Discussion Using GPT-4 to automatically curate the MES and other variables is a practical strategy for quantifying UC activity and measuring improvements to clinical care.
Introduction
The Mayo endoscopic subscore (MES) is a core measure of ulcerative colitis (UC) activity (1), but is not always explicitly documented in colonoscopy reports. Thus, clinical studies that use the MES frequently require manual review of these reports to abstract these scores from free-text descriptions. Large language models like GPT-4 have shown promise in their ability to extract information from clinical notes Prior studies have used the more user friendly, chatbot interface to interact with these models. However, these models can also be used in a programmatic fashion, raising the possibility of being natively deployed within electronic health record (EHR) systems to dynamically maintain disease registries, optimize study recruitment, and support quality improvement.
As a next step, we studied the scalability and cost-effectiveness of using GPT-4 to automate the extraction of the modified MES. We hypothesized that more sophisticated, n-shot and iterative-style prompts would yield more accurate results despite higher costs associated with this strategy.
Methods
We utilized an existing set of 499 annotated colonoscopy reports sourced from two hospitals in California 217 were from San Francisco General Hospital (SFGH), a safety net hospital, and 282 from the University of California, San Francisco Health, a tertiary care hospital. These reports were annotated based on 1) their suitability for MES scoring (e.g. clear diagnosis of UC, surgically unaltered anatomy), and, 2) the modified MES (4) if appropriate.
We developed two generic conversation templates for zero-shot and n-shot prompting to programmatically interact with GPT-4-turbo via LangChain (5), a framework that enables context-rich prompts, and UCSF Versa, a PHI-compliant programmatic interface with GPT-4. (See Table 1, Supplemental Digital Content 1, for precise prompt templates and protocol texts.) We refer to n-shot prompting as providing n colonoscopy reports per Mayo score plus a non-Mayo scorable report in addition to the scoring protocol; zero shot refers to providing only the scoring protocol. For these templates we also studied the performance of GPT-4 with prompts that asked it to not only include the MES, but an explanation as well. We also provided GPT-4 with a parsed variation of colonoscopy report for UCSF and SFGH centers where only the main relevant text of the colonoscopy procedure report was provided as opposed to the colonoscopy report text in its entirety, which includes extraneous text strings. See Method Details, Supplemental Digital Content 1, for additional explanation.
Results
Zero-shot prompts on trimmed notes produced the best performing results consistently on both UCSF (81.45-83.55% weighted average accuracy) and SFGH (73.15-78.11% weighted average accuracy) reports (Table 1). We found that that n-shot prompting actually decreased classification performance across multiple metrics, rejecting our hypothesis that providing examples would improve GPT-4’s performance. Further, the cost of n-shot prompting is prohibitively more expensive on average (e.g., more than 8 times the cost for UCSF reports and 11 times for SFGH reports between n-shot and zero-shot prompting per note). For Mayo scorable accuracy, whether an MES can be assigned to the colonoscopy report, we find that GPT-4 performs very well for zero-shot prompt templates (accuracy 90.28-93.66%), where n-shot prompting reduced its accuracy (best score at 84.28%). We also studied results for splitting Mayo scorable reports and MES separately (“zero-shot, two-task prompting”) but performance gains were negligible (Table 1).
With respect to prompt variations such as parsing colonoscopy reports and soliciting explanations for MES values, across all strata we find that the greatest difference in performance is 4.02% for trimming and -3.31% for requiring an explanation (Table 2). In particular, parsing the text generally increases performance across all measures and decreases under classification as well. Interestingly, we find that across standard statistical learning metrics, the UCSF data shows an increase in performance when requiring explanation of the MES and a decrease in performance for SFGH data although the magnitude of these differences is minimal (worst difference in magnitude amongst accuracy, precision, recall and F1-score is 2.76%).
Discussion
This is among the first few studies to use an LLM in a programmatic fashion to extract study variables from clinical notes. We found that the most accurate and scalable prompting strategy is conveniently the most simple when it comes to producing MES scores from colonoscopy reports. Zero-shot prompts are not only easy to implement, but cost effective. GPT-4 proves itself to be reasonably effective at being able to simultaneously determine whether a colonoscopy procedure report is Mayo scorable, and providing an MES when it is. Further, we find that n-shot prompting is unreliable both in performance and cost.
Beyond template variations of zero-shot, n-shot, and zero-shot, two-task prompting, our study explores prompt parameter interactions in GPT-4 performance that are currently absent in the literature (e.g., explanation requirement and text parsing). Further, our study explores the possibility of the generalizability of LLM information extraction across different centers. The primary limitation of our study then is a sophisticated endpoint. For instance, although the colonoscopy report distribution for IBD patients is representative across UCSF and SFGH centers, we have limited class representation for more severe MES graded UC (Mayo scores 2 and 3 in particular). Other studies in the literature explore continuously valued endpoints as well multidimensional endpoints extracted from clinical text (6,7).
However, these studies focus primarily on the performance of GPT-4 and LLMs on clinical notes. There has been little consideration and commentary on prompt engineering and consequently the costs of deployment—in other words, the practicality of deploying GPT-4 for other studies. Deploying four-shot prompts on 282 parsed UCSF colonoscopy reports, requiring GPT-4 to produce an explanation, comes out to an average total cost of $418.77. There are thousands of colonoscopy procedure reports for IBD patients at our medical centers, but billions of notes across all diseases and patients in the US (8). Generative AI is and will be very expensive to deploy across all clinical areas and target variables. While LLMs like GPT-4 will enable retrospective information extraction, and consequently new observational studies using EHR data, we strongly advise clinical researchers to be mindful of various prompting strategies and their costs.
Data Availability
All data is PHI-compliant and is not to be publicly available.
Data Acknowledgement
The authors thank UCSF Academic Research Services for technical support related to enabling software in a secure, PHI compliant environment; UCSF AI Tiger Team for facilitating and managing access to Versa API (UCSF secure access to Microsoft Azure, OpenAI Large language Models); and the Chancellor’s Task Force for Generative AI.
Footnotes
Financial/Grant Support Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number K99LM014099, the National Center for Advancing Translational Sciences, National Institutes of Health, through UCSF-CTSI Grant Number UL1 TR001872, as well as the UCLA Clinical and Translational Science Institute through grant number UL1TR001881. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. This research project has benefitted from the Microsoft Accelerate Foundation Models Research (AFMR) grant program through which leading foundation models hosted by Microsoft Azure along with access to Azure credits were provided to conduct the research.
Conflicts of Interest: VAR receives research support from Alnylam, Takeda, Merck, Genentech, Blueprint Medicines, Stryker, Mitsubishi Tanabe, and Janssen. He also is a shareholder of ZebraMD. RPY has nothing to disclose.
Writing Assistance: None.