Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models

James C Dickerson; Marni B McClure; Margaret Shaw; Marissa B Reitsma; Nicole H Dalal; Allison W Kurian; Jennifer L Caswell-Jin

doi:10.64898/2026.03.23.26349012

Abstract

Background Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and the treatment history are often only documented in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM) to abstract variables. We sought to understand if a wide range of variables could be abstracted from complex longitudinal oncology records with performance similar to that of expert medical oncologists.

Methods We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted a range of key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLMs were unnormalized, unlabeled, and unedited clinical notes, pathology reports, med admin records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates were similar for fully LLM-derived datasets compared with expert-derived datasets.

Results Among 100 patients, the median chart had more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status. For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use.

Conclusions Off-the-shelf general-purpose LLMs in a fixed retrieval pipeline were able to abstract a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability for key tasks, without any fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.

Competing Interest Statement

JCD: Stock Ownership: Johnson & Johnson, Merck. JLC: research funding to her institution from Effector Therapeutics and Novartis.

Funding Statement

This work was supported by the Stanford Center for Digital Health

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Our study was approved by the Stanford IRB under protocol 19482

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The analytic code and core abstraction pipeline will be made publicly available on GitHub upon publication.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.