Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models

James C Dickerson, Marni B McClure, Margaret Shaw, Marissa B Reitsma, Nicole H Dalal, Allison W Kurian, Jennifer L Caswell-Jin
doi: https://doi.org/10.64898/2026.03.23.26349012
James C Dickerson
1Stanford University, Stanford, CA, USA 94305
MD, MS
  • For correspondence: jcdicker{at}stanford.edu
Marni B McClure
1Stanford University, Stanford, CA, USA 94305
2Women’s Malignancies Branch, Center for Cancer Research, National Cancer Institute, Bethesda, MD, USA 20814
MD, PhD
Margaret Shaw
1Stanford University, Stanford, CA, USA 94305
MS
Marissa B Reitsma
1Stanford University, Stanford, CA, USA 94305
PhD
Nicole H Dalal
1Stanford University, Stanford, CA, USA 94305
MD
Allison W Kurian
1Stanford University, Stanford, CA, USA 94305
MD, MSc
Jennifer L Caswell-Jin
1Stanford University, Stanford, CA, USA 94305
MD

Abstract

Background Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and treatment history are often documented only in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM) to abstract variables. We sought to determine whether a wide range of variables could be abstracted from complex longitudinal oncology records with performance similar to that of expert medical oncologists.

Methods We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted a range of key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLMs were unnormalized, unlabeled, and unedited clinical notes, pathology reports, medication administration records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates were similar for fully LLM-derived datasets compared with expert-derived datasets.
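
The abstract does not say which metric was used to compare drug lists between abstractors. As an illustration only, one common choice is a per-patient set-overlap F1 score between the reference oncologist's drug list and a comparator's list. The function name `drug_set_f1`, the lowercase normalization, and the example drug lists below are assumptions made for this sketch, not details from the study.

```python
from typing import Iterable


def drug_set_f1(reference: Iterable[str], candidate: Iterable[str]) -> float:
    """Set-overlap F1 between two drug lists for one patient (hypothetical metric)."""
    # Normalize case and whitespace so "Letrozole" and "letrozole " match.
    ref = {d.strip().lower() for d in reference}
    cand = {d.strip().lower() for d in candidate}
    if not ref and not cand:
        return 1.0  # both abstractors agree that no drugs were given
    true_positives = len(ref & cand)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative lists only; not data from the study.
expert_drugs = ["letrozole", "palbociclib", "capecitabine"]
llm_drugs = ["Letrozole", "palbociclib"]
score = drug_set_f1(expert_drugs, llm_drugs)
```

Applying the same function to the second oncologist's list would give an inter-oncologist baseline against which an LLM score could be read.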

Results Among 100 patients, the median chart had more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status. For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than that of the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use.
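
The concordance figures above can be read as the share of patients for whom the LLM label matches the expert label. The sketch below assumes simple percent agreement, which the abstract does not confirm. The function `percent_concordance`, the patient identifiers, and the labels are hypothetical.

```python
from typing import Hashable, Mapping


def percent_concordance(
    reference: Mapping[str, Hashable],
    candidate: Mapping[str, Hashable],
) -> float:
    """Percentage of reference patients whose candidate label equals the reference label."""
    if not reference:
        raise ValueError("reference must contain at least one patient")
    # A patient missing from `candidate` counts as a disagreement
    # (assumes reference labels are never None).
    matches = sum(
        1 for patient_id, label in reference.items()
        if candidate.get(patient_id) == label
    )
    return 100.0 * matches / len(reference)


# Illustrative labels only; not data from the study.
expert = {"p1": "recurred", "p2": "no recurrence", "p3": "recurred", "p4": "no recurrence"}
llm = {"p1": "recurred", "p2": "no recurrence", "p3": "no recurrence", "p4": "no recurrence"}
result = percent_concordance(expert, llm)  # 3 of 4 labels match
```

Percent agreement does not correct for chance agreement, so a chance-corrected statistic may be preferable when one label dominates.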

Conclusions Off-the-shelf general-purpose LLMs in a fixed retrieval pipeline were able to abstract a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability for key tasks, without any fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.

Competing Interest Statement

JCD: Stock Ownership: Johnson & Johnson, Merck. JLC: research funding to her institution from Effector Therapeutics and Novartis.

Funding Statement

This work was supported by the Stanford Center for Digital Health.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Our study was approved by the Stanford IRB under protocol 19482.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The analytic code and core abstraction pipeline will be made publicly available on GitHub upon publication.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted March 25, 2026.
Subject Area

  • Oncology