Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models

James C Dickerson, Marni B McClure, Margaret Shaw, Marissa B Reitsma, Nicole H Dalal, Allison W Kurian, Jennifer L Caswell-Jin
doi: https://doi.org/10.64898/2026.03.23.26349012
James C Dickerson
1Stanford University, Stanford, CA, USA 94305
MD, MS
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: jcdicker{at}stanford.edu
Marni B McClure
1Stanford University, Stanford, CA, USA 94305
2Women’s Malignancies Branch, Center for Cancer Research, National Cancer Institute, Bethesda, MD, USA 20814
MD, PhD
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Margaret Shaw
1Stanford University, Stanford, CA, USA 94305
MS
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Marissa B Reitsma
1Stanford University, Stanford, CA, USA 94305
PhD
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Nicole H Dalal
1Stanford University, Stanford, CA, USA 94305
MD
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Allison W Kurian
1Stanford University, Stanford, CA, USA 94305
MD MSc
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Jennifer L Caswell-Jin
1Stanford University, Stanford, CA, USA 94305
MD
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Supplementary material
  • Data/Code
  • Preview PDF
Loading

Abstract

Background Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and the treatment history are often only documented in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM) to abstract variables. We sought to understand if a wide range of variables could be abstracted from complex longitudinal oncology records with performance similar to that of expert medical oncologists.

Methods We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted a range of key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLMs were unnormalized, unlabeled, and unedited clinical notes, pathology reports, med admin records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates were similar for fully LLM-derived datasets compared with expert-derived datasets.

Results Among 100 patients, the median chart had more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status. For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use.

Conclusions Off-the-shelf general-purpose LLMs in a fixed retrieval pipeline were able to abstract a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability for key tasks, without any fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.

Introduction

Clinical research depends on abstracting key variables from the electronic medical record and converting them into structured data for analysis.1,2 This manual process is how we build disease registries, retrospective cohorts, and many real-world evidence datasets. Abstracting information from the chart is labor-intensive and vulnerable to inconsistency across reviewers. Many key variables such as diagnoses, outcomes, biomarker results, treatment exposures, and disease progression are often documented only in free text rather than in structured fields. As a result, much of the clinically meaningful information in the narrative record is not currently used for research, which limits the completeness and quality of observational datasets.3,4

Breast cancer illustrates this problem clearly. Most metastatic breast cancers are recurrences of earlier-stage disease, but SEER only captures initial treatment for patients with de novo metastatic disease, who represent just 6% of new breast diagnoses.5,6 As a result, population-level estimates of metastatic breast cancer burden in the United States depend on simulation models and claims-based inference.5,7,8 Retrospective studies that inform clinical decision-making often rely on labor-intensive manual abstraction of smaller-than-ideal cohorts because identifying cancer recurrence is challenging at scale.9,10 Answering the full range of clinically meaningful questions through manual chart review is impossible.

Large language models (LLMs) may offer a practical way to automate this work. To date, most approaches have relied on proprietary systems or developing task-specific models trained on institutional data. These approaches limit the reproducibility, portability, and local control over protected health information.11-14 Therefore, we sought to understand if off-the-shelf general-purpose LLMs (e.g. ChatGPT, Gemini), available to many healthcare systems today in HIPAA-compliant environments, can reliably abstract complex longitudinal cancer records without fine-tuning, labeled training data, or institution-specific retraining.15 If successful, this approach could provide a practical path toward unlocking large volumes of clinical data for research.

We developed a chart abstraction pipeline able to use commercially available LLMs and applied it to patients with complex breast cancer histories. We compared the performance of the LLM-derived abstraction with expert oncologist review for recurrence, systemic therapy, tumor characteristics, and genomic testing across multiple LLMs. We chose a wide range of tasks to understand if the design was flexible, and to capture the key variables often missing from observational cancer datasets. Because it was unclear how to best measure abstraction of systemic therapies, we decided to benchmark this task to a second oncologist and to research coordinators. Lastly, we assessed if LLM-derived datasets preserved downstream survival and epidemiologic inference.

Methods

Study design and objectives

We compared a fully automated retrieval-based pipeline using commercially available LLMs to expert oncologist chart review for a range of variables (Table 1). The primary endpoints were agreement between LLM-derived and expert-derived data on cancer recurrence, including whether recurrence occurred and its timing, and on the number and composition of systemic treatment lines. Secondary endpoints were agreement on other clinical variables, comparison with research coordinators for selected variables, inter-oncologist variability in systemic therapy abstraction, and preservation of downstream inference when fully LLM-derived datasets were substituted for expert-derived datasets. This study was approved by the Stanford IRB under protocol 19482.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 1. Clinical variables extracted from the electronic medical record and evaluation metrics.

Date-based variables were evaluated using concordance within ±90 days in the primary analysis, with sensitivity analyses using alternate thresholds. Categorical variables were evaluated using exact concordance. Systemic therapy abstraction included both exact lines of therapy and component anti-cancer drugs, as well as matched-line start and stop dates, treatment intent, and reason for discontinuation. For genetic testing we had access to the files from Foundation Medicine and from the companies doing the germline genetic testing and so this was used as the reference.

^ Patient-level mean Jaccard similarity was used as the benchmark for inter-expert agreement in systemic therapy abstraction.

# For analysis, discontinuation reasons were collapsed to disease progression versus other. Progression included documented progressive disease, hospice enrollment related to advancing cancer, death, or decline in performance status in the setting of progressing disease. Other included planned completion, toxicity or adverse event, patient preference, clinician choice, and transfer of care.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2. Cohort characteristics.

Clinical characteristics from the first expert derived dataset of the 100-patient breast cancer cohort and measures of chart complexity and size.

Study cohort and data extraction

We randomly selected 100 patients from an institutional database of patients with breast cancer who had undergone either tissue or blood-based FoundationOne testing (Foundation Medicine, Cambridge, MA, USA).16 This intentionally enriched the sample for complex disease trajectories while preserving heterogeneity in documentation patterns and treating clinicians across the academic flagship campus and a community-affiliate practice.

Reference standard generation and human comparators

Four medical oncologists specializing in breast cancer care (JCD, NHD, MBM, and JLC) were each assigned a random subset of patients and asked to abstract the variables listed in Table 1. They were instructed to use all available data in the medical record, excluding external notes accessible only through Epic’s external note viewer Care Everywhere (Epic Systems, Verona, WI, USA).17 Experts were provided the clinical definitions used in the study (Supplementary Methods), but were not told how to abstract the variables.

For most clinical variables, a single expert abstraction served as the reference standard. Because systemic therapy abstraction was expected to be the most difficult task, and the hardest to directly measure, a second oncologist independently abstracted the same systemic therapy variables for each patient. This allowed inter-expert variability, rather than a single ground truth, to serve as the benchmark for performance. Experts were asked to reconstruct lines of therapy, list their component drugs, assign start and stop dates, classify treatment intent as curative or palliative, and determine whether discontinuation was due to disease progression or another reason such as toxicity or patient preference.

As an additional point of comparison, research coordinators had previously abstracted the original date of diagnosis, the date of distant metastatic disease, and the same systemic therapy variables. Because the research coordinator abstraction was performed in 2022, all comparisons involving research coordinators were censored on June 1, 2022.

LLM abstraction pipeline overview

A detailed description of the pipeline is in the Supplement and GitHub repository (Figure 1). On January 10, 2026, we downloaded all available text-based electronic medical record documents for each patient from Stanford’s research warehouse.18 Pathology reports, demographics, clinical notes, and medication administration records were exported as separate CSV files, with one row per document and one folder per patient. Original metadata, including document type, author, and entry date, were preserved. The warehouse does not preserve images, scanned documents, or screenshots in the notes. No preprocessing, normalization, or cleaning of source documents was performed before ingestion into the pipeline.

Figure 1.
  • Download figure
  • Open in new tab
Figure 1. Fully Automated Chart Abstraction Workflow.

Text documents from the electronic medical record were exported from the institutional research warehouse, segmented into chunks, and indexed for exact-word, BM25, and semantic retrieval. For each task, retrieved text was then passed to commercially available LLMs using schema-constrained prompts to generate structured outputs.

The pipeline takes each narrative document and segments it into chunks approximating paragraph-length excerpts. These then can be retrieved and passed to the LLM with a task-specific prompt. Chunks are embedded to enable semantic retrieval.19 For each variable in Table 1, retrieval was restricted to the document types most relevant to manual abstraction of that variable, and custom prompts were used for each variable. For example, for the dates of diagnosis, pathology reports were used first, followed by clinical notes. Retrieval used a combination of search strategies to find candidate text chunks including exact word search, BM25, and semantic embedding queries depending on the variable (Supplementary Table S1).20,21

Retrieved text chunks were then provided to HIPAA-compliant, commercially available LLMs. We evaluated GPT-5, GPT-4o, DeepSeek-R1, and Gemini 2.5 Pro through Stanford’s SecureGPT platform.22 Prompts and the evidence sent to the LLMs were identical for each task. Tasks were stateless (i.e. no chat history), but for multi-pass tasks, intermediate outputs were carried forward. After pilot development, prompts, retrieval parameters, and task-specific workflows were fixed before final model evaluation. For a given task, all models received identical text chunks and prompts. The LLMs had no access to expert annotations nor structured reference data.

Outcomes and statistical analysis

The primary endpoints of this descriptive study were agreement between LLM-derived and expert-derived abstraction for recurrence dates and lines of systemic therapy. We also evaluated the concordance for the additional variables listed in Table 1. For all non-drug variables, we prespecified greater than 90% agreement with expert-derived data for at least two LLMs as an arbitrary threshold for encouraging performance. For date-based variables, agreement was defined as concordance within ± 90 days, which we prespecified as a clinically meaningful threshold and examined ± 30 days for select variables. Categorical variables required exact concordance. For mutation testing, the primary analysis focused on mutation presence, with all other outcomes treated as negative or not found.

For prescribed systemic therapy, we evaluated both exact lines of therapy and the anti-cancer drugs contained within those lines. We prespecified inter-oncologist variability as the benchmark for systemic therapy performance. We considered overlap of the confidence intervals for patient-level mean Jaccard similarity between at least two LLMs and the second oncologist as encouraging performance.23 After therapy-name normalization, predicted and reference lines were matched using a greedy one-to-one algorithm based on temporal proximity. We also calculated precision, recall, and F1 score for both line-level and drug-level extraction. Additional analyses included agreement on the number of therapy lines and, among matched lines, agreement on start and stop dates within ± 30 days, treatment intent, and reason for discontinuation.

After abstraction, we created two separate analytic datasets: one fully LLM-derived and one expert-derived. We then examined overall survival from initial breast cancer diagnosis and invasive recurrence-free survival. Censoring was done at the last note in the chart. Differences in survival were measured using log-rank tests and visualized with Kaplan–Meier curves. Because death registry linkage was not used, and less than half the cohort had a documented death date in the ground truth data, we conservatively estimated survival by assuming that patients without a recorded death date were dead at their last clinical follow-up. The standard Kaplan-Meier censoring approach is also presented.24 To compare effect estimates in human-derived and LLM-derived datasets, we evaluated univariate hazard ratio estimates for stage (4 vs 1 – 3) and hormone receptor status (negative vs positive) on overall survival and recurrence-free survival for the early stage patients. We compared hazard ratio estimates and 95% confidence intervals from expert-derived and LLM-derived datasets and used Cochran’s Q test as a formal test for heterogeneity between effect estimates.

Confidence intervals (CIs) for all agreement and performance metrics were estimated using bias-corrected and accelerated bootstrap resampling at the patient level with 1,000 iterations.25 All analyses were conducted in R version 4.5.2 (R Foundation for Statistical Computing, Vienna, Austria) using base functions and the boot and survival packages. The analytic code is available on GitHub.

External Validation in Young Breast Cancer Patients

To examine performance in a clinically distinct population, we evaluated a previously manually abstracted cohort of young patients diagnosed with breast cancer before becoming pregnant.26 We compared recurrence detection and receipt of adjuvant endocrine therapy for patients who had an initial invasive diagnosis and hormone receptor–positive disease. We compared expert-abstraction to the two best performing LLMs. The pipeline, retrieval strategy, and prompts were applied to this cohort without modification.

Results

For simplicity, the main text focuses on the two LLMs that performed best overall; results for all four models are in Supplementary Figure S1 and Supplementary Table S2.

The 100-patient cohort was complex with around 1,100 individual drugs prescribed in a median of 7 lines of therapy (interquartile range [IQR], 4.75 – 9) over 6.5 years of follow-up from diagnosis (IQR, 3.7 – 9.9 years). The median chart size was approximately 2.3 million tokens or roughly 3,100 pages of text. The average time to process an individual patient ranged from 1 – 11 minutes (Supplementary Table S7).

Agreement for non-systemic therapy variables

Across non-systemic variables, the pipeline showed high agreement with expert oncologist abstraction (Figure 2; Supplementary Figure S1; Supplementary Table S2). The two best-LLMs achieved at least 90% agreement across all 13 variables. Agreement was highest for variables anchored to pathology reports or demographic data. Variables with heterogeneous documentation, those requiring longitudinal reasoning such as stage assignment, and variables often available only in scanned PDFs or screenshots not retained in the research warehouse had lower agreement (Figure 2; Supplementary Table S6).

Figure 2.
  • Download figure
  • Open in new tab
Figure 2. Agreement with expert oncologist abstraction across key clinical variables.

Forest plot showing agreement with expert oncologist abstraction and 95% bootstrap confidence intervals for the best and second-best performing LLMs across key clinical variables. Research coordinators are shown for variables they previously abstracted. Date-based variables were evaluated using concordance within ± 90 days. Categorical variables were evaluated using exact concordance.

* Treatment intent and reason for discontinuation were evaluated only among matched therapy lines; evaluable denominators were n = 651 for the second oncologist, n = 645 for Gemini, n = 620 for GPT-5, and n = 551 for the research coordinator.

For the original diagnosis date, the two best-performing LLMs agreed with the expert on more than 97% of patients. No original diagnosis dates differed by more than 1 year (Supplementary Figure S2). For metastatic recurrence, agreement within ± 90 days was 90% and 96% for the two best-performing LLMs; 90% and 82% of recurrence dates were within 30 days. Locoregional recurrence had high date concordance (Supplementary Figure S2), but with more false negatives than metastatic recurrence. The two best-performing LLMs missed 3 and 5 events, respectively. In the 6 total unique cases, the LLMs had classified the recurrence as distant metastatic rather than locoregional. Full model-by-model results are shown in Supplementary Table S2, with an expanded forest plot in Supplementary Figure S1, confusion matrices for stage and hormone receptor and HER2 status in Supplementary Figure S3, and a qualitative error analysis in Supplementary Table S6.

Systemic therapy abstraction relative to inter-expert variability

Compared with the first expert oncologist, the second oncologist achieved a mean patient-level Jaccard similarity of 0.95 (95% CI, 0.92 to 0.96) for all prescribed anti-cancer drugs and 0.86 (95% CI, 0.81 to 0.89) for reconstructing the exact lines of therapy (Figure 4; Supplementary Tables S2 and S3). Among matched lines (n = 651), oncologist agreement was 93% for treatment intent, 92% for reason for discontinuation, 98% for start date within ± 30 days, and 90% for stop date within ± 30 days.

Figure 3.
  • Download figure
  • Open in new tab
Figure 3. Survival estimates derived from expert versus LLM chart abstraction.

Kaplan–Meier curves comparing survival estimates generated from fully expert-derived data and fully LLM-derived data for the two best-performing LLMs. Solid curves show a conservative lower bound analysis where censoring is assumed to represent death/recurrence. The dashed curves show the standard Kaplan–Meier approach with censoring at last follow-up for comparison. (A) Overall survival. (B) Invasive recurrence-free survival among patients with stage 1 – 3 diseases. Differences between expert and LLM curves were separately assessed using log-rank tests.

Figure 4.
  • Download figure
  • Open in new tab
Figure 4. Patient-level overlap in systemic therapy abstraction.

Violin plots showing patient-level Jaccard similarity for abstraction of all prescribed anti-cancer drugs (Panel A) and exact therapy lines (Panel B). Diamonds indicate medians and bars are the interquartile ranges. Results are shown for the second oncologist, the two best-performing LLMs, and research coordinators relative to the expert oncologist reference.

The LLMs had a similar mean patient-level Jaccard similarity of 0.91 (95% CI, 0.89 to 0.93) and 0.90 (95% CI, 0.87 to 0.92) for prescribed anti-cancer drugs with CIs that overlapped those of the second oncologist. For lines of therapy, they were notably lower at 0.77 (95% CI, 0.72 to 0.81) and 0.73 (95% CI, 0.69 to 0.78). When the two best-performing LLMs were compared with each other, they disagreed to a similar extent as the two expert oncologists did with Jaccard similarities of 0.96 (95% CI, 0.94 to 0.98) for drugs and 0.84 (95% CI, 0.79 to 0.87) for lines of therapy. Violin plots for all four LLMs are shown in Supplementary Figure S6.

Research coordinators performed worse than all LLMs on systemic therapy abstraction, with a patient-level Jaccard similarity of 0.70 (95% CI, 0.67 to 0.74) for prescribed anti-cancer drugs and 0.53 (95% CI, 0.49 to 0.58) for the lines of therapy. Even after we post hoc collapsed all aromatase inhibitors and all taxanes into single entities, research coordinators still had worse performance than the worst LLM for line reconstruction (0.65 vs 0.71). Abstraction performance was stable for most LLMs across large and long charts, but performance declined for research coordinators as the chart got bigger (Supplementary Figure S5). A representative treatment timeline for a complex patient is shown in Supplementary Figure S4.

Preservation of survival and effect estimates

Cohort level inference for survival was preserved when fully LLM-derived datasets were used in place of fully expert-derived datasets (Figure 3). Because fewer than half of patients had a documented date of death, overall survival estimates were sensitive to censoring assumptions, with the median in the expert-derived dataset differing by approximately 3 years depending on the censoring rule; Figure 3’s solid lines uses a conservative approach in which patients without a documented death date are treated as having died at their last clinical follow-up. The dashed lines show traditional censoring at last follow-up. The censoring rule did not change the similarity of survival estimates for LLM and expert. Median overall survival was 78.2 months for the expert and 78.2 months for the two LLMs (log-rank P = 0.99 and 1.00). Median recurrence-free survival was 34.9 months for the expert and 34.9 months for the two LLMs (P = 0.91 and 0.97).

When comparing the hazard ratio and confidence intervals for the risk of death for disease stage they were similar for expert-derived and LLM derived datasets: 2.80 (95% CI, 1.57 to 5.00) in the expert-derived dataset, compared with 2.85 (95% CI, 1.38 to 5.86) and 2.58 (95% CI, 1.25 to 5.32) for the best two LLMs. For hormone receptor-negative versus hormone receptor-positive disease, the hazard ratio was 1.86 (95% CI, 1.18 to 2.93) in the expert-derived dataset, compared with estimates that were numerically similar but did cross 1.00 for the LLMs with ratios of 1.66 (95% CI, 0.91 to 3.00) and 1.60 (95% CI, 0.87 to 2.91) in the corresponding LLM-derived datasets. Cochran Q heterogeneity testing did not identify significant differences in these effect estimates (Supplementary Table S5).

External Validation in Young Breast Cancer Patients

In an external cohort of 97 young patients with hormone receptor–positive invasive breast cancer diagnosed before pregnancy, expert manual abstraction identified 11 recurrences, compared with 13 and 13 identified by the two leading LLMs. One discordant case was an expert error; the other was an LLM error. The LLM error was on the smallest chart in the cohort—there were no Stanfrd notes within 7 years of the documented recurrence event and it could only be identified in outside hospital notes that were not in the research warehouse for data ingestion. Receipt of adjuvant endocrine therapy had an F1 score of 0.99 and 0.96 for the two LLMs.

Discussion

In this retrospective study of 100 patients with complex breast cancer histories, a pipeline using off-the-shelf general-purpose LLMs abstracted clinical variables from the medical record with high concordance with expert oncologists. Concordance was highest for variables anchored to pathology, such as estrogen receptor status, and lowest for variables requiring clinical reasoning, such as why a treatment was stopped. For abstracting the anti-cancer treatments, the best-performing model approached inter-oncologist variability. However, there was a clear difference in how the oncologists and LLMs reconstructed therapy lines. Despite differences at the individual patient level, cohort level survival and hazard ratio estimates were similar when fully LLM-derived datasets were compared to expert-derived datasets.

Manual chart abstraction is a major bottleneck in clinical research. Because we write notes in unstructured text, answering many of the clinically meaningful questions is impractical because creating the analytic dataset is too labor-intensive.5 We have high uncertainty across many clinical domains even when, in principle, the information needed to answer these questions exists in the medical record. This is especially true for rare and complex diseases, settings with heterogeneous practice patterns, areas in which unreplicated retrospective studies dominate clinical decision-making, and domains with less mature disease registries than cancer. A practical implication of our approach is that it would make conducting multi-institutional cohort studies much easier: institutions can share a validated common abstraction pipeline while keeping patient-level data local.13 Doing this would allow consortiums to build large harmonized cohorts to address a broad range of clinically important questions.

Our results highlight challenges that extend beyond LLMs to any research in which chart review is a central component. Sparse documentation remains a major barrier, particularly in fragmented health systems such as the United States. In health systems with more centralized longitudinal records, we would expect the LLM-pipeline approach to be powerful. As an example, the lack of high-quality death records in our analysis shows how a lack of data integration, and the downstream analytic assumptions, shift results by clinically meaningful amounts. Variables, syndromes, and diseases with ambiguous clinical definitions remain difficult to abstract and to study. Poorly designed abstraction frameworks, whether given to an LLM or a human-abstractor, will generate poor datasets.27

As LLM-based abstraction becomes more common, whether through abstractors copying and pasting into chatbots or through validated pipelines, it will be critical to establish norms for study design, transparent definitions, statistical rigor, and benchmarking against domain experts.28 Most retrospective studies are fundamentally correlative and, on their own, do not support causal inference.29 Scaling abstraction will not get us closer to the truth without corresponding scientific rigor.

This study has limitations: it was conducted in a single healthcare system and focused on a single disease. Effective abstraction was contingent on the relevant data being present in the record. Errors could be because of a lack of data, badly designed retrievers, badly designed prompts, or LLM reasoning errors. There remains a clear gap between experts and LLMs on the most complex abstraction tasks, although we expect this gap to narrow as the LLMs improve over time (Supplementary Figure S1).

Conclusion

Off-the-shelf general-purpose LLMs in a fixed retrieval pipeline can abstract clinically meaningful variables from complex longitudinal oncology records. Performance was strongest for variables anchored to pathology and lower for ones that required clinical interpretation. Although a gap remained at the individual-patient level, survival and hazard ratio estimates at the cohort level were preserved. These findings suggest that LLM-based abstraction could enable the creation of large research-grade retrospective datasets that were previously impractical to assemble, thereby supporting a broad range of clinical questions. Realizing this potential will require carefully designed abstraction tasks, validated outputs, and cautious interpretation of results.

Data Availability

The analytic code and core abstraction pipeline will be made publicly available on GitHub upon publication.

Prior Presentations

This work was presented as a poster at American Society of Clinical Oncology Annual Meeting 2025 and the San Antonio Breast Cancer Symposium 2025

Funding

This work was supported by the Stanford Center for Digital Health

Competing Interests / Conflicts of Interest

JCD: Stock Ownership: Johnson & Johnson, Merck. JLC: research funding to her institution from Effector Therapeutics and Novartis.

Data and Code Availability

The analytic code and core abstraction pipeline will be made publicly available upon publication. Please contact jcdicker{at}stanford.edu for collaborations and early access to the pipeline.

Acknowledgements

We would like to thank our research coordinators Jessica Orford, Sonia Rios-Ventura, and Rozelle Laquindanum.

References

  1. 1.↵
    A’Mar T, Beatty JD, Fedorenko C, et al. Incorporating Breast Cancer Recurrence Events Into Population-Based Cancer Registries Using Medical Claims: Cohort Study. JMIR Cancer. Aug 17 2020;6(2):e18143. doi:10.2196/18143
    OpenUrlCrossRef
  2. 2.↵
    Cao A, Johnson KL, Fumagalli IA, et al. Integrating a Shareable Artificial Intelligence Model Into Clinical Research for Cancer Recurrence in Patients With Breast and Colorectal Cancer. JCO Clin Cancer Inform. Nov 2025;9:e2500143. doi:10.1200/cci-25-00143
    OpenUrlCrossRef
  3. 3.↵
    Jourquin J, Dickerson JC, Marks EG, et al. Susan G. Komen’s ShareForCures: a patient-engaged, nationwide breast cancer research registry. JNCI: Journal of the National Cancer Institute. 2025;doi:10.1093/jnci/djaf353
    OpenUrlCrossRef
  4. 4.↵
    Riaz F, Vaughn JL, Zhu H, et al. Inpatient Immunotherapy Outcomes Study: A Multicenter Retrospective Analysis. JCO Oncology Practice. 2025;21(8):1165–1173. doi:10.1200/op-24-00788
    OpenUrlCrossRefPubMed
  5. 5.↵
    Warren JL, Yabroff KR. Challenges and opportunities in measuring cancer recurrence in the United States. J Natl Cancer Inst. Aug 2015;107(8)doi:10.1093/jnci/djv134
    OpenUrlCrossRefPubMed
  6. 6.↵
    NIH. Breast Cancer Recent Trends in SEER Age-Adjusted Incidence Rates, 2000-2022. Accessed 7/29/25, 2025. https://seer.cancer.gov/statistics-network/explorer/application.html?site=55&data_type=1&graph_type=2&compareBy=sex&chk_sex_3=3&chk_sex_2=2&rate_type=2&race=1&age_range=1&stage=101&advopt_precision=1&advopt_show_ci=on&hdn_view=0&advopt_show_apc=on&advopt_display=2#resultsRegion0
  7. 7.↵
    Izci H, Tambuyzer T, Tuand K, et al. A Systematic Review of Estimating Breast Cancer Recurrence at the Population Level With Administrative Data. JNCI: Journal of the National Cancer Institute. 2020;112(10):979–988. doi:10.1093/jnci/djaa050
    OpenUrlCrossRefPubMed
  8. 8.↵
    Mariotto AB, Etzioni R, Hurlbert M, Penberthy L, Mayer M. Estimation of the Number of Women Living with Metastatic Breast Cancer in the United States. Cancer Epidemiol Biomarkers Prev. Jun 2017;26(6):809–815. doi:10.1158/1055-9965.Epi-16-0889
    OpenUrlAbstract/FREE Full Text
  9. 9.↵
    Fehrenbacher L, Capra AM, Quesenberry CP, Jr., Fulton R, Shiraz P, Habel LA. Distant invasive breast cancer recurrence risk in human epidermal growth factor receptor 2-positive T1a and T1b node-negative localized breast cancer diagnosed from 2000 to 2006: a cohort from an integrated health care delivery system. J Clin Oncol. Jul 10 2014;32(20):2151–8. doi:10.1200/jco.2013.52.0858
    OpenUrlAbstract/FREE Full Text
  10. 10.↵
    Sun L, Bleiberg B, Hwang WT, et al. Association Between Duration of Immunotherapy and Overall Survival in Advanced Non-Small Cell Lung Cancer. JAMA Oncol. Aug 1 2023;9(8):1075–1082. doi:10.1001/jamaoncol.2023.1891
    OpenUrlCrossRefPubMed
  11. 11.↵
    Gupta S, Basu A, Nievas M, et al. PRISM: Patient Records Interpretation for Semantic clinical trial Matching system using large language models. npj Digital Medicine. 2024/10/28 2024;7(1):305. doi:10.1038/s41746-024-01274-7
    OpenUrlCrossRefPubMed
  12. 12.
    Zhu M, Lin H, Jiang J, et al. Large language model trained on clinical oncology data predicts cancer progression. npj Digital Medicine. 2025/07/02 2025;8(1):397. doi:10.1038/s41746-025-01780-2
    OpenUrlCrossRefPubMed
  13. 13.↵
    Jee J, Fong C, Pichotta K, et al. Automated real-world data integration improves cancer outcome prediction. Nature. 2024/12/01 2024;636(8043):728-736. doi:10.1038/s41586-024-08167-5
    OpenUrlCrossRefPubMed
  14. 14.↵
    Luo I, Graber-Naidich A, Zhang M, et al. Leveraging large language models to extract smoking history from clinical notes for lung cancer surveillance. npj Digital Medicine. 2025/11/28 2025;8(1):731. doi:10.1038/s41746-025-02009-y
    OpenUrlCrossRefPubMed
  15. 15.↵
    Gliadkovskaya A. OpenAI rolls out ChatGPT for Healthcare, a gen AI workspace for hospitals and clinics. 2026. https://www.fiercehealthcare.com/health-tech/openai-rolls-out-chatgpt-healthcare-genai-workspace-enterprises
  16. 16.↵
    Milbury CA, Creeden J, Yip WK, et al. Clinical and analytical validation of FoundationOne®CDx, a comprehensive genomic profiling assay for solid tumors. PLoS One. 2022;17(3):e0264138. doi:10.1371/journal.pone.0264138
    OpenUrlCrossRefPubMed
  17. 17.↵
    Epic Systems C. Care Everywhere. Epic Systems Corporation. Updated 2026. https://www.epic.com/careeverywhere/?search=&country=&usstate=
  18. 18.↵
    Repository SRD. Data types in STARR. https://starr.stanford.edu/data-types
  19. 19.↵
    Sounack T, Davis J, Durieux B, et al. BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP. Preprint. arXiv. 2025/06/12 2025;
  20. 20.↵
    Robertson S, Zaragoza H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. 2009;3(4):333–389. doi:10.1561/1500000019
    OpenUrlCrossRef
  21. 21.↵
    Malkov YA, Yashunin DA. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv. 2018;doi:10.48550/arXiv.1603.09320
    OpenUrlCrossRef
  22. 22.↵
    Ng MY, Helzer J, Pfeffer MA, Seto T, Hernandez-Boussard T. Development of Secure Infrastructure for Advancing Generative AI Research in Healthcare at an Academic Medical Center. Res Sq. Sep 24 2024;doi:10.21203/rs.3.rs-5095287/v1
    OpenUrlCrossRef
  23. 23.↵
    Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles. 1901;7:547–579.
    OpenUrl
  24. 24.↵
    Kaplan EL, Meier P. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association. 1958;53(282):457–481. doi:10.2307/2281868
    OpenUrlCrossRefWeb of Science
  25. 25.↵
    Efron B. Better Bootstrap Confidence Intervals. Journal of the American Statistical Association. 1987;82(397):171–185. doi:10.2307/2289144
    OpenUrlCrossRefWeb of Science
  26. 26.↵
    Ransohoff JD, Lewinsohn RM, Dickerson J, et al. Endocrine Therapy Interruption, Resumption, and Outcomes Associated With Pregnancy After Breast Cancer. JAMA Oncol. Apr 1 2025;11(4):423–426. doi:10.1001/jamaoncol.2024.6868
    OpenUrlCrossRefPubMed
  27. 27.↵
    Pfaffenlehner M, Behrens M, Zöller D, et al. Methodological challenges using routine clinical care data for real-world evidence: a rapid review utilizing a systematic literature search and focus group discussion. BMC Med Res Methodol. Jan 14 2025;25(1):8. doi:10.1186/s12874-024-02440-x
    OpenUrlCrossRef
  28. 28.↵
    Castelo-Branco L, Pellat A, Martins-Branco D, et al. ESMO Guidance for Reporting Oncology real-World evidence (GROW). ESMO Real World Data Digit Oncol. Nov 2023;1:100003. doi:10.1016/j.esmorw.2023.10.001
    OpenUrlCrossRefPubMed
  29. 29.↵
    Dahabreh IJ, Bibbins-Domingo K. Causal Inference About the Effects of Interventions From Observational Studies in Medical Journals. JAMA. 2024;331(21):1845–1853. doi:10.1001/jama.2024.7741
    OpenUrlCrossRefPubMed
Back to top
PreviousNext
Posted March 25, 2026.
Download PDF

Supplementary Material

Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models
James C Dickerson, Marni B McClure, Margaret Shaw, Marissa B Reitsma, Nicole H Dalal, Allison W Kurian, Jennifer L Caswell-Jin
medRxiv 2026.03.23.26349012; doi: https://doi.org/10.64898/2026.03.23.26349012
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Fully Automated Abstraction of Longitudinal Breast Oncology Records with Off-The-Shelf Large Language Models
James C Dickerson, Marni B McClure, Margaret Shaw, Marissa B Reitsma, Nicole H Dalal, Allison W Kurian, Jennifer L Caswell-Jin
medRxiv 2026.03.23.26349012; doi: https://doi.org/10.64898/2026.03.23.26349012

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Oncology
Subject Areas
All Articles
  • Addiction Medicine (576)
  • Allergy and Immunology (868)
  • Anesthesia (306)
  • Cardiovascular Medicine (4483)
  • Dentistry and Oral Medicine (449)
  • Dermatology (385)
  • Emergency Medicine (615)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1528)
  • Epidemiology (15283)
  • Forensic Medicine (31)
  • Gastroenterology (1134)
  • Genetic and Genomic Medicine (6651)
  • Geriatric Medicine (671)
  • Health Economics (1006)
  • Health Informatics (4606)
  • Health Policy (1378)
  • Health Systems and Quality Improvement (1624)
  • Hematology (545)
  • HIV/AIDS (1276)
  • Infectious Diseases (except HIV/AIDS) (15965)
  • Intensive Care and Critical Care Medicine (1111)
  • Medical Education (626)
  • Medical Ethics (147)
  • Nephrology (675)
  • Neurology (6699)
  • Nursing (346)
  • Nutrition (1006)
  • Obstetrics and Gynecology (1153)
  • Occupational and Environmental Health (961)
  • Oncology (3370)
  • Ophthalmology (989)
  • Orthopedics (370)
  • Otolaryngology (421)
  • Pain Medicine (437)
  • Palliative Medicine (131)
  • Pathology (670)
  • Pediatrics (1704)
  • Pharmacology and Therapeutics (700)
  • Primary Care Research (717)
  • Psychiatry and Clinical Psychology (5497)
  • Public and Global Health (9288)
  • Radiology and Imaging (2225)
  • Rehabilitation Medicine and Physical Therapy (1375)
  • Respiratory Medicine (1202)
  • Rheumatology (598)
  • Sexual and Reproductive Health (721)
  • Sports Medicine (536)
  • Surgery (722)
  • Toxicology (100)
  • Transplantation (290)
  • Urology (267)