medRxiv

Real-World Data for Predicting Rapid Relapse Triple Negative Cancer: A Study Using NCDB and EHR Data

Pallavi Jonnalagadda, Samilia Obeng-Gyasi, Daniel G. Stover, Barbara L. Andersen, Saurabh Rahurkar
doi: https://doi.org/10.64898/2026.01.28.26345096
Pallavi Jonnalagadda
1Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH
MD DrPH
Samilia Obeng-Gyasi
2Division of Surgical Oncology, Department of Surgery, The Ohio State University, Columbus, OH
MD MPH
Daniel G. Stover
1Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH
3Department of Internal Medicine, The Ohio State University, Columbus, OH
MD
Barbara L. Andersen
4Department of Psychology, The Ohio State University, Columbus, OH
PhD
Saurabh Rahurkar
1Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH
DDS DrPH
  • For correspondence: saurabh.rahurkar{at}osumc.edu

Abstract

Background Many patients with triple-negative breast cancer (TNBC), particularly those who are older, Black, or insured by Medicaid, do not receive guideline-concordant treatment, even though such treatment is associated with up to fourfold higher survival. Early identification of patients at risk for rapid relapse may enable timely interventions and improve outcomes. This study applies machine learning (ML) to real-world data to predict the risk of rapid relapse in TNBC.

Methods We trained several ML models (logistic regression, decision trees, random forests, XGBoost, naïve Bayes, and support vector machines) on National Cancer Database (NCDB) data and fine-tuned them on electronic health record (EHR) data from a cancer registry. Class imbalance was addressed with the synthetic minority oversampling technique (SMOTE). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the receiver operating characteristic curve (ROC AUC), accuracy, and F1 score. Transfer learning, cross-validation, and threshold optimization were applied to improve the ensemble model’s performance on clinical data.
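As an illustration of the oversampling step named in the Methods, the following is a minimal sketch of the core SMOTE idea, not the authors' implementation: each synthetic minority row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours. The function name `smote_oversample`, its parameters, and the toy feature values are assumptions made for this example only.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a randomly chosen
    minority row and one of its k nearest minority neighbours (SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class (illustrative values only).
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_oversample(minority_rows, n_new=6, k=2)
```

In common practice, oversampling of this kind is applied only to training folds so that synthetic rows do not leak into validation or test data; the abstract does not state how the authors handled this.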

Results Initial models trained on NCDB data exhibited high NPV but low sensitivity and PPV. SMOTE and hyperparameter tuning produced modest improvements. External testing on EHR data from a cancer registry yielded similar performance. After transfer learning, cross-validation, and threshold optimization on the clinical data, the optimized ensemble model achieved a sensitivity of 0.87, specificity of 0.99, PPV of 0.90, NPV of 0.98, ROC AUC of 0.99, accuracy of 0.98, and F1 score of 0.88. This optimized model, which relies on readily available clinical data, outperformed the initial NCDB-trained models and models reported in the extant literature.
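The reported metrics and the idea of threshold optimization can be made concrete with a small sketch. It computes the metrics from confusion-matrix counts, picks the probability cutoff that maximizes F1 on a grid, and checks that the reported F1 score is consistent with the reported sensitivity and PPV. The helper names, the toy labels and probabilities, and the choice of F1 as the optimization target are assumptions for illustration; the abstract does not state which criterion the authors optimized.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion counts."""
    return {
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


def best_f1_threshold(y_true, y_prob, grid):
    """Return the first grid threshold with the highest F1 score."""
    return max(grid, key=lambda t: metrics(*confusion_counts(y_true, y_prob, t))["f1"])


# Hypothetical labels and predicted probabilities (not study data).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.70, 0.90]
grid = [i / 10 for i in range(1, 10)]

threshold = best_f1_threshold(y_true, y_prob, grid)
chosen = metrics(*confusion_counts(y_true, y_prob, threshold))

# F1 is the harmonic mean of PPV and sensitivity; with the reported
# PPV of 0.90 and sensitivity of 0.87 this rounds to the reported 0.88.
reported_f1 = 2 * 0.90 * 0.87 / (0.90 + 0.87)
```

A threshold chosen this way should be selected on validation data and then confirmed on data not used for selection, which is consistent with the prospective external validation the authors call for.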

Conclusions Transfer learning and threshold optimization effectively adapted ML models trained on NCDB data to an independent real-world clinical dataset from a single site, producing a high-performing model for predicting rapid relapse in TNBC. This model, potentially translatable to Fast Healthcare Interoperability Resources (FHIR)-compatible workflows, is a promising tool for identifying patients at high risk. Future work should include prospective external validation, evaluation of integration into clinical workflows, and implementation studies to determine whether the model improves care processes such as timely patient navigation and treatment planning.

Author Summary In this study, we set out to understand which patients with triple-negative breast cancer might experience a rapid return of their disease. Many people with this aggressive form of cancer do not receive the treatments that are known to improve survival, especially patients who are older, Black, or insured through public programs. Being able to identify those at highest risk early in their care could help health teams provide timely support and ensure that patients receive the treatments they need.

To do this, we used information from a large national cancer database to build computer-based models that learn from patterns in patient data. We then refined these models using real medical records from a cancer center to make sure they worked well in everyday clinical settings. After adjusting and improving the models, we developed a tool that can correctly identify most patients who are likely to have a rapid return of their cancer.

Our hope is that this type of tool could eventually be built into routine care and help guide timely follow-up, support services, and treatment planning. More testing in real clinical environments will be important to understand how well the tool improves care and outcomes for patients.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

Yes

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Ohio State University Office of Responsible Research Practices deemed this study institutional review board exempt.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The data underlying the results presented in the study are available from https://www.facs.org/quality-programs/cancer-programs/national-cancer-database/puf/ and https://u.osu.edu/secondarydatacore/osu-pcornet-common-data-model-cdm/


Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted January 30, 2026.
Subject Area

  • Oncology