medRxiv

Real-World Evaluation of Large Language Models in Healthcare (RWE-LLM): A New Realm of AI Safety & Validation

Meenesh Bhimani, Alex Miller, Jonathan D. Agnew, Markel Sanz Ausin, Mariska Raglow-Defranco, Harpreet Mangat, Michelle Voisard, Maggie Taylor, Sebastian Bierman-Lytle, Vishal Parikh, Juliana Ghukasyan, Rae Lasko, Saad Godil, Ashish Atreja, Subhabrata Mukherjee
doi: https://doi.org/10.1101/2025.03.17.25324157
Meenesh Bhimani, MD, MHA (1). For correspondence: research{at}hippocraticai.com
Alex Miller, BS (1)
Jonathan D. Agnew, PhD, MBA (2)
Markel Sanz Ausin, PhD (1)
Mariska Raglow-Defranco, BA (1)
Harpreet Mangat, MD, MBA (1)
Michelle Voisard, BSN, RN, CCM (1)
Maggie Taylor, RN, BSN, CCM (1)
Sebastian Bierman-Lytle, BS (1)
Vishal Parikh, BS, BA, MS (1)
Juliana Ghukasyan (1)
Rae Lasko, BS (1)
Saad Godil, MEng (1)
Ashish Atreja, MD, MPH (3, 4)
Subhabrata Mukherjee, PhD (1)

1 Hippocratic AI, Palo Alto, California, USA
2 School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada
3 UC Davis Health, Davis, California, USA
4 VALID.AI, Davis, California, USA

Abstract

Background The deployment of artificial intelligence (AI) in healthcare necessitates robust safety validation frameworks, particularly for systems directly interacting with patients. While theoretical frameworks exist, there remains a critical gap between abstract principles and practical implementation. Traditional LLM benchmarking approaches provide very limited output coverage and are insufficient for healthcare applications requiring high safety standards.

Objective To develop and evaluate a comprehensive framework for healthcare AI safety validation through large-scale clinician engagement.

Methods We implemented the RWE-LLM (Real-World Evaluation of Large Language Models in Healthcare) framework, drawing inspiration from red teaming methodologies while expanding their scope to achieve comprehensive safety validation. Rather than relying solely on input data quality, our approach emphasizes output testing across four stages: pre-implementation, tiered review, resolution, and continuous monitoring. We engaged 6,234 US-licensed clinicians (5,969 nurses and 265 physicians) with an average of 11.5 years of clinical experience. The framework employed a three-tier review process for error detection and resolution, evaluating a non-diagnostic AI Care Agent focused on patient education, follow-ups, and administrative support across four iterations (pre-Polaris and Polaris 1.0, 2.0, and 3.0).
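
The escalation path through the tiered review can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the abstract states only that flagged interactions receive internal nursing review, followed by physician adjudication when necessary. The tier names, the `FlaggedCall` fields, and the `route` function are hypothetical, as is the assumption that the initial clinician flag counts as the first of the three tiers.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    # Hypothetical tier labels; the abstract does not name the tiers.
    CLINICIAN_FLAG = 1     # an evaluating clinician flags a potential error
    NURSE_REVIEW = 2       # internal nursing review of the flagged call
    PHYSICIAN_REVIEW = 3   # physician adjudication, only when necessary


@dataclass
class FlaggedCall:
    call_id: str
    concern: str
    # Assumed signal from the nursing review that a physician must adjudicate.
    nurse_requests_physician: bool = False
    history: list = field(default_factory=list)


def route(call: FlaggedCall) -> Tier:
    """Walk a flagged call through the tiers and return the last tier reached."""
    call.history.append(Tier.CLINICIAN_FLAG)
    call.history.append(Tier.NURSE_REVIEW)
    if call.nurse_requests_physician:
        call.history.append(Tier.PHYSICIAN_REVIEW)
    return call.history[-1]


if __name__ == "__main__":
    routine = FlaggedCall("call-001", "possible minor inaccuracy")
    escalated = FlaggedCall("call-002", "possible safety concern",
                            nurse_requests_physician=True)
    print(route(routine).name)
    print(route(escalated).name)
```

Recording the full `history` rather than only the final tier is a design choice made here so that each review step stays auditable; the source does not say how review steps were documented.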

Results Over 307,000 unique calls were evaluated using the RWE-LLM framework. Each interaction was subject to potential error flagging across multiple severity categories, from minor clinical inaccuracies to significant safety concerns. The multi-tiered review system successfully processed all flagged interactions, with internal nursing reviews providing initial expert evaluation followed by physician adjudication when necessary. The framework demonstrated effective throughput in addressing identified safety concerns while maintaining consistent processing times and documentation standards. Systematic improvements in safety protocols were achieved through a continuous feedback loop between error identification and system enhancement. Performance metrics demonstrated substantial safety improvements between iterations: correct medical advice rates rose from ∼80.0% (pre-Polaris) to 96.79% (Polaris 1.0), 98.75% (Polaris 2.0), and 99.38% (Polaris 3.0). Incorrect advice resulting in potential minor harm decreased from 1.32% to 0.13% and then 0.07%, and severe harm concerns were eliminated (0.06%, then 0.10%, then 0.00%).
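
To make the size of these improvements concrete, the reported correct-advice rates can be converted into the share of advice not rated correct, and into the relative reduction of that share between iterations. The sketch below uses only the percentages quoted in the abstract. The function names are illustrative, the pre-Polaris figure is approximate in the source, and "not rated correct" here is simply 100% minus the correct rate, which may span several error categories.

```python
# Correct medical advice rates (%) as reported in the abstract.
correct_pct = {
    "pre-Polaris": 80.0,   # reported as approximately 80.0%
    "Polaris 1.0": 96.79,
    "Polaris 2.0": 98.75,
    "Polaris 3.0": 99.38,
}


def not_correct_pct(correct: float) -> float:
    """Share of advice not rated correct, in percentage points."""
    return round(100.0 - correct, 2)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.5 means halved)."""
    return (before - after) / before


if __name__ == "__main__":
    names = list(correct_pct)
    for prev, curr in zip(names, names[1:]):
        before = not_correct_pct(correct_pct[prev])
        after = not_correct_pct(correct_pct[curr])
        drop = relative_reduction(before, after)
        print(f"{prev} -> {curr}: {before}% -> {after}% "
              f"({drop:.1%} relative reduction)")
```

On these figures the share not rated correct falls from about 20.0% to 3.21%, 1.25%, and 0.62%, which is a relative reduction of roughly 84%, 61%, and 50% at each step.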

Conclusions The successful nationwide implementation of the RWE-LLM framework establishes a practical model for ensuring AI safety in healthcare settings. Our methodology demonstrates that comprehensive output testing provides significantly stronger safety assurance than traditional input validation approaches used by horizontal LLMs. While resource-intensive, this approach proves that rigorous safety validation for healthcare AI systems is both necessary and achievable, setting a benchmark for future deployments.

Competing Interest Statement

MB, AM, MSA, MRD, HM, MV, MT, SBL, VP, JG, RL, SG, and SM are employees of Hippocratic AI, which provided funding for this study. JDA is an Adjunct Professor at the University of British Columbia and received compensation for work performed on this project. AA is an employee of UC Davis Health and received compensation for work performed on this project. All authors have reviewed and approved the manuscript and materials included in this submission.

Funding Statement

This research was supported by Hippocratic AI, Inc.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability Statement

Data supporting the results can be accessed by contacting the corresponding author.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted March 18, 2025.
medRxiv 2025.03.17.25324157; doi: https://doi.org/10.1101/2025.03.17.25324157

Subject Area

  • Health Informatics