medRxiv

Performance of ChatGPT on free-response, clinical reasoning exams

Eric Strong, Alicia DiGiammarino, Yingjie Weng, Preetha Basaviah, Poonam Hosamani, Andre Kumar, Andrew Nevins, John Kugler, Jason Hom, Jonathan H Chen
doi: https://doi.org/10.1101/2023.03.24.23287731
Eric Strong
1Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA
MD
  • For correspondence: estrong{at}stanford.edu
Alicia DiGiammarino
2Office of Medical Education, Stanford University School of Medicine, Stanford, CA
MS
Yingjie Weng
3Quantitative Sciences Unit, Stanford University, Stanford, CA
MHS
Preetha Basaviah
4Primary Care and Population Health, Stanford University School of Medicine, Stanford, CA
MD
Poonam Hosamani
1Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA
MD
Andre Kumar
1Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA
MD MEd
Andrew Nevins
5Division of Infectious Diseases, Stanford University School of Medicine, Stanford, CA
MD
John Kugler
1Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA
MD
Jason Hom
1Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA
MD
Jonathan H Chen
1Division of Hospital Medicine, Stanford University School of Medicine, Stanford, CA
6Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, CA
7Clinical Excellence Research Center, Stanford University School of Medicine, Stanford, CA
MD PhD

Abstract

Importance Studies show that ChatGPT, a general-purpose large language model chatbot, could pass the multiple-choice US Medical Licensing Exams, but the model’s performance on open-ended clinical reasoning is unknown.

Objective To determine whether ChatGPT is capable of consistently meeting the passing threshold on free-response, case-based clinical reasoning assessments.

Design Fourteen multi-part cases were selected from clinical reasoning exams administered to pre-clerkship medical students between 2019 and 2022. For each case, the questions were run through ChatGPT twice and responses were recorded. Two clinician educators independently graded each run according to a standardized grading rubric. To further assess the degree of variation in ChatGPT’s performance, we repeated the analysis on a single high-complexity case 20 times.

Setting A single US medical school.

Participants ChatGPT

Main Outcomes and Measures Passing rate of ChatGPT’s scored responses and the range in model performance across multiple run-throughs of a single case.

Results Twelve of the 28 ChatGPT exam responses (43%) achieved a passing score, with a mean score of 69% (95% CI: 65% to 73%) compared with the established passing threshold of 70%. When given the same case 20 separate times, ChatGPT’s scores on that case ranged from 56% to 81%.
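
The headline figures above follow from simple arithmetic, sketched below using only numbers stated in the abstract. The constant names and the `passes` helper are hypothetical, and the rule that a score exactly at the threshold counts as a pass is an assumption; the abstract does not state how boundary scores were handled.

```python
# Figures taken from the abstract; names are illustrative only.
CASES = 14                # multi-part cases selected
RUNS_PER_CASE = 2         # each case was run through ChatGPT twice
PASSING_RESPONSES = 12    # responses that achieved a passing score
PASSING_THRESHOLD = 70.0  # established passing threshold, in percent

# 14 cases x 2 runs gives the 28 graded responses.
total_responses = CASES * RUNS_PER_CASE

# Share of responses that passed.
pass_rate = PASSING_RESPONSES / total_responses


def passes(score_percent: float) -> bool:
    """Assumed rule: a response passes when its score is at or above the threshold."""
    return score_percent >= PASSING_THRESHOLD


print(total_responses)
print(f"{pass_rate:.0%}")

# The reported extremes of the 20 repeated runs fall on opposite sides of the threshold.
print(passes(56.0), passes(81.0))
```

Under that assumed rule, the lowest and highest scores from the 20 repeated runs of the single high-complexity case sit on opposite sides of the passing threshold.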

Conclusions and Relevance ChatGPT’s ability to achieve a passing performance in nearly half of the cases analyzed demonstrates the need to revise clinical reasoning assessments and incorporate artificial intelligence (AI)-related topics into medical curricula and practice.

Competing Interest Statement

Dr. Hom reported receiving grant funding from the NIH/Undiagnosed Diseases Network (5U01HG010218-04). Dr. Hom reported receiving consulting fees from MORE Health, Inc. Dr. Chen reported receiving grants from the NIH/National Institute on Drug Abuse Clinical Trials Network (UG1DA015815-CTN-0136), Stanford Artificial Intelligence in Medicine and Imaging - Human-Centered Artificial Intelligence Partnership Grant, Doris Duke Charitable Foundation - Covid-19 Fund to Retain Clinical Scientists (20211260), Google Inc (in a research collaboration to leverage health data to predict clinical outcomes), and the American Heart Association - Strategically Focused Research Network - Diversity in Clinical Trials. Dr. Chen reported receiving consulting fees from Sutton Pierce and Younker Hyde MacFarlane PLLC and being a co-founder of Reaction Explorer LLC, a company that develops and licenses organic chemistry education software using rule-based artificial intelligence technology.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • * Co-First Authors

  • + Co-Last Authors

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted March 29, 2023.

Subject Area

  • Medical Education