ABSTRACT
Importance Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning.
Objective To assess the impact of the GPT-4 LLM on physicians’ diagnostic reasoning compared to conventional resources.
Design Multi-center, randomized clinical vignette study.
Setting The study was conducted using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions.
Participants Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine.
Intervention(s) Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to conventional resources alone. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams.
Main Outcome(s) and Measure(s) The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis accuracy.
Results 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group.
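As a reading aid for the figures above, the short Python sketch below compares each reported difference with the simple gap between the two group medians and checks whether each 95% CI excludes zero. All numbers are copied from the Results paragraph. The helper names (`raw_median_gap`, `ci_excludes_zero`) are hypothetical and not part of the study's analysis. The sketch assumes that the reported differences are model-based estimates rather than plain subtractions of medians, which would explain why they do not match the raw gaps. The abstract states this explicitly only for the score difference ("adjusted").

```python
# Figures copied verbatim from the Results paragraph of the abstract.
# Helper names are hypothetical; this is not the authors' analysis code.
score = {"gpt4_median": 76.3, "conventional_median": 73.7,
         "reported_diff": 1.6, "ci": (-4.4, 7.6)}      # percentage points
time_s = {"gpt4_median": 519, "conventional_median": 565,
          "reported_diff": -82, "ci": (-195, 31)}      # seconds
gpt4_alone = {"reported_diff": 15.5, "ci": (1.5, 29.0)}  # vs conventional group


def raw_median_gap(result):
    """Unadjusted gap: GPT-4 group median minus conventional group median."""
    return round(result["gpt4_median"] - result["conventional_median"], 1)


def ci_excludes_zero(ci):
    """True when the whole interval lies on one side of zero."""
    low, high = ci
    return low > 0 or high < 0


if __name__ == "__main__":
    for name, result in (("score", score), ("time", time_s)):
        print(name,
              "raw gap:", raw_median_gap(result),
              "reported:", result["reported_diff"],
              "CI excludes 0:", ci_excludes_zero(result["ci"]))
    print("GPT-4 alone CI excludes 0:", ci_excludes_zero(gpt4_alone["ci"]))
```

Read this way, the two physician-group comparisons have intervals that span zero, in line with the reported p-values of 0.60 and 0.20. The GPT-4-alone comparison has an interval entirely above zero, in line with p=0.03.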
Conclusions and Relevance In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.
Competing Interest Statement
The authors have declared no competing interest.
Clinical Trial
NCT06157944
Funding Statement
Dr Ethan Goh, Dr Jason Hom, Dr Eric Strong, Yingjie Weng, Dr Josephine Cool, Dr Zahir Kanjee, Dr Andrew P.J. Olson, and Dr Adam Rodman are funded by the Gordon and Betty Moore Foundation. Dr Robert Gallo is supported by a VA Advanced Fellowship in Medical Informatics. The views expressed are those of the authors and not necessarily those of the Department of Veterans Affairs or those of the United States government. Dr Arnold Milstein is funded by pooled philanthropic gifts to Stanford University and research funding from Stanford Healthcare and Stanford Children's Health. Dr Jonathan H. Chen is funded by: NIH/National Institute of Allergy and Infectious Diseases (1R01AI17812101); NIH/National Institute on Drug Abuse Clinical Trials Network (UG1DA015815 - CTN-0136); Gordon and Betty Moore Foundation (Grant #12409); Stanford Artificial Intelligence in Medicine and Imaging - Human-Centered Artificial Intelligence (AIMI-HAI) Partnership Grant; Doris Duke Charitable Foundation - Covid-19 Fund to Retain Clinical Scientists (20211260); Google, Inc. (research collaboration Co-I to leverage EHR data to predict a range of clinical outcomes); and American Heart Association - Strategically Focused Research Network - Diversity in Clinical Trials.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The IRB of Stanford University gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present study are available upon reasonable request to the authors.