RT Journal Article
SR Electronic
T1 From Tool to Teammate: A Randomized Controlled Trial of Clinician-AI Collaborative Workflows for Diagnosis
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2025.06.07.25329176
DO 10.1101/2025.06.07.25329176
A1 Everett, Selin S.
A1 Bunning, Bryan J.
A1 Jain, Priyank
A1 Lopez, Ivan
A1 Agarwal, Anup
A1 Desai, Manisha
A1 Gallo, Robert
A1 Goh, Ethan
A1 Kadiyala, Vinay B.
A1 Kanjee, Zahir
A1 Koshy, Jacob M.
A1 Olson, Andrew
A1 Rodman, Adam
A1 Schulman, Kevin
A1 Strong, Eric
A1 Chen, Jonathan H.
A1 Horvitz, Eric
YR 2025
UL http://medrxiv.org/content/early/2025/06/08/2025.06.07.25329176.abstract
AB Early studies of large language models (LLMs) in clinical settings have largely treated artificial intelligence (AI) as a tool rather than an active collaborator. As LLMs now demonstrate expert-level diagnostic performance, the focus shifts from whether AI can offer valuable suggestions to how it can be effectively integrated into physicians’ diagnostic workflows. We conducted a randomized controlled trial (n=70 clinicians) to evaluate the value of employing a custom GPT system designed to engage collaboratively with clinicians on diagnostic reasoning challenges. The collaborative design began with independent diagnostic assessments from both the clinician and the AI. These were then combined in an AI-generated synthesis that integrated the two perspectives, highlighting points of agreement and disagreement and offering commentary on each. We evaluated two workflow variants: one where the AI provided an initial opinion (AI-first), and another where it followed the clinician’s assessment (AI-second). Clinicians using either collaborative workflow outperformed those using traditional tools, achieving average accuracies of 85% (AI-first) and 82% (AI-second), compared to 75% with traditional resources (p < 0.0004 and p < 0.00001; mean differences = 9.8% and 6.8%; 95% CI = 4.6%–15% and 4.0%–9.6%). Performance did not differ significantly between workflows or from the AI-alone score of 90%.
These results underscore the value of collaborative AI systems that complement clinician expertise and foster effective coordination between human and machine reasoning in diagnostic decision-making. Competing Interest Statement: The authors have declared no competing interest. Clinical Trial: NCT06911645. Funding Statement: This work was funded by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), Stanford Medical Scholars Research Program, Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program, and the Gordon and Betty Moore Foundation [Grant #12409]. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Ethics committee/IRB of Stanford University gave ethical approval for this work. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. Data is shared in the body and supplementary information.