An interdisciplinary, randomized, single-blind evaluation of state-of-the-art large language models for their implications and risks in medical diagnosis and management

Peikai Chen; Jifu Cai; Jiaying Zhou; Shaoxi Chen; Chenguang Xu; Lihua Yuan; Xiaoying Dai; Xiaowei Chen; Yanzhe Wei; Xia Li; Shaofeng Gong; Xiaolong Liang; Jiancheng Yang; Jun Jin; Kanglin Dai; Yuzhen Cui; Guan-Ming Kuang; Jianshen Xie; Libing Luo; Haibing Xiao; Shijie Yin; Jun Yang; Yulan Yan; Jianliang Chen; Yihua Chen; Qianshen Zhang; Qingshan Zhou; Lina Zhao; Min Wu; Xin Tang; Lei Rong; Zanxin Wang; Weifu Qiu; Yanli Wang; Liwen Cui; Xiangyang Li; Yong Hu; Huiren Tao; Nan Wu; Pearl Pai; Minxin Wei; Michael Kai-tsun To; Kenneth M.C. Cheung

doi:10.1101/2025.06.20.25326623

Abstract

Background State-of-the-art (SOTA) large language models (LLMs) are poised to revolutionize clinical medicine by transforming diagnostic, therapeutic, and interdisciplinary reasoning. Despite their promising capabilities, rigorous benchmarking of these models is essential to address concerns about their clinical proficiency and safety, particularly in high-risk environments.

Methods This study implemented a multi-disciplinary, randomized, single-blind evaluation framework involving 27 experienced specialty clinicians with an average of 25.9 years of practice. The assessment covered 685 simulated and real clinical cases across 13 subspecialties, including both common and rare conditions. Evaluators rated LLM responses on medical strength (0–10 scale, where >9.5 signified leading expert proficiency) and hallucination severity (0–5 scale for fabricated or misleading medical elements). Seven SOTA LLMs were tested, including top-ranked models from the ARENA leaderboard, with statistical analyses applied to adjust for confounders such as response length.

Findings The evaluation revealed clinical plausibility in general-purpose LLMs, with Gemini 2.0 Flash leading raw scores and DeepSeek R1 excelling in adjusted analyses. Top models demonstrated proficiency comparable to that of a physician with 6 years of post-qualification experience (score ∼6.0), yet significant risks were noted. Instances of incompetence (scores ≤4) were detected across specialties, and 40 hallucination instances were identified, involving fabricated conditions, fabricated medications, and classification errors. These findings underscore the importance of implementing stringent safeguards to mitigate potential adverse outcomes in clinical applications.

Interpretation While SOTA LLMs show substantial promise in enhancing clinical reasoning and decision-making, their unguarded application in medicine could present serious risks, such as misinformation and diagnostic errors. Human expert oversight remains crucial, particularly given the incompetence and hallucination risks reported here. Larger, multi-center studies are warranted to evaluate the real-world performance of these models and track their evolution before broader clinical adoption.
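The Methods mention adjusting scores for confounders such as response length, without saying in this abstract which procedure was used. As an illustration only, one common approach is to regress the score on response length and keep the residuals, re-centred on the overall mean. The sketch below assumes a simple linear relationship; the function name and the sample numbers are hypothetical and are not taken from the study or its repository.

```python
from statistics import mean


def length_adjusted_scores(scores, lengths):
    """Remove the linear effect of response length from scores.

    Fits score = a + b * length by ordinary least squares, then returns
    each residual re-centred on the overall mean score.
    This is an illustrative assumption, not the authors' stated method.
    """
    if len(scores) != len(lengths) or len(scores) < 2:
        raise ValueError("need two or more paired observations")
    mean_score = mean(scores)
    mean_len = mean(lengths)
    # Sum of squared deviations of length (denominator of the OLS slope).
    sxx = sum((x - mean_len) ** 2 for x in lengths)
    if sxx == 0:
        return list(scores)  # no variation in length: nothing to adjust
    sxy = sum((x - mean_len) * (y - mean_score) for x, y in zip(lengths, scores))
    slope = sxy / sxx
    # residual + mean_score simplifies to y - slope * (x - mean_len)
    return [y - slope * (x - mean_len) for x, y in zip(lengths, scores)]


# Hypothetical ratings (0-10 scale) and response lengths in words.
scores = [5.0, 6.0, 7.0, 8.0]
lengths = [100, 200, 300, 400]
adjusted = length_adjusted_scores(scores, lengths)
```

In this made-up example the ratings rise exactly in step with length, so the adjusted values all collapse to the overall mean: once length is accounted for, no difference between the responses remains. Real ratings would leave non-zero residual differences, and a fuller analysis would likely include further covariates such as specialty or evaluator.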

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by the Sanming Project of Medicine in Shenzhen (SZSM202311022), the Shenzhen Clinical Research Center for Rare Diseases (LCYSSQ20220823091402005), the Shenzhen Key Medical Discipline Construction Fund (SZXK2020084 and SZXK077), and the Shenzhen Science and Technology Major Project of China (KJZD20240903102759061).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethical approval for this study was obtained from The University of Hong Kong - Shenzhen Hospital Ethics Committee with ID Ethics-2025-078.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

The authors declare that they have no conflicts of interest.

Data Availability

The code for processing the LLM responses, running the web system, and analyzing the resulting data was deposited in an open repository (https://github.com/HKUSZH/LLMMed). The JSON files for the model responses, evaluation scores, and sample questions were also uploaded to the same repository. Complete question sets are available upon request to the corresponding authors.

https://github.com/HKUSZH/LLMMed