Evaluating Diagnostic Accuracy and Clinical Reasoning of Multiple Large Language Models in Psychiatry

Kevin W. Jin; Yasna Rostam-Abadi; Pooja Chaudhary; Margaret A. Garrett; Ashley S. Huang; Mario Montelongo; Caesa Nagpal; Jasperina Shei; Judah Weathers; Juliana S. Zhang; Qingyu Chen; Jiyeong Kim; Matteo Malgaroli; Walter S. Mathis; Carolyn I. Rodriguez; Salih Selek; Manu S. Sharma; Christopher Pittenger; Sarah W. Yip; Brian A. Zaboski; Hua Xu

doi:10.64898/2026.02.03.26345402

Summary

Background Existing large language model (LLM) evaluations rely on accuracy benchmarks that fail to capture whether models reason well while making diagnoses. Studies that do analyse reasoning focus on post hoc explanations accompanying model outputs rather than distinct, clinician-visible artifacts such as detailed reasoning traces. This creates a translational gap in domains such as psychiatry, where diagnosis relies on narrative interpretation, diagnostic reasoning, and clinical judgment under uncertainty.

Methods We conducted a mixed-methods evaluation of four state-of-the-art LLMs using a clinician-curated dataset of 196 psychiatric case vignettes, including 135 published cases and 61 novel clinician-authored vignettes. Diagnostic accuracy was assessed using multiple metrics (top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank) based on ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes through clinician assessment of model-generated diagnostic reasoning traces alongside qualitative commentary from board-certified psychiatrists. We examined the association between clinician-rated reasoning quality and diagnostic correctness and included an illustrative comparison with psychiatry residents.
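The four accuracy metrics named above can be illustrated with a minimal sketch. It assumes each vignette has a set of one or more reference diagnoses and that a predicted diagnosis counts as correct on an exact string match; the function names, the example labels, and the matching rule are hypothetical and are not taken from the study's code, which may rely on clinician adjudication of matches.

```python
from typing import Dict, List, Optional, Sequence, Set


def first_hit_rank(ranked: Sequence[str], gold: Set[str]) -> Optional[int]:
    """Return the 1-based rank of the first correct diagnosis, or None."""
    for position, diagnosis in enumerate(ranked, start=1):
        if diagnosis in gold:
            return position
    return None


def evaluate(predictions: List[Sequence[str]],
             gold_sets: List[Set[str]],
             k: int = 5) -> Dict[str, float]:
    """Average top-1, top-k, recall@k, and reciprocal rank over vignettes.

    Assumes every gold set is non-empty and matching is exact (hypothetical).
    """
    n = len(predictions)
    top1 = topk = recall = mrr = 0.0
    for ranked, gold in zip(predictions, gold_sets):
        ranked = list(ranked)[:k]          # keep only the first k diagnoses
        rank = first_hit_rank(ranked, gold)
        top1 += rank == 1                  # correct diagnosis ranked first
        topk += rank is not None           # any correct diagnosis in the list
        recall += len(set(ranked) & gold) / len(gold)
        mrr += 1.0 / rank if rank else 0.0
    return {"top1": top1 / n, "topk": topk / n,
            "recall_at_k": recall / n, "mrr": mrr / n}


if __name__ == "__main__":
    # Illustrative labels only; not drawn from the study dataset.
    preds = [["MDD", "GAD", "PTSD", "BD", "OCD"],
             ["GAD", "MDD", "PTSD", "BD", "OCD"],
             ["A", "B", "C", "D", "E"]]
    gold = [{"MDD"}, {"MDD", "PTSD"}, {"Z"}]
    print(evaluate(preds, gold))
```

Top-5 accuracy and recall@5 differ only when a vignette has more than one reference diagnosis: the former asks whether any reference diagnosis appears in the list, the latter what fraction of them do.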

Findings Clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1·80; p < 0·001), whereas data extraction quality alone was not. Across the full vignette set, models demonstrated moderate to high diagnostic accuracy. The highest-performing model achieved a top-5 accuracy of 0·801 and also received the highest clinician-rated reasoning scores. In an illustrative comparison, model diagnostic accuracy fell within the range observed for psychiatry residents.
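For readers interpreting the coefficient above, one plausible form of the model is sketched here. The random-effects structure (a random intercept per vignette), the index and variable symbols, and the scaling of the reasoning score are assumptions; the summary does not specify them.

```latex
% y_{ij}: 1 if model i's diagnosis for vignette j is correct (assumed coding)
% r_{ij}: clinician-rated reasoning quality; e_{ij}: data extraction quality
% u_j: vignette-level random intercept (assumed structure)
\operatorname{logit}\Pr(y_{ij}=1)
  = \beta_0 + \beta_{\text{reason}}\, r_{ij} + \beta_{\text{extract}}\, e_{ij} + u_j,
  \qquad u_j \sim \mathcal{N}(0, \sigma_u^2)

% If beta = 1.80 is on the log-odds scale, a one-unit increase in r_{ij}
% multiplies the odds of a correct diagnosis by:
\exp(\beta_{\text{reason}}) = \exp(1.80) \approx 6.05
```

Under these assumptions, a one-unit higher reasoning rating corresponds to roughly sixfold higher odds of a correct diagnosis, holding extraction quality and vignette fixed.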

Interpretation Diagnostic reasoning quality captures clinically meaningful variation in LLM performance beyond accuracy metrics. Psychiatry may represent a stringent testbed for evaluating reasoning in narrative-driven clinical domains. Evaluation frameworks for LLM-based clinical decision support should incorporate structured assessment of reasoning processes, not accuracy alone.

Evidence before this study We searched PubMed and Scopus for studies evaluating large language models for psychiatric diagnosis and/or differential diagnosis from text vignettes. Searches were run from database inception to February 6, 2026, using terms including (“large language model” OR LLM OR “artificial intelligence” OR “generative AI” OR AI OR ChatGPT OR GPT OR Claude OR Gemini OR DeepSeek OR Llama) AND (psychiatr* OR mental OR DSM OR “differential diagnosis” OR diagnos*) AND (vignette OR case OR “case report”). We included empirical studies that evaluated model diagnostic outputs using psychiatric cases/vignettes and excluded editorials/commentaries and studies that did not report case-level diagnostic performance. We did not formally assess risk of bias/quality; studies were heterogeneous in vignette sources, model access, and outcome definitions, precluding quantitative pooling. Prior studies have typically examined small vignette sets, focused on narrow diagnostic domains, evaluated single models, or relied primarily on outcome-based accuracy metrics. When diagnostic reasoning has been assessed, it has usually been inferred from post hoc explanations accompanying model outputs rather than evaluated as a distinct, clinician-visible artifact. Clinician-grounded evaluations of diagnostic reasoning across multiple contemporary models remain limited.

Added value of this study This study provides a large-scale, clinician-grounded evaluation of diagnostic accuracy and diagnostic reasoning quality across four contemporary large language models using a diverse dataset of psychiatric case vignettes. Rather than relying solely on outcome-based explanations, we directly evaluated model-generated diagnostic reasoning traces as clinician-visible artifacts using structured clinician ratings and qualitative analysis. By integrating multiple accuracy metrics with clinician assessment of reasoning coherence, flexibility, and plausibility, we demonstrate that clinician-rated reasoning quality is strongly associated with diagnostic correctness, whereas data extraction quality alone is not. Our analysis also identifies recurrent reasoning failure modes not captured by accuracy metrics, highlighting psychiatry as a stringent testbed for evaluating reasoning in narrative-driven clinical domains.

Implications of all the available evidence Evaluations of large language models for clinical decision support should extend beyond accuracy to include systematic assessment of clinician-visible diagnostic reasoning. Mixed-methods, clinician-grounded evaluation frameworks that examine both diagnostic outcomes and reasoning artifacts may be critical for responsible assessment of LLMs in psychiatry and other areas of medicine where diagnosis depends on interpretation, judgment, and tolerance of uncertainty.

Competing Interest Statement

In the last 3 years, CIR has served as a consultant for Biohaven Pharmaceuticals and Osmind, and receives research grant support from Biohaven Pharmaceuticals, a stipend from American Psychiatric Association Publishing for her role as Deputy Editor at The American Journal of Psychiatry, a stipend from Springer Nature for her role as Deputy Editor for Neuropsychopharmacology, and book royalties from American Psychiatric Association Publishing. The remaining authors report no financial or other relationship relevant to the subject of this manuscript.

Funding Statement

This study received material support in the form of API credits from the OpenAI Researcher Access Program and the Google Gemini Academic Program. KWJ is supported by the National Science Foundation's Graduate Research Fellowship Program and was formerly supported by the National Library of Medicine's T15 University-based Biomedical Informatics and Data Science Training Program. MM was supported by the National Institute of Mental Health under grant K23MH134068. JK is supported by the NIH (K01MH137386). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

Updated framing to be more relevant to digital health audiences, to reflect the most recent journal submission.

Data Availability

Analysis code, evaluation prompts, and derived evaluation outputs will be publicly available in a repository upon publication. Clinician-authored fictitious vignettes will also be publicly available. We will not publicly redistribute text derived from published case reports or verbatim model reasoning traces; citations to original sources will be provided, and access to restricted materials may be provided under controlled conditions (eg, to qualified researchers under a data-use agreement and/or institutional approval).