Evaluating Diagnostic Accuracy and Clinical Reasoning of Multiple Large Language Models in Psychiatry

Kevin W. Jin, Yasna Rostam-Abadi, Pooja Chaudhary, Margaret A. Garrett, Ashley S. Huang, Mario Montelongo, Caesa Nagpal, Jasperina Shei, Judah Weathers, Juliana S. Zhang, Qingyu Chen, Jiyeong Kim, Matteo Malgaroli, Walter S. Mathis, Carolyn I. Rodriguez, Salih Selek, Manu S. Sharma, Christopher Pittenger, Sarah W. Yip, Brian A. Zaboski, Hua Xu
doi: https://doi.org/10.64898/2026.02.03.26345402
Kevin W. Jin
1Program in Computational Biology and Biomedical Informatics, Yale University, New Haven, CT
2Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT
BS
  • For correspondence: kevin.jin{at}yale.edu
Yasna Rostam-Abadi
3Department of Psychiatry, Yale School of Medicine, New Haven, CT
MD
Pooja Chaudhary
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Margaret A. Garrett
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Ashley S. Huang
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Mario Montelongo
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Caesa Nagpal
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Jasperina Shei
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Judah Weathers
5Yale Child Study Center, Yale School of Medicine, New Haven, CT
DPhil, MD
Juliana S. Zhang
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Qingyu Chen
2Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT
PhD
Jiyeong Kim
6Stanford Center for Digital Health, Department of Medicine, Stanford University, Stanford, CA
PhD
Matteo Malgaroli
7Department of Psychiatry, New York University Grossman School of Medicine, New York, NY
8New York University Center for Data Science, New York, NY
PhD
Walter S. Mathis
3Department of Psychiatry, Yale School of Medicine, New Haven, CT
MD
Carolyn I. Rodriguez
9Department of Psychiatry and Behavioral Sciences, Stanford School of Medicine, Stanford, CA
10Veterans Affairs Palo Alto Health Care System, Palo Alto, CA
MD, PhD
Salih Selek
4Louis A. Faillace, MD, Department of Psychiatry and Behavioral Sciences, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX
MD
Manu S. Sharma
3Department of Psychiatry, Yale School of Medicine, New Haven, CT
11The Institute of Living, Hartford Healthcare, Hartford, CT
MD
Christopher Pittenger
3Department of Psychiatry, Yale School of Medicine, New Haven, CT
5Yale Child Study Center, Yale School of Medicine, New Haven, CT
12Department of Psychology, Yale University, New Haven, CT
13Department of Neuroscience, Yale School of Medicine, New Haven, CT
14Wu Tsai Institute, Yale University, New Haven, CT
15Center for Brain & Mind Health, Yale School of Medicine, New Haven, CT
MD, PhD
Sarah W. Yip
3Department of Psychiatry, Yale School of Medicine, New Haven, CT
5Yale Child Study Center, Yale School of Medicine, New Haven, CT
14Wu Tsai Institute, Yale University, New Haven, CT
15Center for Brain & Mind Health, Yale School of Medicine, New Haven, CT
PhD, MSc
Brian A. Zaboski
3Department of Psychiatry, Yale School of Medicine, New Haven, CT
PhD
Hua Xu
2Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT
14Wu Tsai Institute, Yale University, New Haven, CT
PhD

Summary

Background Existing large language model (LLM) evaluations rely on accuracy benchmarks that fail to capture whether models reason well while making diagnoses. Studies that do analyse reasoning focus on post hoc explanations accompanying model outputs rather than distinct, clinician-visible artifacts such as detailed reasoning traces. This creates a translational gap in domains such as psychiatry, where diagnosis relies on narrative interpretation, diagnostic reasoning, and clinical judgment under uncertainty.

Methods We conducted a mixed-methods evaluation of four state-of-the-art LLMs using a clinician-curated dataset of 196 psychiatric case vignettes, including 135 published cases and 61 novel clinician-authored vignettes. Diagnostic accuracy was assessed using multiple metrics (top-1 accuracy, top-5 accuracy, recall@5, and mean reciprocal rank) based on ranked lists of five differential diagnoses per vignette. Clinical reasoning quality was evaluated on a randomly selected subset of 30 vignettes through clinician assessment of model-generated diagnostic reasoning traces alongside qualitative commentary from board-certified psychiatrists. We examined the association between clinician-rated reasoning quality and diagnostic correctness and included an illustrative comparison with psychiatry residents.
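The four accuracy metrics named above can be illustrated with a minimal sketch. It is not the study's evaluation code: the function name `rank_metrics`, the use of exact string matching between predicted and reference diagnoses, the possibility of several reference diagnoses per vignette, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def rank_metrics(ranked: List[List[str]], gold: List[Set[str]], k: int = 5) -> Dict[str, float]:
    """Average top-1 accuracy, top-k accuracy, recall@k and MRR over vignettes.

    ranked[i] is the model's ordered differential for vignette i (best first);
    gold[i] is the non-empty set of reference diagnoses for that vignette.
    Matching is by exact string equality (an assumption of this sketch).
    """
    if not ranked or len(ranked) != len(gold):
        raise ValueError("ranked and gold must be non-empty and of equal length")
    top1 = topk = recall = mrr = 0.0
    for preds, truth in zip(ranked, gold):
        if not truth:
            raise ValueError("each vignette needs at least one reference diagnosis")
        preds = preds[:k]
        # Zero-based positions in the top-k list that match a reference diagnosis.
        hits = [i for i, p in enumerate(preds) if p in truth]
        top1 += 1.0 if hits and hits[0] == 0 else 0.0
        topk += 1.0 if hits else 0.0
        # Fraction of reference diagnoses recovered within the top k.
        recall += len({preds[i] for i in hits}) / len(truth)
        # Reciprocal rank of the first correct diagnosis, 0 if none appears.
        mrr += 1.0 / (hits[0] + 1) if hits else 0.0
    n = len(ranked)
    return {"top1": top1 / n, "topk": topk / n, "recall_at_k": recall / n, "mrr": mrr / n}


# Toy data, invented for illustration only.
toy_ranked = [
    ["MDD", "GAD", "PTSD", "BPD", "OCD"],
    ["schizophrenia", "bipolar I", "MDD", "GAD", "OCD"],
    ["GAD", "OCD", "PTSD", "MDD", "ADHD"],
]
toy_gold = [{"MDD"}, {"bipolar I", "GAD"}, {"anorexia nervosa"}]
toy_scores = rank_metrics(toy_ranked, toy_gold)
print(toy_scores)
```

In this sketch, top-5 accuracy asks whether any reference diagnosis appears in the list, whereas recall@5 asks what fraction of the reference diagnoses appear, so the two differ only when a vignette has more than one reference diagnosis.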

Findings Clinician-rated diagnostic reasoning quality was strongly associated with diagnostic correctness in mixed-effects logistic regression analyses (β = 1·80; p < 0·001), whereas data extraction quality alone was not. Across the full vignette set, models demonstrated moderate to high diagnostic accuracy. The highest-performing model achieved a top-5 accuracy of 0·801 and also received the highest clinician-rated reasoning scores. In an illustrative comparison, model diagnostic accuracy fell within the range observed for psychiatry residents.
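To make the reported coefficient concrete, one plausible form of such a model is sketched below. The random-effects structure (a random intercept per vignette), the predictor symbols, and the rating scale are assumptions made for illustration; the summary does not specify them.

```latex
% Y_{ij}: 1 if model i's diagnosis for vignette j is correct, 0 otherwise
% R_{ij}: clinician-rated reasoning quality; E_{ij}: data extraction quality
% u_j: random intercept for vignette j (assumed structure)
\operatorname{logit}\,\Pr(Y_{ij} = 1)
  = \beta_0 + \beta_R R_{ij} + \beta_E E_{ij} + u_j,
  \qquad u_j \sim \mathcal{N}(0, \sigma_u^2)

% With the reported \beta_R = 1.80, the conditional odds ratio per
% one-unit increase in the reasoning rating is
\mathrm{OR}_R = e^{\beta_R} = e^{1.80} \approx 6.05
```

Under this reading, each one-unit increase in the reasoning rating multiplies the odds of a correct diagnosis by roughly six, holding the other terms fixed.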

Interpretation Diagnostic reasoning quality captures clinically meaningful variation in LLM performance beyond accuracy metrics. Psychiatry may represent a stringent testbed for evaluating reasoning in narrative-driven clinical domains. Evaluation frameworks for LLM-based clinical decision support should incorporate structured assessment of reasoning processes, not accuracy alone.

Evidence before this study We searched PubMed and Scopus for studies evaluating large language models for psychiatric diagnosis and/or differential diagnosis from text vignettes. Searches were run from database inception to February 6, 2026, using terms including (“large language model” OR LLM OR “artificial intelligence” OR “generative AI” OR AI OR ChatGPT OR GPT OR Claude OR Gemini OR DeepSeek OR Llama) AND (psychiatr* OR mental OR DSM OR “differential diagnosis” OR diagnos*) AND (vignette OR case OR “case report”). We included empirical studies that evaluated model diagnostic outputs using psychiatric cases/vignettes and excluded editorials/commentaries and studies that did not report case-level diagnostic performance. We did not formally assess risk of bias/quality; studies were heterogeneous in vignette sources, model access, and outcome definitions, precluding quantitative pooling. Prior studies have typically examined small vignette sets, focused on narrow diagnostic domains, evaluated single models, or relied primarily on outcome-based accuracy metrics. When diagnostic reasoning has been assessed, it has usually been inferred from post hoc explanations accompanying model outputs rather than evaluated as a distinct, clinician-visible artifact. Clinician-grounded evaluations of diagnostic reasoning across multiple contemporary models remain limited.

Added value of this study This study provides a large-scale, clinician-grounded evaluation of diagnostic accuracy and diagnostic reasoning quality across four contemporary large language models using a diverse dataset of psychiatric case vignettes. Rather than relying solely on outcome-based explanations, we directly evaluated model-generated diagnostic reasoning traces as clinician-visible artifacts using structured clinician ratings and qualitative analysis. By integrating multiple accuracy metrics with clinician assessment of reasoning coherence, flexibility, and plausibility, we demonstrate that clinician-rated reasoning quality is strongly associated with diagnostic correctness, whereas data extraction quality alone is not. Our analysis also identifies recurrent reasoning failure modes not captured by accuracy metrics, highlighting psychiatry as a stringent testbed for evaluating reasoning in narrative-driven clinical domains.

Implications of all the available evidence Evaluations of large language models for clinical decision support should extend beyond accuracy to include systematic assessment of clinician-visible diagnostic reasoning. Mixed-methods, clinician-grounded evaluation frameworks that examine both diagnostic outcomes and reasoning artifacts may be critical for responsible assessment of LLMs in psychiatry and other areas of medicine where diagnosis depends on interpretation, judgment, and tolerance of uncertainty.

Competing Interest Statement

In the last 3 years, CIR has served as a consultant for Biohaven Pharmaceuticals and Osmind; and receives research grant support from Biohaven Pharmaceuticals, a stipend from American Psychiatric Association Publishing for her role as Deputy Editor at The American Journal of Psychiatry, a stipend from Springer Nature for her role as Deputy Editor for Neuropsychopharmacology, and book royalties from American Psychiatric Association Publishing. The remaining authors report no financial or other relationship relevant to the subject of this manuscript.

Funding Statement

This study received material support in the form of API credits from the OpenAI Researcher Access Program and the Google Gemini Academic Program. KWJ is supported by the National Science Foundation's Graduate Research Fellowship Program and was formerly supported by the National Library of Medicine's T15 University-based Biomedical Informatics and Data Science Training Program. MM was supported by the National Institute of Mental Health under grant K23MH134068. JK is supported by the NIH (K01MH137386). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

  • Updated framing to be more relevant to digital health audiences, to reflect the most recent journal submission.

Data Availability

Analysis code, evaluation prompts, and derived evaluation outputs will be publicly available in a repository upon publication. Clinician-authored fictitious vignettes will also be publicly available. We will not publicly redistribute text derived from published case reports or verbatim model reasoning traces; citations to original sources will be provided, and access to restricted materials may be provided under controlled conditions (eg, to qualified researchers under a data-use agreement and/or institutional approval).

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted February 11, 2026.

Subject Area

  • Health Informatics
  • Psychiatry and Clinical Psychology