medRxiv

Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment

Thomas Savage, John Wang, Robert Gallo, Abdessalem Boukil, Vishwesh Patel, Seyed Amir Ahmad Safavi-Naini, Ali Soroush, Jonathan H Chen
doi: https://doi.org/10.1101/2024.06.06.24308399
Thomas Savage
1 Department of Medicine, Stanford University, Stanford, California
2 Division of Hospital Medicine, Stanford University, Stanford, CA
For correspondence: tsavage{at}stanford.edu
John Wang
1 Department of Medicine, Stanford University, Stanford, California
3 Division of Gastroenterology and Hepatology, Department of Medicine, Stanford University, Palo Alto, CA, United States
Robert Gallo
4 Palo Alto Veterans Affairs Medical Center
5 Department of Health Policy, Stanford University, Stanford CA
Abdessalem Boukil
6 Linguamind AI
Vishwesh Patel
Seyed Amir Ahmad Safavi-Naini
7 Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York
Ali Soroush
7 Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York
8 The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York
9 Henry D. Janowitz Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, New York
Jonathan H Chen
1 Department of Medicine, Stanford University, Stanford, California
2 Division of Hospital Medicine, Stanford University, Stanford, CA
10 Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA
11 Clinical Excellence Research Center, Stanford University, Stanford, CA

Abstract

Introduction The inability of Large Language Models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to measure uncertainty in ways that are useful to physician-users.

Objective To evaluate the ability of uncertainty metrics to quantify LLM confidence in diagnosis and treatment selection tasks by assessing their discrimination and calibration.
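
As a rough illustration of the two properties named in the Objective: discrimination asks whether correct answers receive higher confidence than incorrect ones, and calibration asks whether a confidence of, say, 0.8 corresponds to roughly 80% accuracy. The sketch below computes an AUROC for discrimination and a binned expected calibration error for calibration. Both are common choices, but this abstract does not state which statistics the authors used, and the function names and toy data are hypothetical.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a randomly chosen correct answer has a higher
    confidence than a randomly chosen incorrect one (ties count half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin.
    Assumes confidences lie in [0, 1]."""
    if not confidences:
        raise ValueError("need at least one answer")
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the top bin
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical toy data: the two correct answers get the two highest scores,
# so discrimination is perfect even though the scores are miscalibrated.
toy_conf = [0.9, 0.8, 0.7, 0.6]
toy_correct = [True, True, False, False]
print(auroc(toy_conf, toy_correct))
print(expected_calibration_error(toy_conf, toy_correct))
```

The toy data show why the two properties are assessed separately: a metric can rank answers well while its raw values are poor estimates of accuracy.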

Methods We examined the discrimination and calibration of Confidence Elicitation, Token-Level Probability, and Sample Consistency metrics across GPT3.5, GPT4, Llama2-70B, and Llama3-70B. The uncertainty metrics were evaluated on three datasets of open-ended patient scenarios.
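
Sample Consistency metrics treat agreement among repeated samples of a model's answer as a confidence signal. The following is a minimal sketch of the embedding-based variant, assuming each sampled answer has already been converted to a vector by some sentence-embedding model (not shown) and that the score is the mean pairwise cosine similarity. The abstract does not give the authors' exact aggregation, so the function names, the averaging step, and the toy vectors are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def sample_consistency(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity across sampled answers.
    Higher values mean the samples agree more with each other."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two sampled answers")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 2-dimensional "embeddings" of three sampled answers:
# two samples agree and the third points in an unrelated direction.
samples = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(sample_consistency(samples))
```

Real sentence embeddings have hundreds of dimensions and rarely yield similarities near zero, so raw scores from this kind of metric tend to cluster in a narrow range.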

Results Sample Consistency methods outperformed Token-Level Probability and Confidence Elicitation methods. Sample Consistency by sentence embedding cosine similarity achieved the highest discrimination performance but was poorly calibrated, while Sample Consistency by GPT annotation achieved the second-best discrimination with more accurate calibration. Nearly all uncertainty metrics discriminated better on diagnosis questions than on treatment selection questions, and verbalized confidence (Confidence Elicitation) consistently over-estimated model confidence.

Conclusions Sample Consistency methods were the best-performing metrics for assessing LLM uncertainty in medical diagnosis and treatment selection. We suggest Sample Consistency by sentence embedding cosine similarity if the user has a set of reference cases with which to re-calibrate results, and Sample Consistency by GPT annotation if the user has no reference cases and requires accurate raw calibration. Our results also confirm that LLMs are consistently over-confident when verbalizing their confidence through Confidence Elicitation.
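
Re-calibrating against reference cases can be pictured as learning a mapping from raw scores to observed accuracy. The sketch below uses histogram binning, one simple recalibration technique. The abstract does not say which re-calibration method the authors have in mind, so the technique, the function names, and the toy reference cases are all assumptions.

```python
from typing import List, Sequence


def bin_index(score: float, n_bins: int) -> int:
    """Bin for a score assumed to lie in [0, 1]; out-of-range values are clamped."""
    return max(0, min(int(score * n_bins), n_bins - 1))


def fit_histogram_binning(
    ref_scores: Sequence[float], ref_correct: Sequence[bool], n_bins: int = 5
) -> List[float]:
    """Observed accuracy of the reference cases in each score bin.
    Bins with no reference cases fall back to the overall accuracy."""
    if not ref_scores:
        raise ValueError("need at least one reference case")
    counts = [0] * n_bins
    hits = [0] * n_bins
    for score, ok in zip(ref_scores, ref_correct):
        idx = bin_index(score, n_bins)
        counts[idx] += 1
        hits[idx] += 1 if ok else 0
    overall = sum(hits) / sum(counts)
    return [hits[i] / counts[i] if counts[i] else overall for i in range(n_bins)]


def recalibrate(score: float, bin_accuracy: Sequence[float]) -> float:
    """Replace a raw score with the reference accuracy of its bin."""
    return bin_accuracy[bin_index(score, len(bin_accuracy))]


# Hypothetical reference cases: high raw scores were right 3 times out of 4,
# low raw scores were never right.
ref_scores = [0.10, 0.15, 0.90, 0.95, 0.92, 0.97]
ref_correct = [False, False, True, True, True, False]
table = fit_histogram_binning(ref_scores, ref_correct, n_bins=2)
print(table)
print(recalibrate(0.99, table))
```

A mapping of this kind preserves most of a metric's ranking of cases while correcting its raw values, which is why a well-discriminating but poorly calibrated metric can still be useful when reference cases are available.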

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

We have no funding sources to report applicable to this project.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study used the MedQA dataset, which is publicly published (https://doi.org/10.48550/arXiv.2009.13081). We also used the New England Journal of Medicine Case Report Series, which is also publicly published. Finally, we created a third dataset of our own fictional simulated patient data de novo for this study.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data is available for review at: https://doi.org/10.6084/m9.figshare.25962529.v1


Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted June 07, 2024.
medRxiv 2024.06.06.24308399; doi: https://doi.org/10.1101/2024.06.06.24308399
Subject Area

  • Health Informatics