
A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael A. Pfeffer, Nigam H. Shah
doi: https://doi.org/10.1101/2024.04.15.24305869
Author affiliations
  • 1 Stanford University: Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Akshay Swaminathan, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Arnold Milstein, Michael A. Pfeffer, Nigam H. Shah
  • 2 Stanford: Michael Wornow
  • 3 Harvard University: Lisa Soleymani Lehmann
  • 4 University of California San Diego: Karandeep Singh
  • 5 US Food and Drug Administration: Troy Tazbaz
  • For correspondence (Nigam H. Shah): nigam{at}stanford.edu

1. Abstract

Importance: Large Language Models (LLMs) can assist in a wide range of healthcare-related activities. Current approaches to evaluating LLMs make it difficult to identify the most impactful LLM application areas.

Objective: To summarize the current evaluation of LLMs in healthcare in terms of 5 components: evaluation data type, healthcare task, Natural Language Processing (NLP)/Natural Language Understanding (NLU) task, dimension of evaluation, and medical specialty.

Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between 01-01-2022 and 02-19-2024.

Study Selection: Studies evaluating one or more LLMs in healthcare.

Data Extraction and Synthesis: Three independent reviewers categorized 519 studies in terms of data used in the evaluation, the healthcare tasks (the what) and the NLP/NLU tasks (the how) examined, the dimension(s) of evaluation, and the medical specialty studied.

Results: Only 5% of reviewed studies utilized real patient care data for LLM evaluation. The most popular healthcare tasks were assessing medical knowledge (e.g. answering medical licensing exam questions, 44.5%), followed by making diagnoses (19.5%), and educating patients (17.7%). Administrative tasks such as assigning provider billing codes (0.2%), writing prescriptions (0.2%), generating clinical referrals (0.6%) and clinical notetaking (0.8%) were less studied. For NLP/NLU tasks, the vast majority of studies examined question answering (84.2%). Other tasks such as summarization (8.9%), conversational dialogue (3.3%), and translation (3.1%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias and toxicity (15.8%), robustness (14.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in internal medicine (42%), surgery (11.4%) and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%) and medical genetics (0.2%) being the least represented.

Conclusions and Relevance: Existing evaluations of LLMs mostly focused on accuracy of question answering for medical exams, without consideration of real patient care data. Dimensions like fairness, bias and toxicity, robustness, and deployment considerations received limited attention. To draw meaningful conclusions and improve LLM adoption, future studies need to establish a standardized set of LLM applications and evaluation dimensions, perform evaluations using data from routine care, and broaden testing to include administrative tasks as well as multiple medical specialties.

0. Key Points

  • Question: How are healthcare applications of large language models (LLMs) currently evaluated?

  • Findings: Studies rarely used real patient care data for LLM evaluation. Administrative tasks such as generating provider billing codes and writing prescriptions were understudied. Natural Language Processing (NLP)/Natural Language Understanding (NLU) tasks like summarization, conversational dialogue, and translation were infrequently explored. Accuracy was the predominant dimension of evaluation, while fairness, bias and toxicity assessments were neglected. Evaluations in specialized fields, such as nuclear medicine and medical genetics, were rare.

  • Meaning: Current LLM assessments in healthcare remain shallow and fragmented. To draw concrete insights on their performance, evaluations need to use real patient care data across a broad range of healthcare and NLP/NLU tasks and medical specialties with standardized dimensions of evaluation.

2. Introduction

The adoption of Artificial Intelligence (AI) in healthcare is rising, catalyzed by the emergence of Large Language Models (LLMs) like OpenAI’s ChatGPT 1,2,3,4. Unlike predictive AI, generative AI produces original content such as sound, image, and text 5. Within the realm of generative AI, LLMs produce structured, coherent prose in response to text inputs, with broad application in health system operations 6. Prominent applications such as facilitating clinical note-taking have already been implemented by several health systems in the U.S., and there is excitement in the medical community for improving healthcare efficiency, quality, and patient outcomes 7 8. A recent report estimates that LLMs could unlock a substantial portion of the $1 trillion in untapped healthcare efficiency improvements, including an estimated savings ranging from 5 to 10 percent of US healthcare spending or approximately $200 billion to $360 billion annually based on 2019 figures 9 10.

Despite their potential, the performance of LLMs in real-world healthcare settings remains inconsistently evaluated 11 12. For instance, Cadamuro et al. assessed ChatGPT-4’s diagnostic ability by evaluating relevance, correctness, helpfulness, and safety, finding responses to be generally superficial and sometimes inaccurate, lacking in helpfulness and safety 13. In contrast, Pagano et al. also assessed diagnostic ability, but focused solely on correctness, concluding that ChatGPT-4 exhibited a high level of accuracy comparable to clinician responses 14. Thus, we hypothesize that the current evaluation landscape lacks the uniformity, thoroughness, and robustness necessary to effectively guide the deployment of LLMs in a real-world setting.

This systematic review of 519 studies provides a comprehensive characterization of how LLMs have been evaluated in healthcare settings. To accomplish this, we categorize each study along 5 axes: evaluation data type used, healthcare task, NLP/NLU task, dimension of evaluation, and medical specialty. To enable the categorization of the diverse range of applications and their evaluation setups, we use two categorization frameworks: the first describes healthcare applications of LLMs in terms of their constituent healthcare and NLP/NLU tasks, and the second describes dimensions of evaluation and associated metrics. These frameworks are then applied systematically to characterize the current state of evaluations to quantify the variability in LLM application evaluations and identify areas for further exploration. Our results show that evaluations of LLM applications in healthcare have been unevenly distributed both in terms of dimensions of evaluation used and in terms of medical specialty and application.

3. Methods

3.1 Design

A systematic review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines as shown in Figure 1 15.

Figure 1: PRISMA Flow Diagram

This diagram shows the process of screening and selecting the categorized 519 studies.

3.2 Information sources

Peer-reviewed studies and preprints from January 1 2022, to February 19 2024, were retrieved from PubMed and Web of Science databases, using specific keywords as detailed in Supplement 1. Our search focused on titles and abstracts to identify studies on evaluation of LLMs’ healthcare applications. This two-year period aimed to capture publications evaluating LLM healthcare applications since the public launch of ChatGPT in November 2022. Given our hypothesis that the current landscape lacks the necessary elements needed to truly assess LLM performance in healthcare, we included a broad spectrum of studies. Citations were imported into EndNote 21 (Clarivate) for analysis.

3.3 Categorization framework

Each study was categorized by evaluation data type, healthcare task, NLP/NLU task, dimension of evaluation, and medical specialty. Healthcare task categories were developed using publicly available healthcare task and competency lists and were refined by consulting board-certified MDs 16 17, as outlined in Table 1. NLP/NLU categories and dimensions of evaluation were developed using the Holistic Evaluation of Language Models (HELM) and Hugging Face frameworks 18 19, as shown in Tables 2 and 3. Medical specialties were adapted from Accreditation Council for Graduate Medical Education (ACGME) residency programs 20.

Table 1: Healthcare task definitions and examples

This table lists the range of healthcare tasks that the 519 studies were categorized into, with definition and example for each task category.

3.4 Eligibility criteria and screening

Screening was conducted by SB, YL, and LOE using the Covidence software (Covidence, 2024) as outlined in Figure 1. Included studies used LLMs for healthcare tasks and evaluated their performance. Excluded articles were those focused on multimodal tasks or basic biological science research with LLMs.

3.5 Data extraction and labeling

We adopted a paired review approach, wherein each study was categorized into evaluation data type, healthcare tasks, NLP/NLU tasks, dimension(s) of evaluation, and medical specialty by at least one human reviewer (SB, YL, or LOE) and GPT-4, based on the title and abstract. Note that GPT-4 was used as a force multiplier while the final categories were assigned by the human reviewers. In instances of disagreement regarding category assignments, the methods sections of the studies were retrieved, and final categories were determined through reviewer consensus. The prompts given to GPT-4 can be found in Supplement 2.
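The paired review step can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' tooling: the `Labels` class, the `triage` function, and the study identifiers are hypothetical names introduced here, and the sketch assumes that a study is escalated to consensus review whenever the human label set and the GPT-4 label set differ on an axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Labels:
    """Labels for one study on one axis (hypothetical structure)."""
    human: frozenset
    gpt4: frozenset


def needs_consensus(labels: Labels) -> bool:
    # Assumption: any difference between the two label sets is a disagreement.
    return labels.human != labels.gpt4


def triage(studies: dict) -> tuple[list, list]:
    """Split study ids into those that agree and those needing consensus review."""
    agreed, disputed = [], []
    for study_id, labels in studies.items():
        (disputed if needs_consensus(labels) else agreed).append(study_id)
    return agreed, disputed


# Toy input; the study ids and label assignments are invented for illustration.
studies = {
    "study_A": Labels(frozenset({"Question answering"}),
                      frozenset({"Question answering"})),
    "study_B": Labels(frozenset({"Summarization"}),
                      frozenset({"Summarization", "Information extraction"})),
}
agreed, disputed = triage(studies)
print("agreed:", agreed)
print("needs consensus:", disputed)
```

In this sketch the disputed studies are the ones whose methods sections would be retrieved; the human reviewers still assign the final category in every case.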

Each study received one or more healthcare task, NLP/NLU task, and dimension of evaluation labels as appropriate; hence the percentages in Table 4 sum to more than 100%. In addition, each study could be assigned more than one medical specialty based on the evaluation conducted.
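Why multi-label assignment pushes the percentages past 100% can be seen in a small sketch. The four toy studies below are invented for illustration and are not drawn from the review; `label_percentages` is a hypothetical helper that divides each label's count by the number of studies, not by the number of labels.

```python
from collections import Counter


def label_percentages(study_labels: list[set[str]]) -> dict[str, float]:
    """Percentage of studies carrying each label (a study may carry several)."""
    n_studies = len(study_labels)
    counts = Counter(label for labels in study_labels for label in labels)
    return {label: round(100 * c / n_studies, 1) for label, c in counts.items()}


# Toy data: four studies, two of which carry two labels each.
toy_studies = [
    {"Accuracy"},
    {"Accuracy", "Robustness"},
    {"Accuracy", "Fairness, bias, and toxicity"},
    {"Comprehensiveness"},
]
pct = label_percentages(toy_studies)
total = sum(pct.values())
print(pct)
print("sum of percentages:", total)
```

Because the denominator is the number of studies, every additional label on a study adds to the total, so the column sum exceeds 100% as soon as any study has more than one label.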

4. Results

749 relevant studies were screened for eligibility. After applying the inclusion and exclusion criteria described in Section 3.4, 519 studies were included in the analysis using the frameworks developed by the authors.

4.1 Categorization framework for healthcare tasks, NLP/NLU tasks and dimensions of evaluation

We deconstructed each healthcare application of an LLM into its constituent healthcare task (Table 1), i.e. the clinical or non-clinical task it is used for (the “what”), and the NLP/NLU task (Table 2), i.e. the language processing task being performed (the “how”). Examples of a healthcare task are diagnosing a patient’s disease and recommending a treatment for osteoarthritis. Examples of the NLP/NLU task, which is not necessarily specific to the medical domain, are summarizing the impression section of a radiology report and answering questions about the symptoms of type 2 diabetes.

Table 2: Definition of NLP/NLU tasks

This table lists the range of NLP/NLU tasks that the 519 studies were categorized into, with definition and examples for each task category.

An example of how healthcare tasks and NLP/NLU combine for a healthcare application of LLM is how Gan et al. evaluated LLM performance for mass-casualty triaging 40. The healthcare task (the “what”) is triaging patients while the NLP/NLU tasks (the “how”) are information extraction (extracting detailed patient information from the triage questionnaire scenarios, including age, symptoms, and vital signs), text classification (classifying the triage questionnaire scenarios into different triage levels), and question answering (generating final decision responses to the triage questionnaire).
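The “what”/“how” decomposition maps naturally onto a small data structure. The sketch below is one possible encoding and not part of the authors' framework: the `NLPTask` enum uses the six NLP/NLU categories named in this section, while the `LLMApplication` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class NLPTask(Enum):
    """The six NLP/NLU task categories used in the review."""
    SUMMARIZATION = "Summarization"
    QUESTION_ANSWERING = "Question answering"
    INFORMATION_EXTRACTION = "Information extraction"
    TEXT_CLASSIFICATION = "Text classification"
    TRANSLATION = "Translation"
    CONVERSATIONAL_DIALOGUE = "Conversational dialogue"


@dataclass(frozen=True)
class LLMApplication:
    """One healthcare application: a healthcare task plus its NLP/NLU tasks."""
    healthcare_task: str   # the "what"
    nlp_tasks: frozenset   # the "how"; one application may use several


# The mass-casualty triage example described above, encoded in this structure.
triage_app = LLMApplication(
    healthcare_task="Triaging patients",
    nlp_tasks=frozenset({
        NLPTask.INFORMATION_EXTRACTION,
        NLPTask.TEXT_CLASSIFICATION,
        NLPTask.QUESTION_ANSWERING,
    }),
)
print(triage_app.healthcare_task, sorted(t.value for t in triage_app.nlp_tasks))
```

Keeping the NLP/NLU tasks as a set reflects that a single application can combine several of them, which is also why per-task percentages need not sum to 100%.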

We initially compiled a list of healthcare tasks using publicly available resources 41 42. Subsequently, through consultation with three board-certified MDs, we refined the list through iterative discussions to establish the final categories for classification, as outlined in Table 1. To compile a list of common NLP/NLU tasks, we referred to sources such as the Holistic Evaluation of Language Models (HELM) study and the Hugging Face task framework to derive 6 categories: 1) Summarization, 2) Question answering, 3) Information extraction, 4) Text classification (of texts such as clinical notes, research articles, and documents), 5) Translation, and 6) Conversational dialogue (Table 2) 43 44.

We categorized the most common dimensions of evaluation used in the reviewed studies based on the list outlined in Table 3. These dimensions include: 1) Accuracy, 2) Calibration and uncertainty, 3) Robustness, 4) Factuality, 5) Comprehensiveness, 6) Fairness, bias, and toxicity, and 7) Deployment considerations. Fairness, bias, and toxicity were grouped together for ease of analysis, due to their infrequent occurrence in the reviewed studies, and relevance to ethical evaluation of LLMs. Additionally, we compiled common metrics for each dimension (eFigure 1) to serve as a starting framework for researchers designing studies to assess LLM performance in healthcare applications.

Table 3: Dimensions of evaluation for LLM response

This table lists the range of dimensions of evaluations that the 519 studies were categorized into, with definitions, metrics and reviewer-generated example responses where each dimension is evaluated for a simple input question, “What are the symptoms of Type 2 diabetes?”

4.2 Distribution of studies based on evaluation data type

Among the reviewed studies, 5% evaluated and tested LLMs using real patient care data, while the remaining relied on data such as medical examination questions, clinician-designed vignettes or Subject Matter Expert (SME) generated questions.

4.3 Categorizing articles based on healthcare tasks and NLP/NLU tasks

The studies we examined had a predominant focus on evaluating LLMs for their medical knowledge (Table 4), primarily through assessments such as the USMLE. This trend assumes that because we assess medical professionals’ readiness for entering clinical practice through board-style examinations, mirroring this type of evaluation for LLMs is adequate to certify their fitness-for-use. Making diagnoses, educating patients and making treatment recommendations were the other common healthcare tasks studied. While these tasks represent critical aspects of healthcare delivery, validating the utility of LLMs in supporting them requires assessment with real patient care data. The limited examination of administrative tasks like assigning provider billing codes, writing prescriptions, generating clinical referrals, and clinical notetaking suggests a gap in studying LLMs’ use for high-value, immediately impactful administrative tasks. These tasks are often labor intensive, presenting a ripe opportunity for testing LLMs to enhance efficiency in these areas 45.

Table 4: Frequency of publications examining each dimension of evaluation across healthcare and NLP/NLU task categories

The first column lists healthcare tasks followed by NLP/NLU tasks (separated by a double line); the first row lists the dimensions of evaluation used in each study examined. The percentages in the last row are the percentage of studies in which a specific dimension was evaluated, and the percentages in the last column indicate the percentage of studies in which a specific healthcare task or NLP/NLU task was evaluated.

Among the NLP/NLU tasks, most studies evaluated LLM performance through question answering tasks. These tasks ranged from addressing generic inquiries about symptoms and treatments to tackling board-style questions featuring clinical vignettes. While this initial emphasis is understandable, it underscores a substantial gap in testing LLMs with real patient care data, encompassing diverse patient demographics, medical history, medications, and lab results. Approximately a quarter of the studies focused on text classification and information extraction tasks. Tasks such as summarization, conversational dialogue, and translation remained underexplored. This gap is significant because condensing patient records into concise summaries, translating medical content into simpler language or the patient’s native language, and facilitating conversations through chatbots are often touted benefits of using LLMs and could substantially alleviate physician burden.

4.4 Categorizing articles based on the dimensions of evaluation

As seen in Table 4, accuracy and comprehensiveness were overwhelmingly the top two most examined dimensions, whereas factuality, fairness, bias, and toxicity, robustness, deployment considerations, and calibration and uncertainty were infrequently assessed. This suggests a potential gap in assessing the broader capabilities and suitability of LLMs for real-world deployment. While accuracy and comprehensiveness are crucial for ensuring the reliability and effectiveness of LLMs in healthcare tasks, dimensions like fairness, bias, and toxicity are equally vital for addressing ethical concerns and ensuring equitable outcomes. Similarly, robustness and deployment considerations are essential for assessing the sustainability of integrating LLMs into healthcare systems. The limited assessment of calibration and uncertainty raises questions about the extent to which researchers are addressing the need for LLMs to provide uncertainty quantifications, particularly in healthcare scenarios.
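To make the calibration dimension concrete, the sketch below computes expected calibration error (ECE), a widely used summary of how well stated confidence matches observed accuracy. This is an assumption made for illustration: the text here does not say which calibration metrics the reviewed studies used, and the function name, the equal-width binning, and the toy inputs are all introduced for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy case: the model claims 90% confidence on four answers but gets three right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, True, False])
# Toy case: full confidence, all correct, so confidence matches accuracy.
well_calibrated = expected_calibration_error([1.0, 1.0], [True, True])
print(overconfident, well_calibrated)
```

A lower value means the model's confidence tracks its accuracy more closely; the first toy case has a gap of about 0.15 between claimed confidence (0.90) and observed accuracy (0.75).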

4.5 Distribution of studies by medical specialty

We categorized studies according to the Accreditation Council for Graduate Medical Education (ACGME) residency programs, augmented to include additional categories to capture studies investigating applications in dental specialties, treatment of genetic disorders and generic healthcare applications 46. Notably, over a fifth of the studies were categorized as generic, indicating a significant focus on healthcare applications that are relevant to many specialties, rather than a specific specialty. Among the specialties, internal medicine, surgery, and ophthalmology were the top specialties. Nuclear medicine, physical medicine, and medical genetics were the least prevalent specialties in studies, accounting for 6 studies in total. The exact percentages of studies in different specialties are outlined in eTable 2. The distribution of studies across specialties underscores the potential for LLMs to contribute to a wide range of medical specialties, but also signals opportunities for further exploration within less represented areas such as nuclear medicine, physical medicine, and medical genetics.

5. Discussion

Our systematic review of 519 studies summarizes existing evaluations of LLMs across medical specialties. Studies ranged widely in the underlying healthcare task, NLP/NLU task, and dimension of evaluation. Based on the results, we identified seven limitations in the current efforts and suggest how to address them in the future. These limitations demonstrate an urgent need to develop nationwide consensus-driven guidance for evaluating LLMs in medicine, in a manner similar to the creation of the blueprint for trustworthy AI by The Coalition for Health AI for traditional AI models 47.

The need for evaluations based on real patient care data

One striking finding is that only 5% of the studies used real patient care data for evaluation, with most studies using a mix of medical exam questions, patient vignettes and subject matter expert generated questions 48 49 50. Our recent JAMA special communication pointed out that testing LLMs with hypothetical medical questions is like assessing a car’s performance with multiple-choice questions before certifying it for road use 51. Real patient care data encompasses the complexities of clinical practice, providing a more thorough evaluation of LLM performance that will closely mirror real-world performance 52 53 54 55.

Real-world LLM evaluations provide valuable insights that may be overlooked in simulations or synthetic environments. For instance, while LLMs have been touted for potentially saving time and enhancing clinician experience, Garcia et al. found that the mean utilization rate for drafting patient messaging responses in an EHR system was only 20%, resulting in a reduction in burnout score but no time savings 56.

Given the importance of using real patient care data, systems need to be created to ensure their use in evaluating LLMs’ healthcare applications. The Office of the National Coordinator for Health Information Technology (ONC) recently passed HTI-1, the first federal regulation to set specific reporting requirements for developers of AI tools 57. ONC and other regulators should look to embed a mandate for the use of patient care data in the evaluation process of LLM tools into their requirements.

The need to standardize the task formulations and dimensions of evaluation

There is a lack of consensus on which dimensions of evaluation to examine for a given healthcare task or NLP/NLU task. For instance, for a medical education task, Ali et al. tested the performance of GPT-4 on a written board examination focusing on output accuracy as the sole dimension 58. Another study tested the performance of ChatGPT on the USMLE, focusing on output accuracy, factuality and comprehensiveness as primary dimensions of evaluation 59.

To address this challenge, we need to establish shared definitions of tasks and corresponding dimensions of evaluation. Similar to how efforts such as Holistic Evaluation of Language Models (HELM) define the dimensions of evaluation of an LLM that matter in general, a framework specific for healthcare is necessary to define the core dimensions of evaluation to be assessed across studies. Doing so enables better comparisons and cumulative learning from which reliable conclusions can be drawn for future technical work and policy guidance.

Prioritize immediately impactful, administrative applications

Current research predominantly focuses on medical knowledge tasks, such as answering medical exam questions (44.5%), or on complex healthcare tasks, such as making diagnoses (19.5%) and making treatment recommendations (9.2%). However, there are many administrative tasks in healthcare that are often labor-intensive, requiring manual input and contributing to physician burnout 60. In particular, assigning provider billing codes (1 study), writing prescriptions (1 study), generating clinical referrals (3 studies), and clinical note-taking (4 studies) all remain under-researched and could greatly benefit from a systematic evaluation of using LLMs for those tasks 61 62 63 64.

The need to bridge gaps in LLM utilization across clinical specialties

The substantial representation of generic healthcare applications, accounting for over a fifth of the studies, underscores the potential of LLMs in addressing needs applicable to many specialties, such as summarizing medical reports. In contrast, the scarcity of research in particular specialties like nuclear medicine (3 studies), physical medicine (2 studies), and medical genetics (1 study) suggests an untapped potential for using LLMs in these complex medical domains that often present intricate diagnostic challenges and demand personalized treatment approaches 65 66 67 68. The lack of LLM-focused studies in these areas may indicate the need for increased awareness, collaboration, or specialized adaptation of such models to suit the unique demands of these specialties.
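The study counts quoted in this Discussion can be reconciled with the percentages reported in the Abstract by dividing each count by the 519 included studies. The sketch below does only that arithmetic; the dictionary and function names are introduced here for illustration.

```python
TOTAL_STUDIES = 519  # number of studies included in the review


def share(count: int, total: int = TOTAL_STUDIES) -> float:
    """Percentage of included studies, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Counts as stated in the Discussion.
admin_counts = {
    "Assigning provider billing codes": 1,
    "Writing prescriptions": 1,
    "Generating clinical referrals": 3,
    "Clinical note-taking": 4,
}
specialty_counts = {
    "Nuclear medicine": 3,
    "Physical medicine": 2,
    "Medical genetics": 1,
}

admin_pct = {task: share(n) for task, n in admin_counts.items()}
specialty_pct = {name: share(n) for name, n in specialty_counts.items()}
least_represented_total = sum(specialty_counts.values())

print(admin_pct)
print(specialty_pct)
print("studies in the three least represented specialties:", least_represented_total)
```

The rounded shares reproduce the Abstract's figures (0.2%, 0.2%, 0.6%, and 0.8% for the administrative tasks; 0.6%, 0.4%, and 0.2% for the three specialties).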

The need for a realistic accounting of financial impact

Generative AI is projected to create $200 billion to $360 billion in healthcare cost savings through productivity improvements 69. However, the implementation of these tools could pose a significant financial burden to health systems. In a recent review by Sahni and Carrus, defining the cost and benefit of deploying AI was highlighted as one of the greatest challenges 70. Health systems need to capture both costs and benefits in order to accurately estimate and budget for increased implementation and computing costs 71.

Within this review, only one study conducted a financial impact or cost-effectiveness analysis. Rau et al. investigated the use of ChatGPT to develop personalized imaging, demonstrating "an average decision time of 5 minutes and a cost of €0.19 for all cases, compared to 50 minutes and €29.99 for radiologists" 72. However, this analysis was a parallel implementation of the LLM solution compared with the traditional radiologist approach, thus not providing a realistic assessment of the added value of LLM integration into existing clinical workflows and its corresponding financial impact.

While the dearth of real-world testing is understandable given the infancy of LLM applications in healthcare, it is imperative to establish realistic assessments of these tools before reallocating resources from other healthcare initiatives. Notably, such assessments should estimate the total cost of implementation, which includes not only the cost to run the model but also expenses associated with monitoring, maintenance, and any necessary infrastructure adjustments.

The need to better define and quantify bias

Recent studies have highlighted a concerning trend of LLMs perpetuating race-based medicine in their responses 73. This phenomenon can be attributed to the tendency of LLMs to reproduce information from their training data, which may contain human biases 74. To improve our methods for evaluating and quantifying bias, we need to first collectively establish what it means to be unbiased.

While efforts to assess racial and ethical biases exist, only 15.8% of studies have conducted any evaluation that delves into how factors such as race, gender, or age impact bias in the model’s output 75 76 77. Future research should place greater emphasis on such evaluations, particularly as policymakers develop best practices and guidance for model assurance. Mandating these evaluations as part of a “model report card” could be a proactive step towards mitigating harmful biases perpetuated by LLMs 78.

The need to publicly report failure modes

The analysis of failure modes has long been regarded as fundamental in engineering and quality management, facilitating the identification, examination, and subsequent mitigation of failures 79. The FDA has databases for adverse event reporting in pharmaceuticals and medical devices, but there is currently no analogous place for reporting failure modes for AI systems, let alone LLMs, in healthcare 80 81.

Only a select few studies examined, in their ‘Conclusion’ sections, why the deployment of the LLM did not produce satisfactory results (e.g. ineffective prompt engineering) 82. A deeper examination of failure modes, and of why the exercise was deemed unsuccessful or inaccurate (e.g. the reference data was factually incorrect or outdated), is necessary to further improve the use of LLMs in healthcare settings.

6. Conclusion

The evaluation of LLMs lacks standardized task definitions and dimensions of evaluation. This systematic review underscores the need for evaluating LLMs using real patient care data, particularly on administrative healthcare tasks like generating provider billing codes, writing prescriptions, and clinical note-taking. It highlights the need to expand testing criteria beyond accuracy to include fairness, bias, toxicity, robustness, and deployment considerations across different medical specialties. Establishing shared task definitions and rigorous testing and evaluation standards is crucial for the safe integration of LLMs in healthcare. Realistic financial accounting and robust reporting of failures are essential to accurately assess their value and safety in clinical settings. Broadly, there is an urgent need to develop a nationwide consensus and guidance for evaluating LLMs in healthcare, so that we may realize the tremendous promise these groundbreaking technologies have to offer.

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

Author contributions

SB, YL, LOE, and NHS conceived of the study and defined the main outcomes and measures. SB, LOE, and YL searched the literature to identify the publications to review and categorized the publications. SB, LOE, YL, and NHS drafted the manuscript. SB designed the GPT-4 based screening strategy with input from DD. SB developed the NLP/NLU task and dimensions of evaluation framework. LOE developed the healthcare task framework. DD guided the creation and categorization of healthcare tasks. AC and AS guided the creation of the NLP/NLU task categorization. JAF guided the creation of the dimensions of evaluation categorization, and SK helped select HELM dimensions to reuse. MK refined the medical specialty categorization, and MW critiqued the review methodology and figure organization. LL and HH assessed the usefulness of the frameworks for other analyses. NRS guided LOE and YL on all aspects of performing systematic reviews. AM reviewed and edited the manuscript for framing the discussion. KS and TT assessed the relevance of the results for developing consensus LLM testing and evaluation guidance for CHAI. MAP critiqued the deployment concerns in health systems and reviewed the categories. All authors reviewed, edited, and approved the final manuscript.

Supplement 1. Search terms for PubMed as of 02/19/2024

(("Large Language Model" [Title/Abstract] OR "ChatGPT" [Title/Abstract] OR "Generative AI" [Title/Abstract]) AND ("Health" [Title/Abstract] OR "Medical" [Title/Abstract] OR "Clinical" [Title/Abstract] OR "Medicine" [Title/Abstract]) AND ("Test" [Title/Abstract] OR "Evaluate" [Title/Abstract] OR "Performance" [Title/Abstract] OR "Assess" [Title/Abstract]))

Search terms for Web of Science as of 02/19/2024

(TS=("Large Language Model" OR "ChatGPT" OR "Generative AI") AND TS=("Health" OR "Medical" OR "Clinical" OR "Medicine") AND TS=("Test" OR "Evaluate" OR "Performance" OR "Assess"))
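Both search strings in Supplement 1 follow the same pattern: terms are OR-ed within each of three groups (model, domain, evaluation), and the groups are AND-ed together. The sketch below shows one way such strings could be assembled from shared term lists so that the two databases stay in sync. It is an illustration only, not the procedure the authors used; the function and variable names (`pubmed_query`, `wos_query`, `GROUPS`) are hypothetical.

```python
# Term groups copied from Supplement 1. All helper names are illustrative.
MODEL_TERMS = ["Large Language Model", "ChatGPT", "Generative AI"]
DOMAIN_TERMS = ["Health", "Medical", "Clinical", "Medicine"]
EVAL_TERMS = ["Test", "Evaluate", "Performance", "Assess"]
GROUPS = [MODEL_TERMS, DOMAIN_TERMS, EVAL_TERMS]


def pubmed_query(groups):
    """OR the terms inside each group, AND the groups together.

    Each term is tagged with the PubMed [Title/Abstract] field.
    """
    clauses = [
        "(" + " OR ".join(f'"{term}" [Title/Abstract]' for term in group) + ")"
        for group in groups
    ]
    return "(" + " AND ".join(clauses) + ")"


def wos_query(groups):
    """Same boolean structure, using the Web of Science TS= (topic) field."""
    clauses = [
        "TS=(" + " OR ".join(f'"{term}"' for term in group) + ")"
        for group in groups
    ]
    return "(" + " AND ".join(clauses) + ")"


if __name__ == "__main__":
    print(pubmed_query(GROUPS))
    print(wos_query(GROUPS))
```

Keeping the term lists in one place means a term added to a group appears in both queries, which makes the search easier to rerun and to report.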

Supplement 2. Prompts used to extract and assign categories for human review

Prompt 1

"You are assisting in a systematic review of large language models in healthcare. Summarize the {entity_type} mentioned in this research abstract in 25 words"

Prompt 2

“Using the generated summaries, identify and categorize the following text based on {entity_type}:\n\n{text}\n\nCategories:”

Here, entity_type can be NLP task, medical specialty, or metric, and categories is the list of possible values into which text can be categorized for that entity_type.
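As a rough illustration of how the two templates might be filled in before being sent to a model, the sketch below substitutes `{entity_type}` and `{text}` with Python string formatting. The supplement does not say how the category list is attached to Prompt 2, so appending it as a comma-separated list after "Categories:" is an assumption. The function names are hypothetical, and no model call is shown.

```python
# Templates transcribed from Supplement 2; helper names are illustrative.
PROMPT_1 = (
    "You are assisting in a systematic review of large language models in "
    "healthcare. Summarize the {entity_type} mentioned in this research "
    "abstract in 25 words"
)
PROMPT_2 = (
    "Using the generated summaries, identify and categorize the following "
    "text based on {entity_type}:\n\n{text}\n\nCategories:"
)
ENTITY_TYPES = ("NLP task", "medical specialty", "metric")


def _check_entity_type(entity_type):
    """Reject entity types that the supplement does not list."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity_type: {entity_type!r}")


def build_summary_prompt(entity_type):
    """Fill Prompt 1 for one entity type."""
    _check_entity_type(entity_type)
    return PROMPT_1.format(entity_type=entity_type)


def build_categorization_prompt(entity_type, text, categories):
    """Fill Prompt 2 and append the candidate categories.

    Assumption: categories are appended as a comma-separated list.
    """
    _check_entity_type(entity_type)
    prompt = PROMPT_2.format(entity_type=entity_type, text=text)
    return prompt + " " + ", ".join(categories)


if __name__ == "__main__":
    # Placeholder inputs, not data from the study.
    print(build_summary_prompt("metric"))
    print(build_categorization_prompt("metric", "<summary text>", ["<A>", "<B>"]))
```

Validating `entity_type` against the three values named in the supplement catches a mistyped entity type before any prompt is built.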

eFigure 1
eFigure 1 Examples of metrics for each dimension of evaluation

The first row represents the names of the dimensions of evaluation in our designed framework. Under each dimension there are metrics. The bold italicized cells represent metric subclasses for each dimension and regular font cells under each subclass represent the metrics.

eTable 2 Frequency of publications by medical specialty

This table shows the different medical specialties of the 519 studies, along with three additional categories: Generic, Dentistry, and Medical Genetics.

Acknowledgements

We thank Nicholas Chedid for extensive guidance in the development of the healthcare task categorization.

References

  1. Stafie CS, Sufaru IG, Ghiciuc CM et al. Exploring the Intersection of Artificial Intelligence and Clinical Healthcare: A Multidisciplinary Review. Diagnostics. 2023;13(12):1995. doi:10.3390/diagnostics13121995
  2. Kohane IS. Injecting Artificial Intelligence into Medicine. NEJM AI. 2024;1(1). doi:10.1056/aie2300197
  3. Goldberg CB, Adams L, Blumenthal D et al. To Do No Harm — and the Most Good — with AI in Health Care. NEJM AI. 2024;1(3). doi:10.1056/aip2400036
  4. Wachter RM, Brynjolfsson E. Will Generative Artificial Intelligence Deliver on Its Promise in Health Care? JAMA. 2024;331(1):65–69. doi:10.1001/jama.2023.25054
  5. Liu Y, Zhang K, Li Y, et al. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models. arXiv preprint arXiv:2402.17177. 2024 Feb 27
  6. Karabacak M, Margetis K. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus. 2023 May 21;15(5):e39305. doi: 10.7759/cureus.39305. PMID: 37378099; PMCID: PMC10292051
  7. Landi H. Abridge clinches $150M to build out generative AI for medical documentation. Fierce Healthcare. Published February 23rd 2024. https://www.fiercehealthcare.com/ai-and-machine-learning/abridge-clinches-150m-build-out-generative-ai-medical-documentation
  8. Webster P. Six ways large language models are changing healthcare. Nat Med. 2023;29(12):2969–2971. doi:10.1038/s41591-023-02700-1
  9. Bhasker S, Bruce D, Lamb J et al. Tackling healthcare’s biggest burdens with generative AI. McKinsey. www.mckinsey.com. Published July 10, 2023. https://www.mckinsey.com/industries/healthcare/our-insights/tackling-healthcares-biggest-burdens-with-generative-ai
  10. Sahni NR, Stein G, Zemmel R, Cutler D. The Potential Impact of Artificial Intelligence on Health Care Spending. National Bureau of Economic Research. Published January 1, 2023. Accessed March 26, 2024.
  11. Shah NH, Entwistle D, Pfeffer MA. Creation and Adoption of Large Language Models in Medicine. JAMA. 2023;330(9):866–869. doi:10.1001/jama.2023.14217
  12. Wornow M, Xu Y, Thapa R, et al. The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med. 2023;6(1):135. Published 2023 Jul 29. doi:10.1038/s41746-023-00879-8
  13. Cadamuro J, Cabitza F, Debeljak Z, et al. Potentials and pitfalls of ChatGPT and natural-language artificial intelligence models for the understanding of laboratory medicine test results. An assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI). Clin Chem Lab Med. 2023;61(7):1158–1166. Published 2023 Apr 24. doi:10.1515/cclm-2023-0355
  14. Pagano S, Holzapfel S, Kappenschneider T, et al. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol. 2023;24(1):61. Published 2023 Nov 28. doi:10.1186/s10195-023-00740-4
  15. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an Updated Guideline for Reporting Systematic Reviews. British Medical Journal. 2021;372(71). doi:10.1136/bmj.n71
  16. USMLE Physician Tasks/Competencies. 2020. https://www.usmle.org/sites/default/files/2021-08/USMLE_Physician_Tasks_Competencies.pdf
  17. Norden J, Wang J, Bhattacharyya A. Where Generative AI Meets Healthcare: Updating The Healthcare AI Landscape. AI Checkup. Published June 22, 2023. https://aicheckup.substack.com/p/where-generative-ai-meets-healthcare
  18. Liang P, Bommasani R, Lee T et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research. Published online February 1, 2023. Accessed February 2024. https://openreview.net/forum?id=iO4LZibEqW
  19. Tasks - Hugging Face. huggingface.co. https://huggingface.co/tasks
  20. Residency & Fellowship Programs. Graduate Medical Education. https://med.stanford.edu/gme/programs.html
  21. Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Published online March 29, 2023. doi:10.1101/2023.03.25.23287743
  22. Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R. Comparison of diagnostic and triage accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and physicians for patients in an emergency department: Clinical Data Analysis Study. JMIR mHealth and uHealth. 2023;11. doi:10.2196/49995
  23. Babayiğit O, Tastan Eroglu Z, Ozkan Sen D, Ucan Yarkac F. Potential use of ChatGPT for patient information in Periodontology: A descriptive pilot study. Cureus. Published online November 8, 2023. doi:10.7759/cureus.48518
  24. Wilhelm TI, Roos J, Kaczmarczyk R. Large language models for therapy recommendations across 3 clinical specialties: Comparative study. Journal of Medical Internet Research. 2023;25. doi:10.2196/49324
  25. Srivastava R, Srivastava S. Can Artificial Intelligence Aid Communication? considering the possibilities of GPT-3 in palliative care. Indian Journal of Palliative Care. 2023;29:418–425. doi:10.25259/ijpc_155_2023
  26. Dağcı M, Çam F, Dost A. Reliability and quality of the nursing care planning texts generated by ChatGPT. Nurse Educator. Published online November 22, 2023. doi:10.1097/nne.0000000000001566
  27. Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: A descriptive study. Journal of Educational Evaluation for Health Professions. 2023;20:1. doi:10.3352/jeehp.2023.20.1
  28. Suppadungsuk S, Thongprayoon C, Krisanapan P, et al. Examining the validity of ChatGPT in identifying relevant nephrology literature: Findings and implications. Journal of Clinical Medicine. 2023;12(17):5550. doi:10.3390/jcm12175550
  29. Rao A, Kim J, Kamineni M, Pang M, Lie W, Succi MD. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv. Published online February 7, 2023. doi:10.1101/2023.02.02.23285399
  30. Barash Y, Klang E, Konen E, Sorin V. ChatGPT-4 assistance in Optimizing Emergency Department radiology referrals and Imaging Selection. Journal of the American College of Radiology. 2023;20(10):998–1003. doi:10.1016/j.jacr.2023.06.009
  31. Chung EM, Zhang SC, Nguyen AT, Atkins KM, Sandler HM, Kamrava M. Feasibility and acceptability of ChatGPT generated radiology report summaries for cancer patients. DIGITAL HEALTH. 2023;9. doi:10.1177/20552076231221620
  32. Groza T, Caufield H, Gration D, et al. An evaluation of GPT models for phenotype concept recognition. BMC Medical Informatics and Decision Making. 2024;24(1). doi:10.1186/s12911-024-02439-w
  33. Razdan S, Siegal AR, Brewer Y, Sljivich M, Valenzuela RJ. Assessing ChatGPT’s ability to answer questions pertaining to erectile dysfunction: Can our patients trust it? International Journal of Impotence Research. Published online November 20, 2023. doi:10.1038/s41443-023-00797-z
  34. Kassab J, Hadi El Hajjar A, Wardrop RM, Brateanu A. Accuracy of online artificial intelligence models in Primary Care Settings. American Journal of Preventive Medicine. Published online February 2024. doi:10.1016/j.amepre.2024.02.006
  35. Lim B, Seth I, Dooreemeah D, Lee CH. Delving into new frontiers: Assessing ChatGPT’s proficiency in revealing uncharted dimensions of general surgery and pinpointing innovations for future advancements. Langenbeck’s Archives of Surgery. 2023;408(1). doi:10.1007/s00423-023-03173-z
  36. Lossio-Ventura JA, Weger R, Lee AY, et al. A comparison of ChatGPT and fine-tuned open pre-trained transformers (OPT) against widely used sentiment analysis tools: Sentiment analysis of COVID-19 survey data. JMIR Mental Health. 2024;11. doi:10.2196/50150
  37. Chen Q, Sun H, Liu H, et al. An extensive benchmark study on biomedical text generation and mining with ChatGPT. Bioinformatics. 2023;39(9). doi:10.1093/bioinformatics/btad557
  38. Wang H, Gao C, Dantona C, Hull B, Sun J. DRG-Llama: Tuning llama model to predict diagnosis-related group for hospitalized patients. npj Digital Medicine. 2024;7(1). doi:10.1038/s41746-023-00989-3
  39. Aiumtrakul N, Thongprayoon C, Arayangkool C, et al. Personalized medicine in urolithiasis: AI chatbot-assisted dietary management of oxalate for Kidney Stone Prevention. Journal of Personalized Medicine. 2024;14(1):107. doi:10.3390/jpm14010107
  40. Gan RK, Ogbodo JC, Wee YZ, Gan AZ, González PA. Performance of Google bard and ChatGPT in mass casualty incidents triage. Am J Emerg Med. 2024;75:72–78. doi:10.1016/j.ajem.2023.10.034
  41. USMLE Physician Tasks/Competencies. 2020. https://www.usmle.org/sites/default/files/2021-08/USMLE_Physician_Tasks_Competencies.pdf
  42. Norden J, Wang J, Bhattacharyya A. Where Generative AI Meets Healthcare: Updating The Healthcare AI Landscape. AI Checkup. Published June 22, 2023. https://aicheckup.substack.com/p/where-generative-ai-meets-healthcare
  43. Liang P, Bommasani R, Lee T et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research. Published online February 1, 2023. Accessed February 2024. https://openreview.net/forum?id=iO4LZibEqW
  44. Tasks - Hugging Face. huggingface.co. https://huggingface.co/tasks
  45. Heuer AJ. More Evidence That the Healthcare Administrative Burden Is Real, Widespread and Has Serious Consequences; Comment on "Perceived Burden Due to Registrations for Quality Monitoring and Improvement in Hospitals: A Mixed Methods Study". Int J Health Policy Manag. 2022;11(4):536–538. doi:10.34172/ijhpm.2021.129
  46. Residency & Fellowship Programs. Graduate Medical Education. https://med.stanford.edu/gme/programs.html
  47. Coalition for Health AI. Blueprint for Trustworthy AI Implementation Guidance and Assurance for Healthcare. Published April 4th 2023. https://coalitionforhealthai.org/papers/blueprint-for-trustworthy-ai_V1.0.pdf
  48. Savage T, Wang J, Shieh L. A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation. JMIR Med Inform. 2023 Nov 27;11:e49886. doi: 10.2196/49886.
  49. Pagano S, Holzapfel S, Kappenschneider T, et al. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol. 2023;24(1):61. Published 2023 Nov 28. doi:10.1186/s10195-023-00740-4
  50. Surapaneni KM. Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study. JMIR Med Educ. 2023 Nov 7;9:e47191. doi: 10.2196/47191.
  51. Shah NH, Entwistle D, Pfeffer MA. Creation and Adoption of Large Language Models in Medicine. JAMA. 2023;330(9):866–869. doi:10.1001/jama.2023.14217
  52. Pagano S, Holzapfel S, Kappenschneider T et al. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol. 2023;24(1):61. doi:10.1186/s10195-023-00740-4
  53. Choi HS, Song JY, Shin KH, Chang JH et al. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat Oncol J. 2023;41(3):209–216. doi:10.3857/roj.2023.00633
  54. Fleming SL, Lozano A, Haberkorn WJ et al. MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records. Dec 2023. arXiv:2308.14089; doi:10.48550/arXiv.2308.14089.
  55. Karabacak M, Margetis K. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus. 2023;15(5):e39305. Published online May 21 2023. doi:10.7759/cureus.39305
  56. Garcia P, Ma SP, Shah S et al. Artificial Intelligence–Generated Draft Replies to Patient Inbox Messages. JAMA Netw Open. 2024;7(3):e243201. doi:10.1001/jamanetworkopen.2024.3201
  57. Office of the National Coordinator for Health Information Technology. Health Data, Technology, and Interoperability: Certification Program Updates, Algorithm Transparency, and Information Sharing. Federal Register. January 9, 2024;89(6):[page numbers]. Available from: Federal Register.
  58. Ali R, Tang OY, Connolly ID et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery. 2023;93(6):1353–1365. doi:10.1227/neu.0000000000002632
  59. Gilson A, Safranek CW, Huang T et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment [published correction appears in JMIR Med Educ. 2024 Feb 27;10:e57594]. JMIR Med Educ. 2023;9:e45312. Published Feb 8 2023. doi:10.2196/45312
  60. Heuer AJ. More Evidence That the Healthcare Administrative Burden Is Real, Widespread and Has Serious Consequences; Comment on "Perceived Burden Due to Registrations for Quality Monitoring and Improvement in Hospitals: A Mixed Methods Study". Int J Health Policy Manag. 2022;11(4):536–538. doi:10.34172/ijhpm.2021.129
  61. Wang H, Gao C, Dantona C et al. DRG-LLaMA: tuning LLaMA model to predict diagnosis-related group for hospitalized patients. NPJ Digit Med. 2024;7(1):1–9. doi:10.1038/s41746-023-00989-3
  62. Aiumtrakul N, Thongprayoon C, Arayangkool C et al. Personalized Medicine in Urolithiasis: AI Chatbot-Assisted Dietary Management of Oxalate for Kidney Stone Prevention. J Pers Med. 2024;14(1):107. doi:10.3390/jpm14010107
  63. Heston TF. Safety of Large Language Models in Addressing Depression. Cureus. 2023;15(12):e50729. doi:10.7759/cureus.50729
  64. Pushpanathan K, Lim ZW, Er Yew SM et al. Language Model Chatbots’ Accuracy, Comprehensiveness, and Self-Awareness in Answering Ocular Symptom Queries. iScience. 2023;26(11):108163. doi:10.1016/j.isci.2023.108163
  65. Currie G, Barry K. ChatGPT in Nuclear Medicine Education. July 2023. J Nucl Med Technol. 2023 Sep;51(3):247–254. doi:10.2967/jnmt.123.265844
  66. Zhang L, Tashiro S, Mukaino M et al. Use of artificial intelligence large language models as a clinical tool in rehabilitation medicine: a comparative test case. September 2023. J Rehabil Med. 2023;55:jrm13373–jrm13373. doi:10.2340/jrm.v55.13373
  67. Walton N, Gracefo S, Sutherland N et al. Evaluating ChatGPT as an Agent for Providing Genetic Education. bioRxiv (Cold Spring Harbor Laboratory). Published online October 29, 2023. doi:10.1101/2023.10.25.564074
  68. Chin HL, Goh DLM. Pitfalls in clinical genetics. Singapore Med J. 2023;64(1):53–58. doi:10.4103/singaporemedj.smj-2021-329
  69. Sahni NR, Stein G, Zemmel R, Cutler D. The Potential Impact of Artificial Intelligence on Health Care Spending. National Bureau of Economic Research. Published January 1, 2023. Accessed March 26, 2024.
  70. Sahni NR, Carrus B. Artificial Intelligence in U.S. Health Care Delivery. July 2023. The New England Journal of Medicine. 2023;389(4):348–358. doi:10.1056/nejmra2204673
  71. Jindal JA, Lungren MP, Shah NH. Ensuring useful adoption of generative artificial intelligence in healthcare. J Am Med Inform Assoc. Published online March 7, 2024. doi:10.1093/jamia/ocae043
  72. Rau A, Rau S, Zoeller D et al. A Context-based Chatbot Surpasses Trained Radiologists and Generic ChatGPT in Following the ACR Appropriateness Guidelines. Radiology. July 2023;308(1). doi:10.1148/radiol.230970
  73. Omiye JA, Lester JC, Spichak S et al. Large language models propagate race-based medicine. NPJ Digit Med. October 2023;6(1):195. doi:10.1038/s41746-023-00939-z
  74. Acerbi A, Stubbersfield JM. Large language models show human-like content biases in transmission chain experiments. Proc Natl Acad Sci U S A. 2023;120(44):e2313790120. doi:10.1073/pnas.2313790120
  75. Guleria A, Krishan K, Sharma V et al. ChatGPT: ethical concerns and challenges in academics and research. September 2023. J Infect Dev Ctries. 2023;17:1292–1299. doi:10.3855/jidc.18738
  76. Hanna JJ, Wakene AD, Lehmann CU et al. Assessing Racial and Ethnic Bias in Text Generation for Healthcare-Related Tasks by ChatGPT. medRxiv (Cold Spring Harbor Laboratory). Published online August 28, 2023. doi:10.1101/2023.08.28.23294730
  77. Levkovich I, Elyoseph Z. Suicide Risk Assessments Through the Eyes of ChatGPT-3.5 Versus ChatGPT-4: Vignette Study. JMIR Mental Health. September 2023;10(1):e51232. doi:10.2196/51232
  78. Heming C, Abdalla M, Ahluwalia M et al. Benchmarking Bias: Expanding Clinical AI Model Card to Incorporate Bias Reporting of Social and Non-Social Factors. Accessed March 2024. https://arxiv.org/pdf/2311.12560.pdf
  79. Thomas, D. Revolutionizing Failure Modes and Effects Analysis with ChatGPT: Unleashing the Power of AI Language Models. J Fail. Anal. and Preven. May 2023;23(3):911–913. doi:10.1007/s11668-023-01659-y
  80. Research C for DE and. FDA Adverse Event Reporting System (FAERS) Public Dashboard. FDA. Published online October 29, 2020. https://www.fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-public-dashboard
  81. MAUDE - Manufacturer and User Facility Device Experience. Fda.gov. Published 2012. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm
  82. Galido PV, Butala S, Chakerian M et al. A Case Study Demonstrating Applications of ChatGPT in the Clinical Management of Treatment-Resistant Schizophrenia. Cureus. Published online April 26, 2023;15(4):e38166. doi:10.7759/cureus.38166
Posted April 16, 2024.
A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)
Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael A. Pfeffer, Nigam H. Shah
medRxiv 2024.04.15.24305869; doi: https://doi.org/10.1101/2024.04.15.24305869
Subject Area

  • Health Informatics