Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care using a data-driven convergent synthesis approach. We searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4,349 initial records, 89 studies across 29 medical specialties were included, primarily examining models based on the GPT-3.5 (53.2%, n=66 of 124 different LLMs examined per study) and GPT-4 (26.6%, n=33/124) architectures in medical question answering, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations included 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations included 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. In conclusion, this study is the first review to systematically map LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.


Introduction
[...]2,3 One of the main reasons for their popularity is the remarkable ability to mimic human writing, a result of extensive training on massive amounts of text and reinforcement learning from human feedback.4 Since most LLMs are designed as general-purpose chatbots, recent research has focused on developing specialized models for the medical domain, such as Meditron or BioMistral, by enriching the training data of LLMs with medical knowledge.5,6 However, this approach to fine-tuning LLMs requires significant computational resources that are not available to everyone and is not applicable to closed-source LLMs, which are often the most powerful. Therefore, another approach to improve LLMs for biomedicine is to use techniques such as Retrieval-Augmented Generation (RAG).7 RAG allows information to be dynamically retrieved from medical databases during the model generation process, enriching the output with medical knowledge without the need to train the model.

[...] review were resolved through discussion with the author FB. In cases of studies with incomplete data, we attempted to contact the corresponding authors for clarification or additional information.

Data analysis
Because we sought to include a diversity of outcomes and study designs, including qualitative, quantitative, and mixed methods, a meta-analysis was not practical. Instead, a data-driven convergent synthesis approach was selected for the thematic synthesis of LLM applications and limitations in patient care.19 Following Thomas and Harden, FB coded each study's numerical and textual data in Dedoose using free line-by-line coding.20,21 Initial codes were then systematically categorized into descriptive and subsequently into analytic themes, incorporating new codes for emerging concepts within a hierarchical tree structure. Upon completion of the codebook, FB and LH reviewed each study to ensure consistent application of codes. Discrepancies were resolved through discussion with the author KKB, and the final codebook and analytical themes were discussed and refined in consultation with all contributing authors.

Screening results
Of the 4,349 reports identified, 2,991 underwent initial screening, and 126 were deemed suitable for potential inclusion and underwent full-text screening. Two articles could not be retrieved because the authors or the corresponding title and abstract could not be identified online. Following full-text screening, 35 articles were excluded, and 89 articles were included in the final review. Most studies were excluded because they targeted the wrong discipline (n=10/35, 28.6%) or population (n=7/35, 20%) or were not original research (n=8/35, 22.9%) (see Supplementary Section 2). For example, we evaluated a study that focused on classifying physician notes to identify patients without active bleeding who were appropriate candidates for thromboembolism prophylaxis.22 Although the classification tasks may lead to patient treatment, the primary outcome was informing clinicians rather than directly forwarding this information to patients. We also reviewed a study assessing the accuracy and completeness of several LLMs when answering methotrexate-related questions.23 This study was excluded because it focused solely on the pharmacological treatment of rheumatic disease. For a detailed breakdown of the inclusion and exclusion process at each stage, please refer to the PRISMA flowchart in Figure 1.
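The screening counts reported above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the variable names are hypothetical, and all numbers are taken directly from the text.

```python
# Counts as reported in the screening results (variable names are illustrative).
identified = 4349            # reports identified
screened = 2991              # reports that underwent initial screening
full_text_candidates = 126   # deemed suitable for full-text screening
not_retrieved = 2            # articles that could not be retrieved
excluded_full_text = 35      # excluded after full-text screening
included_reported = 89       # included in the final review

# Articles actually assessed in full text, and the resulting number included.
assessed_full_text = full_text_candidates - not_retrieved
included_check = assessed_full_text - excluded_full_text

# Main exclusion reasons reported in the text (these do not cover all 35).
exclusion_reasons = {
    "wrong discipline": 10,
    "wrong population": 7,
    "not original research": 8,
}

# Share of each reason among the excluded articles, rounded to one decimal.
shares = {
    reason: round(100 * n / excluded_full_text, 1)
    for reason, n in exclusion_reasons.items()
}

if __name__ == "__main__":
    print("assessed in full text:", assessed_full_text)
    print("included (recomputed):", included_check)
    for reason, pct in shares.items():
        print(f"{reason}: {pct}%")
```

The recomputed number of included studies matches the reported value, and the three listed reasons account for 25 of the 35 exclusions; the remaining reasons are given in the supplementary material.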

Characteristics of included studies
Table 1 summarizes the characteristics of the analyzed studies, including their setting, results, and conclusions.

Limitations of Large Language Models
The thematic synthesis of limitations resulted in two main concepts: one related to design limitations and one related to output.

Discussion
In this systematic review, we synthesized the current applications and limitations of LLMs in patient care across 29 medical specialties. The synthesis highlights key limitations in LLM design and output and provides a comprehensive framework and taxonomy for their future implementation and evaluation in healthcare settings.
Most articles examined the use of LLMs based on the GPT-3.5 or GPT-4 architecture for answering medical questions, followed by the generation of patient information, including medical text summarization or translation and clinical documentation. The conceptual synthesis of LLM limitations revealed two key concepts: the first related to design, including 6 second-order and 12 third-order codes, and the second related to output, including 9 second-order and 32 third-order codes.
Although many LLMs have been developed specifically for the biomedical domain in recent years, we found that ChatGPT has been a disruptor in the medical literature on LLMs, with GPT-3.5 and GPT-4 accounting for almost 80% of the LLMs examined in this systematic review. While it was not possible to conduct a meta-analysis of the performance on medical tasks, many authors provided a positive outlook towards the integration of LLMs into clinical practice. However, the use of proprietary models such as ChatGPT in the biomedical field raises concerns because the limited access to the underlying algorithms, training data, and data processing and storage mechanisms makes them non-transparent and, thus, significantly limits their applicability in healthcare.114 Furthermore, the integration of proprietary models into patient care applications makes these applications susceptible to performance changes associated with model updates, which may break existing functionalities and lead to harmful outcomes for patients. Therefore, especially in the biomedical field, open-source models such as BioMistral may offer a viable solution.6 Given the limited number of articles on open-source LLMs in our review, we strongly encourage future studies investigating the applicability of open-source LLMs in patient care.
We identified several key limitations regarding the design and output. Not surprisingly, many reports noted the limitation that the LLMs studied were not optimized for the medical domain. One possible solution to this limitation may be to provide medical knowledge during inference using RAG.115 [...]117-119 Although outperformed on specific tasks by specialized medical LLMs, such as Google's MedPaLM-2, this suggests that general-purpose LLMs can comprehend complex medical literature and case scenarios to a degree that meets professional standards.120 Furthermore, given the large amounts of data on which proprietary models such as ChatGPT are trained, it is plausible that they have been exposed to more medical data overall than smaller specialized models despite being generalist models.
It should also be noted that passing these exams does not equate to the practical competence required of a healthcare provider.121 In addition, reliance on exam-based assessments carries a significant risk of bias. For example, if the exam questions or similar variants are publicly available and, thus, may be present in the training data, the LLM does not demonstrate any knowledge outside of training data memorization.122 In fact, these types of tests can be misleading in estimating the model's true abilities in terms of comprehension or analytical skills.
Many studies have reported limitations in the output related to comprehensiveness, safety, correctness, reproducibility, and dependence of the output on the input/prompt and environment. Specifically, for correctness, we followed the taxonomy of Currie et al. to classify incorrect outputs more precisely into illusions, delusions, delirium, confabulation, and extrapolation, thus proposing a framework for a more precise and structured error classification to improve the characterization of incorrect outputs and enable more detailed performance comparisons with other research.43,112,113 On the other hand, a minority of studies have identified biases, for example, reflecting the unequal representation of certain content or the biases inherent in human-generated text in the training data.123 This may indicate that the implemented safeguards are effective. However, not much is known about the technology and developer policies of proprietary LLMs, and previous work has shown that automated jailbreak generation is possible across various commercial LLM chatbots.124 This also mirrors our concept of data-related limitations, particularly regarding the handling of sensitive health information. Together with the limited transparency about the origin of the training data and the unexplainable and non-deterministic nature of the output, this raises a key question when applying LLMs to the medical domain: how can we entrust our patients to LLMs if they are neither reliable nor transparent? Given that models like ChatGPT are already publicly accessible and widely used, patients may already refer to them for medical questions in much the same way they use Google Search, making concerns about their early adoption somewhat academic.125
In addition, the identified limitations in comprehensiveness, including the generation of highly complex content at an inappropriate reading level, which was above the 6th-grade level recommended by the American Medical Association (AMA) in almost all studies analyzed, may further limit their utility for patient information, particularly for patients with low health literacy.126 Overall, this can lead to results that are misleading and harmful, as described in many of the reports in our review. In addition to advances in the development of LLMs and the focus on open source, it will therefore be necessary to develop and implement a well-validated scale to determine the quality and safety of LLM outputs in medical practice, such as the recent effort made to adopt the widely recognized Physician Documentation Quality Instrument (PDQI-9) for the assessment of AI transcripts and clinical summaries.127 Finally, the implementation of regulatory mandates like the forthcoming European Union AI Act and the associated challenges faced by generative AI and LLMs, for example, in terms of training data transparency and validation of non-deterministic output, will show which approaches the companies will take to bring these models into compliance with the law. How the notified bodies interpret and enforce the law in practice will likely be decisive for the further development of LLMs in the biomedical sector.128
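To illustrate how the reading level of LLM-generated patient information could be compared against the 6th-grade level recommended by the AMA, the following Python sketch estimates a grade level with the Flesch-Kincaid Grade Level formula. This is an assumption for illustration: the text above does not state which readability metric the included studies used, the syllable counter is a crude vowel-group heuristic, and the function names and example sentences are hypothetical.

```python
import re

RECOMMENDED_MAX_GRADE = 6  # 6th-grade reading level recommended by the AMA


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (minimum one)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Estimate the Flesch-Kincaid Grade Level of an English text.

    Formula: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


# Hypothetical example texts, not taken from any included study.
plain_text = "Take one pill each day. Call us if you feel sick."
complex_text = (
    "Anticoagulation therapy necessitates periodic laboratory monitoring "
    "to mitigate hemorrhagic complications."
)

if __name__ == "__main__":
    for label, text in [("plain", plain_text), ("complex", complex_text)]:
        grade = flesch_kincaid_grade(text)
        verdict = "within" if grade <= RECOMMENDED_MAX_GRADE else "above"
        print(f"{label}: grade {grade:.1f} ({verdict} the recommended level)")
```

Because the syllable heuristic is approximate, scores from this sketch should be treated as rough estimates; a validated readability tool would be preferable for any formal evaluation.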

Limitations
Our study has limitations. First, our review focused on LLM applications and limitations in patient care, thus excluding research directed at clinicians only. Future studies may extend our synthesis approach to LLM applications that explicitly focus on healthcare professionals. Second, there is a risk that potentially eligible studies were not included in our analysis if they were not present in the 5 databases reviewed or were not available in English. However, we screened nearly 3,000 articles in total and systematically analyzed 89 articles, providing a comprehensive overview of the current state of LLMs in patient care, even if some articles could have been missed. Third, the rapid development and advancement of LLMs make it difficult to keep this systematic review up to date. For example, Gemini 1.5 Pro was published in February 2024, and corresponding articles are not included in this review, which synthesized articles from 2022 to 2023. Continued updates will be essential to monitor emerging areas and limitations in this rapidly evolving field.
In conclusion, this review provides a systematic overview of current LLM applications and limitations in patient care. Our conceptual synthesis provides a structured taxonomy that may lay the groundwork for both the implementation and critical evaluation of LLMs in healthcare settings.

Acknowledgements
This research is funded by the European Union (101079894). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the granting authority can be held responsible for them. The funding had no role in the study design, data collection and analysis, manuscript preparation, or decision to publish.

Data availability
All data generated or analyzed during this study are included in this published article and its supplementary information files.

Table 1. Overview of included studies and corresponding authors, year of publication, affiliation countries of authors, study design, medical specialty, purpose of study, large language model (LLM)/tool examined, target user, evaluation/setting, main outcome, and conclusion.

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram.

Figure 2. Schematic illustration of the identified concepts for the application of large language models (LLMs) in patient care.

Figure 3.

Table 2. Evaluation of included studies according to the Mixed Methods Appraisal Tool (MMAT) 2018.18 Can't tell; N, No; Y, Yes; S1, Are there clear research questions?; S2, Do the collected data allow to address the research questions?; 1.1, Is the qualitative approach appropriate to answer the research question?; 1.2, Are the qualitative data collection methods adequate to address the research question?; 1.3, Are the findings adequately derived from the data?; 1.4, Is the interpretation of results sufficiently substantiated by data?; 1.5, Is there coherence between qualitative data sources, collection, analysis and interpretation?; 2.1, Is randomization appropriately performed?; 2.2, Are the groups comparable at baseline?; 2.3, Are there complete outcome data?; 2.4, Are outcome assessors blinded to the intervention provided?; 2.5, Did the participants adhere to the assigned intervention?; 3.1, Are the participants representative of the target population?; 3.2, Are measurements appropriate regarding both the outcome and intervention (or exposure)?; 3.3, Are there complete outcome data?; 3.4, Are the confounders accounted for in the design and analysis?; 3.5, During the study period, is the intervention administered (or exposure occurred) as intended? Notes: Categories 4 and 5 are not listed as no studies with quantitative descriptive or mixed methods study designs were identified.