Diagnostic Performance of Generative AI and Physicians: A Systematic Review and Meta-Analysis.

Background: The rapid advancement of generative artificial intelligence (AI) has revolutionized the understanding and generation of human language. The integration of generative AI models into healthcare has shown potential for improving medical diagnostics, yet a comprehensive evaluation of their diagnostic performance, and a comparison of that performance with that of physicians, has not been extensively explored. Methods: In this systematic review and meta-analysis, a comprehensive search of Medline, Scopus, Web of Science, Cochrane Central, and medRxiv was conducted for studies published from June 2018 through December 2023, focusing on those that validate generative AI models for diagnostic tasks. Meta-analysis was performed to summarize the performance of the models and to compare the accuracy of the models with that of physicians. The quality of studies was assessed using the Prediction Model Study Risk of Bias Assessment Tool. Results: The search resulted in 54 studies being included in the meta-analysis, with 13 of these also used in the comparative analysis. Eight models were evaluated across 17 medical specialties. The overall accuracy for generative AI models across 54 studies was 57% (95% confidence interval [CI]: 51–63%). The I-squared statistic of 96% signifies a high degree of heterogeneity among the study results. Meta-regression analysis of generative AI models revealed significantly improved accuracy for GPT-4, and reduced accuracy for some specialties such as Neurology, Endocrinology, Rheumatology, and Radiology. The comparison meta-analysis demonstrated that, on average, physicians exceeded the accuracy of the models (difference in accuracy: 14% [95% CI: 8–19%], p-value <0.001). However, in the performance comparison between GPT-4 and physicians, GPT-4 performed slightly better than non-experts (-4% [95% CI: -10–2%], p-value = 0.173) and slightly worse than experts (6% [95% CI: -1–13%], p-value = 0.091). The quality assessment indicated a high risk of bias in the majority of studies, primarily due to small sample sizes.


Introduction
2][3][4][5][6][7][8] These advanced computational systems have demonstrated exceptional proficiency in interpreting and generating human language, thereby setting new benchmarks in AI's capabilities. Generative AI models, with their deep learning architectures, have rapidly evolved, showcasing a remarkable understanding of complex language structures, contexts, and even images. This evolution has not only expanded the horizons of AI but also opened new possibilities in various fields, including healthcare. 9,10
The integration of generative AI models in the medical domain has spurred a growing body of research focusing on their diagnostic capabilities. 11 Studies have extensively examined the performance of these models in interpreting clinical data, understanding patient histories, and even suggesting possible diagnoses. 12,13 In medical diagnoses, the accuracy, speed, and efficiency of generative AI models in processing vast amounts of medical literature and patient information have been highlighted, positioning them as valuable tools. This research has begun to outline the strengths and limitations of generative AI models in diagnostic tasks in healthcare.
Despite the growing research on generative AI in medical diagnostics, there remains a significant gap in the literature: a comprehensive meta-analysis of the diagnostic capabilities of the models, followed by a comparison of their performance with that of physicians. 14 Such a comparison is crucial for understanding the practical implications and effectiveness of generative AI models in real-world medical settings. While individual studies have provided insights into the capabilities of generative AI models, 12,13 a systematic review and meta-analysis is necessary to aggregate these findings and draw more robust conclusions about their comparative effectiveness against traditional diagnostic practices by physicians. This paper aims to bridge the existing gap in the literature by conducting a meticulous meta-analysis of the diagnostic capabilities of generative AI models in healthcare. Our focus is to provide a comprehensive evaluation of the diagnostic performance of generative AI models and a comparison of that performance with that of physicians. By synthesizing the findings from various studies, we endeavor to offer a nuanced understanding of the effectiveness, potential, and limitations of generative AI models in medical diagnostics. This analysis is intended to serve as a foundational reference for future research and practical applications in the field, ultimately contributing to the advancement of AI-assisted diagnostics in healthcare.

Protocol and Registration
This systematic review was prospectively registered with PROSPERO (CRD42023494733). Our study adhered to the relevant sections of guidelines from the Preferred Reporting Items for a Systematic Review and Meta-analysis (PRISMA) of Diagnostic Test Accuracy Studies. 15,16 All stages of the review (title and abstract screening, full-text screening, data extraction, and assessment of bias) were performed in duplicate by two independent reviewers (H.Takita and D.U.), and disagreements were resolved by discussion with a third independent reviewer (H.Tatekawa).
We searched the following electronic databases for literature from June 2018 through December 2023: Medline, Scopus, Web of Science, Cochrane Central, and medRxiv. June 2018 is when the first generative AI model was published. 1 We included all articles that fulfilled the following inclusion criterion: primary research studies that validate a generative AI model for diagnosis. We applied the following exclusion criteria to our search: review articles, case reports, comments, editorials, retracted articles, and articles not related to diagnostic performance.

Data Extraction
Titles and abstracts were screened before full-text screening. Data were extracted using a predefined data extraction sheet. A count of excluded studies, including the reason for exclusion, was recorded in a PRISMA flow diagram. 16 We extracted the following information from each study for the meta-analysis of generative AI performance: the first author, model with its version, model task, test dataset type (internal, external, or unknown), 18 medical specialty, accuracy, sample size, and publication status (pre-print or peer-reviewed). Most generative AI models disclosed only their training period, without any information on which data were used for training. Therefore, when a generative AI model was tested with data from outside its training period, the test dataset type was classified as external, and when it was tested with data that were publicly available during the training period, the type was classified as unknown. In addition, when both the model's and the physicians' diagnostic performance were presented in the same paper, we extracted both for the comparative analysis. We also considered the type of physician involved in relevant studies: physicians were classified as non-experts if they were trainees or residents, and those beyond this stage in their careers were categorized as experts. When a single model used multiple prompts and individual performances were available in one article, we took their average.
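The extraction rules above can be illustrated with a minimal Python sketch. It is not the extraction sheet used in the study: the function names, the role labels, and the use of a single release date and training cutoff date per study are assumptions made for illustration only.

```python
from datetime import date
from statistics import mean
from typing import Sequence

# Hypothetical role labels; the text only states that trainees and residents
# are treated as non-experts and everyone beyond that stage as experts.
NON_EXPERT_ROLES = {"trainee", "resident"}


def classify_test_dataset(test_data_released: date, training_cutoff: date) -> str:
    """Label the test set 'external' if it appeared after the model's training
    period, otherwise 'unknown' (it may have been seen during training)."""
    return "external" if test_data_released > training_cutoff else "unknown"


def classify_physician(role: str) -> str:
    """Map a free-text role onto the expert / non-expert split."""
    return "non-expert" if role.strip().lower() in NON_EXPERT_ROLES else "expert"


def average_prompt_accuracy(accuracies: Sequence[float]) -> float:
    """Collapse several prompt-level accuracies for one model into their mean."""
    return mean(accuracies)
```

Under these assumptions, a test set released after the training cutoff is labeled external, and a model evaluated with several prompts contributes a single averaged accuracy to the meta-analysis.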

Quality Assessment
We used the Prediction Model Study Risk of Bias Assessment Tool (PROBAST) to assess papers for bias and applicability. 19 This tool uses signaling questions in four domains (participants, predictors, outcomes, and analysis) to provide both an overall and a granular assessment. We did not include some PROBAST signaling questions because they are not relevant to generative AI models. Details of modifications made to PROBAST are in Appendix Table S1 (online).

Statistical Analysis
First, we conducted a meta-analysis of generative AI studies reporting accuracy data to estimate the pooled diagnostic accuracy. We then performed a meta-regression analysis on the accuracy of these models to identify sources of heterogeneity across studies, incorporating covariates such as model type, medical specialty, task of the model, type of test dataset, level of bias, and publication status. Second, we compared the diagnostic performance of generative AI models with that of physicians. For this analysis, we used the difference in accuracy, calculated as the physicians' accuracy minus the models' accuracy, so that positive values indicate that physicians outperformed the model (the sign convention used in Figure 4). An inverse-variance-weighted random-effects model with the DerSimonian-Laird estimator of the between-study variance was used for pooling, and confidence intervals (CI) for individual study results were calculated using normal approximation intervals based on summary measures. The random-effects model (DerSimonian-Laird method), rather than a fixed-effects model, was selected at the protocol stage because of the expected heterogeneity of the included studies. To assess publication bias, we used a funnel plot, evaluating effect size and standard error as described by Egger et al. 20 Statistical significance was set at a P value of 0.05. All calculations were performed using R (version 4.0.0) with the 'metafor' package.
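As an illustration of the pooling step described above, the following Python sketch implements inverse-variance random-effects pooling with the DerSimonian-Laird estimator and the I-squared statistic. It is a sketch under stated assumptions rather than the analysis code of this study, which used R and 'metafor': the function names are hypothetical, and the use of untransformed proportions with variance p(1 - p)/n is an assumption, since the text does not state whether a transformation was applied.

```python
import math
from typing import Dict, Sequence, Tuple


def accuracy_and_variance(correct: int, total: int) -> Tuple[float, float]:
    """Raw accuracy and its normal-approximation variance p(1 - p) / n.
    Assumption: untransformed proportions; p = 0 or p = 1 gives zero variance
    and would need a continuity correction before pooling."""
    p = correct / total
    return p, p * (1.0 - p) / total


def dersimonian_laird(effects: Sequence[float], variances: Sequence[float],
                      z: float = 1.96) -> Dict[str, float]:
    """Random-effects pooled estimate with the DerSimonian-Laird tau^2."""
    if len(effects) < 2 or len(effects) != len(variances):
        raise ValueError("need at least two studies with matching variances")
    if any(v <= 0 for v in variances):
        raise ValueError("all variances must be positive")

    # Fixed-effect (inverse-variance) weights and Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1

    # Method-of-moments estimate of the between-study variance, truncated at 0.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights include tau^2 in each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)

    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "ci_low": pooled - z * se,
            "ci_high": pooled + z * se, "tau2": tau2, "q": q, "i2": i2}


if __name__ == "__main__":
    # Made-up counts (correct, total) for illustration; not data from the review.
    studies = [(45, 100), (30, 40), (120, 250)]
    ys, vs = zip(*(accuracy_and_variance(c, n) for c, n in studies))
    print(dersimonian_laird(ys, vs))
```

The same function can pool accuracy differences instead of accuracies, provided each study supplies a difference and its variance.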
Comparisons with both expert and non-expert physicians were found for GPT-4, GPT-3.5, GPT-4V, and GPT-3, whereas only comparisons with experts were found for Llama 2. The studies covered a variety of medical specialties. Ophthalmology was the most frequently studied specialty with 4 articles, followed by Radiology with 3 articles. General medicine and Emergency medicine were evaluated in 2 articles each. Endocrinology and Urology were each represented once. For model tasks, free text tasks were more prevalent with 12 articles, whereas choice tasks were represented in 3 articles. Regarding test types, external testing was more common with 7 articles, compared with 6 articles of unspecified or unknown test types.

Quality Assessment
PROBAST assessment led to an overall rating of 45/54 (83%) studies at high risk of bias, 8/54 (15%) studies at low risk of bias, 10/54 (19%) studies at high concern for generalizability, and 44/54 (81%) studies at low concern for generalizability (Figure 2). The main factors behind these ratings were the evaluation of models on small test sets and the inability to confirm external evaluation because the training data of generative AI models are unknown.
Detailed results are shown in Appendix Table S2 (online).

Meta-analysis for generative AI models
The pooled accuracy of generative AI models showed varied performance across different models and medical specialties (Figure 3 and Appendix Figures S1-3 [online]). The overall accuracy for generative AI models was 57% (95% CI: 51-63%). The I-squared statistic of 96% signifies a high degree of heterogeneity among the study results. In the meta-regression analysis examining the performance of various generative AI models across different specialties, the results revealed differences in effectiveness (Table 2). For the models, GPT-4 showed statistically significant performance with a coefficient of 26.1 (95% CI: 6. Other specialties such as Pediatrics, Gynecology, Urology, Otolaryngology, Orthopedic surgery, Ophthalmology, and Plastic surgery showed positive coefficients but not significant differences. No significant heterogeneity was observed based on the risk of bias or on publication status. Overall, the meta-regression analysis indicates that, among the generative AI models, GPT-4 significantly outperforms the others, though performance varies considerably across medical specialties, with some showing negative impacts. We assessed publication bias by using a regression analysis to quantify funnel plot asymmetry (Appendix Figures S1 [online]), which suggested a low risk of publication bias (p = 0.572).
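For readers unfamiliar with regression tests of funnel plot asymmetry, the sketch below shows the classical Egger formulation in Python: each study's standardized effect (effect divided by its standard error) is regressed on its precision (one over the standard error), and an intercept far from zero suggests asymmetry. The exact variant used in this study is not stated in the text, so this formulation and the function name are assumptions. The p-value is omitted because it requires a t distribution with n - 2 degrees of freedom, which the Python standard library does not provide.

```python
import math
from typing import Dict, Sequence


def egger_regression(effects: Sequence[float],
                     std_errors: Sequence[float]) -> Dict[str, float]:
    """Ordinary least squares of (effect / SE) on (1 / SE).
    Returns the intercept (asymmetry term), slope, and the intercept's t statistic."""
    n = len(effects)
    if n < 3 or n != len(std_errors):
        raise ValueError("need at least three studies with matching standard errors")

    x = [1.0 / se for se in std_errors]                 # precision
    z = [y / se for y, se in zip(effects, std_errors)]  # standardized effect

    x_bar = sum(x) / n
    z_bar = sum(z) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))

    slope = sxz / sxx
    intercept = z_bar - slope * x_bar

    # Residual variance with n - 2 degrees of freedom, then SE of the intercept.
    rss = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    resid_var = rss / (n - 2)
    se_intercept = math.sqrt(resid_var * (1.0 / n + x_bar ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")

    return {"intercept": intercept, "slope": slope,
            "se_intercept": se_intercept, "t": t_stat, "df": float(n - 2)}
```

A t statistic near zero, as implied by the reported p = 0.572, is consistent with little evidence of small-study effects.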

Discussion
In this systematic review and meta-analysis, we analyzed the diagnostic performance of generative AI and physicians. We initially identified 13,966 studies, ultimately including 54 in the meta-analysis and 13 in the comparative analysis with physicians. The study spanned various AI models and medical specialties, with GPT-4 being the most evaluated. Quality assessment revealed a majority of studies at high risk of bias. The meta-analysis showed a pooled accuracy of 57% (95% CI: 51-63%) for generative AI models. Meta-regression analysis highlighted significant differences in effectiveness of different AI models across medical fields. The comparative analysis revealed that physicians generally outperformed AI models, although in non-expert settings, some AI models showed comparable performance. To the best of our knowledge, this is the first meta-analysis of generative AI models in diagnostic tasks. This comprehensive study highlights the varied capabilities and limitations of generative AI in medical diagnostics.
The meta-analysis of generative AI models in healthcare reveals crucial insights for clinical practice. Despite the overall modest accuracy of 57% for generative AI models in medical applications, the significantly higher accuracy of GPT-4 suggests its potential utility in certain clinical scenarios. The variation in effectiveness across specialties, particularly the lower effectiveness in fields like Neurology, Endocrinology, Rheumatology, and Radiology, underscores the need for cautious implementation and further refinement of AI models in these areas. The data indicate that the knowledge of generative AI models is stronger in some medical specialties than in others and that, by understanding and making use of these characteristics, the models have the potential to function as valuable support tools in medical settings. Importantly, the close performance of GPT-4 to physicians in non-expert scenarios highlights the possibility of AI augmenting healthcare delivery in resource-limited settings or as a preliminary diagnostic tool, thereby potentially increasing accessibility and efficiency in patient care. 73,74
The comparison between generative AI and physician performances, particularly in the context of medical education, offers intriguing perspectives. 75 The overall higher accuracy of physicians compared to AI models emphasizes the irreplaceable value of human judgement and experience in medical decision-making. However, the comparable performance of GPT-4 and physicians in non-expert settings reveals an opportunity for integrating AI into medical training. This could include using AI as a teaching aid for medical students and residents, especially in simulating non-expert scenarios where AI's performance is nearly equivalent to that of healthcare professionals. 76 Such integration could enhance learning experiences, offering diverse clinical case studies and facilitating self-assessment and feedback. Additionally, the narrower performance gap between GPT-4 and physicians even in expert settings suggests that AI could be used to supplement advanced medical education, helping to identify areas for improvement and providing supporting information. This approach could foster a more dynamic and adaptive learning environment, preparing future medical professionals for an increasingly digital healthcare landscape.
Although there are no statistically significant differences among the risks of bias, the PROBAST quality assessment reveals a high risk of bias in 80% of studies. 19 This raises significant concerns about the reliability of current generative AI research in healthcare and highlights the crucial need for rigorous and transparent methodologies, including extensive external evaluation to assess real-world performance accurately. 77 Moreover, the transparency of training data and its collection period is paramount. Without this transparency, it is impossible to determine whether the test dataset is an external dataset or not. Such transparency ensures an understanding of the model's knowledge, context, and limitations, aids in identifying potential biases, and facilitates independent replication and validation, which are fundamental to scientific integrity. As generative AI continues to evolve, fostering a culture of rigorous transparency is essential to ensure the safe, effective, and equitable application of these models in clinical settings, 78 ultimately enhancing the quality of healthcare delivery and medical education.
The methodology of this study, while comprehensive, has limitations. This meta-analysis involved primary studies with considerable heterogeneity. The performance of generative AI models might vary significantly in real-world scenarios, which are often more complex than research settings. There were not many studies that compared generative AI and physicians using the same sample. Future research should focus on addressing the identified limitations. This includes conducting studies with more diverse datasets, exploring the performance of generative AI models in varied clinical environments, and examining their impact on different patient demographics. Additionally, longitudinal studies assessing the long-term efficacy and impact of generative AI models in clinical practice would be valuable.
In conclusion, this meta-analysis provides a nuanced understanding of the capabilities and limitations of generative AI in medical diagnostics. While generative AI models, particularly advanced iterations like GPT-4, have shown progressive improvements and hold promise for assisting in diagnosis, their effectiveness remains highly variable across different models and medical specialties. With an overall moderate accuracy of 57%, generative AI models are not yet reliable substitutes for expert physicians but may serve as valuable aids in non-expert scenarios and as educational tools for medical trainees. The findings also underscore the need for continued advancements and specialization in model development, as well as rigorous, externally validated research to overcome the prevalent high risk of bias and ensure the effective integration of generative AI into clinical practice. As the field evolves, continuous learning and adaptation for both generative AI models and medical professionals are imperative, alongside a commitment to transparency and stringent research standards. This approach will be crucial in harnessing the potential of generative AI models to enhance healthcare delivery and medical education while safeguarding against their limitations and biases.

Figure 3: Pooled accuracy. This figure presents a comparative analysis of pooled accuracy across different studies. Panel A (left) illustrates the pooled accuracy for a range of models. Panel B (right) displays the pooled accuracy for various medical specialties. The dotted vertical line (57%) indicates the average pooled accuracy across all models or specialties, and the gray shaded area (51-63%) represents the 95% confidence interval (CI) for this average. Each data point is accompanied by a horizontal error bar which denotes the 95% CI. The numeric values in parentheses next to each entity represent the lower and upper bounds of the 95% CI for that specific model or specialty.

Figure 4: Comparison analysis results. This figure demonstrates the differences in accuracy between various AI models and physicians. It specifically compares the performance of AI models against the overall accuracy of physicians, as well as against non-experts and experts separately. Each horizontal line represents the range of accuracy differences for the model compared to the physician category. The percentage values displayed on the right-hand side correspond to the mean differences, with the values in parentheses providing the 95% confidence intervals for these estimates. The dotted vertical line marks the 0% difference threshold, indicating where the model's accuracy is exactly the same as that of the physicians. Positive values (to the right of the dotted line) suggest that the physicians outperformed the model, whereas negative values (to the left) indicate that the model was more accurate than the physicians.
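The sign convention in the Figure 4 caption can be made concrete with a small Python sketch that computes a single study's accuracy difference (physician minus model) and its normal-approximation 95% CI. The function name and the treatment of the two accuracies as independent proportions are assumptions for illustration. When physicians and the model are evaluated on the same cases, the samples are paired and this variance formula is only a rough approximation.

```python
import math
from typing import Tuple


def accuracy_difference(phys_correct: int, phys_total: int,
                        model_correct: int, model_total: int,
                        z: float = 1.96) -> Tuple[float, float, float]:
    """Return (difference, ci_low, ci_high) for physician minus model accuracy.
    Positive values mean the physicians were more accurate than the model."""
    p_phys = phys_correct / phys_total
    p_model = model_correct / model_total
    diff = p_phys - p_model
    # Wald standard error for the difference of two independent proportions.
    se = math.sqrt(p_phys * (1.0 - p_phys) / phys_total
                   + p_model * (1.0 - p_model) / model_total)
    return diff, diff - z * se, diff + z * se
```

An interval that includes zero, as in the GPT-4 comparisons with experts and with non-experts reported above, means that the data do not separate the two groups at the 0.05 level.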

Table 2: Meta-regression results