Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Importance Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Few studies, however, have had the scale and accuracy needed to be translated into clinical practice. The tide may be turning today with the power of large language models (LLMs). In this application, we evaluated the accuracy on medical licensing exams of the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs together with text instructions from the user. This study is the first to evaluate GPTs for interpreting medical images.
Objective This study aimed to evaluate the performance of GPT-4V on medical licensing examination questions with images, and to analyze the interpretability of its explanations.
Design, Setting, and Participants We used 3 sets of multiple-choice questions with images to evaluate GPT-4V performance. The first set consisted of United States Medical Licensing Examination (USMLE) sample questions from the National Board of Medical Examiners (NBME) for Step1, Step2CK, and Step3. The second set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on exam performance relative to its user base. The third set was the Diagnostic Radiology Qualifying Core Exam (DRQCE) from the American Board of Radiology. The study (including data analysis) was conducted from September to October 2023.
Main Outcomes and Measures The choice accuracy of GPT-4V was compared to that of two other large language models, GPT-4 and ChatGPT. The GPT-4V explanation was evaluated across 4 qualitative metrics: image misunderstanding, text hallucination, reasoning error, and non-medical error.
Results On the 3 exams with images, NBME, AMBOSS, and DRQCE, GPT-4V achieved accuracies of 86.2%, 62.0%, and 73.1%, respectively. GPT-4V outperformed ChatGPT and GPT-4 by 131.8% and 64.5% on average across the data sets. The model demonstrated a decreasing trend in performance as question difficulty increased in the AMBOSS dataset. GPT-4V achieved an accuracy of 90.7% on the full USMLE sample exam, exceeding the passing threshold of about 60% accuracy. Among the incorrect answers, 75.9% of responses included misinterpretation of the image. However, 39.0% of them could be easily solved with a short hint.
Conclusion In this cross-sectional study, GPT-4V achieved a high accuracy on USMLE questions, placing in the 70th-80th percentile among AMBOSS users preparing for the exam. The results suggest the potential of GPT-4V for clinical decision support. However, the explanations generated by GPT-4V revealed several issues, and explanation quality needs to improve before potential use in clinical decision support.


Introduction
Using computers to help make clinical diagnoses and guide treatments has been a goal of artificial intelligence since its inception. 1 The adoption of electronic health record (EHR) systems by hospitals in the US has resulted in an unprecedented amount of digital data associated with patient encounters. Computer-assisted clinical diagnostic support systems (CDSS) endeavor to enhance clinicians' decisions with patient information and clinical knowledge. 2 There is burgeoning interest in CDSS for enhanced imaging 3, often termed radiomics, in various disciplines such as breast cancer detection 4, COVID detection 5, diagnosing congenital cataracts 6, and hidden fracture location 7. [10] However, most previous imaging CDSS only highlight areas deemed significant by AI, [11][12][13][14][15] providing limited insight into the explanation of the diagnosis. 16 Large language models (LLMs) could generate explanations, as they are trained with reinforcement learning from human feedback to follow user requests to explain a question.
Typical LLM examples include Chat Generative Pre-trained Transformer (ChatGPT), a renowned chatbot released by OpenAI in November 2022, and its successor Generative Pre-trained Transformer 4 (GPT-4), released in March 2023. The influence of ChatGPT is attributed to its conversational prowess and its performance, which approaches or matches human-level competence in cognitive tasks spanning various domains, including medicine. 17 ChatGPT has achieved commendable results in the United States Medical Licensing Examinations, leading to discussions about the readiness of LLM applications for integration into clinical [18][19][20] and educational [21][22][23] environments.
One limitation of ChatGPT is that it can only read and generate text and is unable to process other data modalities, such as images. This limitation, known as "single-modality," is a common issue among many LLMs. 24,25 Advancements in multimodal LLMs promise enhanced capabilities and integration with diverse data sources. 26,27 OpenAI's recent introduction of GPT-4V has made strides toward bridging this divide. GPT-4V, a state-of-the-art multimodal LLM, is equipped with visual processing ability, enabling it to understand and describe visual content. 28 By incorporating GPT-4V into current imaging CDSS, physicians can ask open-ended questions pertaining to a patient's medical evaluation, taking into account all available information including images, symptoms, and lab results, allowing for an interactive experience in which AI suggests both a decision and an explanation to support physicians.
However, the ability of GPT-4V to analyze medical images remains unknown. GPT-4V must perform comparably to humans on assessments of medical knowledge and reasoning for users to have sufficient confidence in its responses. In this work, we aim to assess GPT-4V performance on medical licensing examination questions with images, as well as to analyze its explanations for interpretability by healthcare professionals.

Method
This cross-sectional study compared the performance of GPT-4V, GPT-4, and ChatGPT in answering medical licensing examination questions. This study also investigated the quality of GPT-4V explanations in answering these questions. The study protocol was deemed exempt by the Institutional Review Board at the VA Bedford Healthcare System, and informed consent was waived due to minimal risk to patients. This study was conducted in October 2023.

Medical Exam Data Collection
We obtained study questions from three sources. The United States Medical Licensing Examination (USMLE) consists of three steps required to obtain a medical license in the United States. The USMLE assesses a physician's ability to apply knowledge, concepts, and principles, which is critical to both health and disease management and is the foundation for safe, efficient patient care. The Step1, Step2 clinical knowledge (CK), and Step3 USMLE sample exams released by the National Board of Medical Examiners (NBME) consist of 119, 120, and 137 questions respectively. Each question contained multiple options to choose from. We then selected all questions with images, resulting in 19, 13, and 18 questions from Step1, Step2 CK, and Step3. Disciplines include but are not limited to radiology, dermatology, orthopedics, ophthalmology, and cardiology.
The sample exam included only a limited number of questions with images. Thus, we further collected similar questions from a source that is not publicly available and requires registration: AMBOSS, a widely used question bank for medical students, which provides exam performance data based on its users' performance. The performance of past AMBOSS students enabled us to assess the comparative effectiveness of the model. For each question, AMBOSS provides an expert-written hint that tips the student toward the answer and a difficulty level that ranges from 1 to 5. Levels 1, 2, 3, 4, and 5 represent the easiest 20%, 20%-50%, 50%-80%, 80%-95%, and 95%-100% of questions respectively. 29 We randomly selected 10 questions from each of the 5 difficulty levels and repeated this process for Step1, Step2 CK, and Step3. This resulted in a total of 150 questions.
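The sampling scheme described above (10 questions per difficulty level, repeated for each step) can be sketched as a stratified random draw. This is a minimal illustration and not the authors' code: the question pool, the field names `step` and `difficulty`, the function name `sample_by_difficulty`, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

STEPS = ("Step1", "Step2CK", "Step3")


def sample_by_difficulty(questions, per_level=10, levels=(1, 2, 3, 4, 5), seed=0):
    """Draw `per_level` questions at random from each difficulty level."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    by_level = defaultdict(list)
    for q in questions:
        by_level[q["difficulty"]].append(q)
    selected = []
    for level in levels:
        pool = by_level[level]
        if len(pool) < per_level:
            raise ValueError(f"not enough questions at difficulty {level}")
        selected.extend(rng.sample(pool, per_level))  # sampling without replacement
    return selected


# Hypothetical pool: 100 image questions per step, 20 at each difficulty level.
pool = [
    {"id": f"{step}-{i}", "step": step, "difficulty": i % 5 + 1}
    for step in STEPS
    for i in range(100)
]

selected = []
for step in STEPS:
    step_pool = [q for q in pool if q["step"] == step]
    selected.extend(sample_by_difficulty(step_pool))

print(len(selected))  # 3 steps x 5 levels x 10 questions = 150
```

Sampling within each difficulty stratum, rather than from the whole pool, keeps the five levels equally represented, which is what allows accuracy to be compared across difficulty later on.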
The Diagnostic Radiology Qualifying Core Exam (DRQCE), offered after 36 months of residency training, is an image-rich exam that evaluates a candidate's core fund of knowledge and clinical judgment across practice domains of diagnostic radiology. DRQCE is not publicly available and requires registration. We collected 26 questions with images from the preparation exam offered by the American Board of Radiology (ABR). Thus, we had a total of 226 questions with images from the three sources. To illustrate GPT-4V's potential as an imaging diagnostic support tool, we modified a patient case report 30 to resemble a typical "curbside consult" question between medical professionals. 31

How to Answer Image Questions using GPT-4V Prompt
GPT-4V took image and text data as inputs to generate textual outputs. Given that the input format (prompt) plays a key role in optimizing model performance, we followed the standard prompting guidelines of the visual question-answering task. Specifically, we prompted GPT-4V by first adding the image, then appending context (i.e., patient information) and questions, and finally providing multiple-choice options, each separated by a new line. An example user prompt and GPT-4V response are shown in Figure 1. When an image contained multiple sub-images, we uploaded each sub-image to GPT-4V. We did not use the hint in the prompt unless otherwise specified. The response consisted of the selected option as an answer, supported by a textual explanation to substantiate the decision. When using ChatGPT and GPT-4, which cannot handle image data, images were omitted from the prompt. Responses were collected from the September 25, 2023 version of the models. Each question was manually entered into the ChatGPT website independently (in a new chat window).
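The prompt layout described above (image first, then context and question, then one option per line) can be sketched as follows. In this study the prompts were entered manually through the website, so this is only an illustration of the ordering: the `VisualQuestionPrompt` container, the `build_prompt` helper, the `A: ...` option labels, and the image file name are assumptions made for the example, and the context is abbreviated from the Figure 1 question.

```python
from dataclasses import dataclass


@dataclass
class VisualQuestionPrompt:
    image_paths: list  # images are attached first, one entry per sub-image
    text: str          # context + question, then one option per line


def build_prompt(image_paths, context, question, options):
    """Assemble the text that follows the image(s) in the order described above."""
    lines = [f"{context.strip()} {question.strip()}"]
    # Each multiple-choice option goes on its own line.
    lines.extend(f"{label}: {text}" for label, text in options)
    return VisualQuestionPrompt(image_paths=list(image_paths), text="\n".join(lines))


prompt = build_prompt(
    image_paths=["chest_xray.png"],  # hypothetical file name
    context=(
        "A 55-year-old woman comes to the physician because of a 4-day "
        "history of chest pain and coughing up blood."
    ),
    question="Which of the following is the most likely diagnosis?",
    options=[
        ("A", "Sarcoidosis"),
        ("B", "Pericarditis"),
        ("C", "Bacterial pneumonia"),
        ("D", "Pulmonary embolism"),
        ("E", "Pneumothorax"),
    ],
)
print(prompt.text)
```

For the text-only models (ChatGPT and GPT-4), the same text would be used with an empty `image_paths` list, mirroring the omission of images described above.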

Evaluation Metrics
For answer accuracy, we evaluated the model's performance by comparing the model's choice with the correct choice provided by the exam board or question bank website. We defined accuracy as the ratio of the number of correct choices to the total number of questions.
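The accuracy definition above amounts to a simple ratio. A minimal sketch follows; the function name `accuracy` and the toy answer lists are assumptions made for illustration and are not study data.

```python
def accuracy(predicted, correct):
    """Ratio of correct choices to the total number of questions."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        raise ValueError("at least one question is required")
    n_correct = sum(p == c for p, c in zip(predicted, correct))
    return n_correct / len(correct)


# Toy example: the model matches the answer key on 3 of 4 questions.
model_choices = ["A", "C", "B", "D"]
answer_key = ["A", "C", "D", "D"]
print(accuracy(model_choices, answer_key))  # 0.75
```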
We also evaluated the quality of the explanations through preference voting by 3 healthcare professionals (one medical doctor, one registered nurse, and one medical student). For each question from the AMBOSS dataset (n=150), we asked the healthcare professionals to choose their preference between an explanation by GPT-4V, an explanation by an expert, or a tie.
Additionally, we asked healthcare professionals to evaluate GPT-4V explanations for sufficiency and comprehensiveness. 32,33 They determined whether the following information existed in the explanation:
1. Image interpretation: GPT-4V tried to interpret the image in the explanation, and such interpretation was sufficient to support its choice.
2. Question information: Explanations contained information related to the textual context (i.e., patient information) of the question, and such information was essential for GPT-4V's choice.
3. Comprehensive explanation: The explanation included comprehensive reasoning for all possible evidence (e.g., symptoms, lab results) that leads to the final answer.
Finally, for each question answered incorrectly, we asked healthcare professionals to check whether the explanation contained any of the following errors:
1. Image misunderstanding: a sentence in the explanation showed an incorrect interpretation of the image. Example: GPT-4V said that a bone in the image belonged to the hand, but it belonged to the foot.
2. Text hallucination: a sentence in the explanation contained incorrect information (other than about the image). Example: claiming Saxenda was just insulin.
3. Reasoning error: a sentence did not properly convert the information (either image or text) to an answer. Example: GPT-4V identified a patient trip occurring in the last 3 months and, despite the fact that Chagas disease usually develops 10-20 years after infection, still diagnosed the patient as having Chagas disease.
4. Non-medical error: any error that was not medical in nature. GPT is known to struggle with tasks requiring precise spatial localization, such as identifying chess positions on the board. It is also known to struggle with calculations, such as 1 + 1 = ?.

Statistical Analysis
GPT-4V's accuracies on the AMBOSS dataset were compared between different difficulty levels using unpaired chi-square tests with a significance level of 0.05. All analyses were conducted in Python (version 3.10.11).
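As an illustration of the test described above, the sketch below computes a Pearson chi-square test of independence on a 2 x 5 table of correct and incorrect answers by difficulty level. It is a dependency-free sketch, not the authors' analysis code: the function names are assumptions, the closed-form p-value used here is valid only for an even number of degrees of freedom (4 in this layout), and the counts are made up for the example and are NOT the values reported in Table 2.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi_square_p_value(stat, dof):
    """Upper-tail p-value; closed form that holds only for even degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("this sketch only supports even, positive dof")
    half = stat / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(dof // 2))


# Illustrative counts only: 30 questions per difficulty level (1 to 5).
correct = [27, 23, 19, 14, 9]
incorrect = [30 - c for c in correct]

stat, dof = chi_square_independence([correct, incorrect])
p_value = chi_square_p_value(stat, dof)
print(f"chi2 = {stat:.2f}, dof = {dof}, significant at 0.05: {p_value < 0.05}")
```

A statistics library would normally be used for the p-value in practice; the closed form is used here only to keep the example self-contained.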

Overall Answer Accuracy
For questions with images, GPT-4V achieved an accuracy of 84.1), passing the standard for the USMLE (about 60%). However, it achieved limited accuracy in several medical disciplines such as anatomy (25.0%), emergency medicine (25.0%), and pathology (50.0%). A grasp of images is essential to correctly answer the majority of questions in these disciplines.

Accuracy Decreases When Difficulty Increases
When asked questions without the hint, GPT-4V achieved accuracies of 60%, 64%, and 66% on AMBOSS Step1, Step2CK, and Step3. GPT-4V was in the 72nd, 76th, and 80th percentile among AMBOSS users who were preparing for Step1, Step2CK, and Step3 respectively. Table 2 shows a decreasing trend in performance as question difficulty increased in the AMBOSS dataset (P<0.05). However, the decreasing trend was not observed when GPT-4V was questioned with the hint. Out of 55 wrong answers without the hint, 17 were corrected by hints.
An example and detailed analysis are provided in the supplementary material figure 1.

Quality of Explanation
We first evaluated the healthcare professionals' preference between GPT-4V-generated explanations and expert-generated explanations. When GPT-4V answered incorrectly, it was no surprise that healthcare professionals overwhelmingly preferred expert explanations, as shown in Table 3. When GPT-4V answered correctly, votes favoring experts exceeded votes favoring GPT-4V by only 4, out of a total of 95 votes.
We further evaluated the quality of the GPT-4V-generated explanations by verifying whether each explanation included image and question text interpretation (Table 4). Among the 95 correct answers, 84.2% (n=80) of the responses contained an interpretation of the image, while 96.8% (n=92) aptly captured the information presented in the question. On the other hand, among the 55 incorrect answers, 92.8% (n=51) interpreted the image, and 89.1% (n=49) depicted the question's details. In terms of explanation comprehensiveness, GPT-4V offered a comprehensive explanation in 79.0% (n=75) of correct responses. In contrast, only 7.2% (n=4) of the wrong responses had a comprehensive explanation that led to the final choice.
We also evaluated the explanations that led GPT-4V to answer incorrectly across the 4 metrics outlined above: image misunderstanding, text hallucination, reasoning error, and non-medical error. Among questions with wrong answers (n=55), we found that 76.3% (n=42) included misunderstanding of the image, 45.5% (n=25) included reasoning errors, 18.2% (n=10) included text hallucination, and none included non-medical errors.

Case Study on Consult Conversation
The consultation conversation, regarding a 45-year-old woman with hypertension, fatigue, and altered mental status, is provided in supplementary material figure 2. We found that the interactive design of GPT-4V allowed physicians to seek additional information by posing follow-up questions. Specifically, GPT-4V initially provided an irrelevant response when asked to interpret the CT scan. However, it was able to adjust its response and accurately identify the potential medical condition depicted in the image after receiving a physician's visual hint: an arrow pointing to the part of the CT scan that physicians wanted GPT-4V to analyze.
By comparing the GPT-4V responses with the case report, we also found that GPT-4V generally offered responses that were clear and coherent. When asked about differential diagnosis, GPT-4V listed 3 diseases (Primary Aldosteronism, Hypertension, and Cushing's Syndrome) and explained why each should be considered; a medical doctor deemed these explanations relevant. Following a query about the subsequent steps to ascertain the origin of the anomaly, GPT-4V recommended a PET-CT scan. Using the patient's PET-CT scan, it was able to locate a tumor in the mediastinum, lending credence to the suspicion of Cushing's Syndrome. Finally, GPT-4V asked for further studies, such as a biopsy of the mass, to confirm the diagnosis.

Discussion
5][36] GPT-4 was recently introduced as the upgraded model for ChatGPT. Studies showed that GPT-4 outperforms ChatGPT in various medical tasks. 31,37 These collective insights illustrated the power of ChatGPT and GPT-4 in medical exam answering and their potential for medical decision support. However, previous evaluations were limited to questions without images.
In this study on the evaluation of medical exam questions with images, we found that GPT-4V selected more correct choices than ChatGPT and GPT-4, as shown in Table 1. Hence, when evaluating all questions in the USMLE sample exam, GPT-4V achieved an accuracy of 90.7%, outperforming ChatGPT (58.5%) and GPT-4 (83.8%). The passing standard for the USMLE is typically set at about 60%, indicating that GPT-4V performed at a level similar to a medical graduate in the final year of study. The accuracy of GPT-4V not only highlights its grasp of biomedical and clinical sciences, essential for medical practice, but also showcases its ability in patient management and problem-solving, 38 both of which indicate potential for clinical routines such as summarizing radiology reports 39 and differential diagnosis. 40,41 We further explored GPT-4V's potential in CDSS through a consultation conversation. With follow-up queries and visual hints from physicians, GPT-4V showed promise in CDSS applications, including CT scan interpretation, differential diagnosis, and follow-up exam recommendation. However, the quality of GPT-4V's explanations to support its clinical decision making remained an open question, which leads to the next part of our analysis.
In terms of explanation quality, we found that more than 80% of responses from GPT-4V provided an interpretation of the image and question in support of its answer selection, regardless of correctness. This suggested that GPT-4V consistently took into account both the image and question elements while generating responses. Figure 1 illustrates an example of a high-quality explanation that utilizes both text and image in answering a hard question. More than 70% of students answered incorrectly on the first try, because both bacterial pneumonia and pulmonary embolism may involve symptoms such as cough. To differentiate them, GPT-4V correctly interpreted the X-ray as showing the radiologic sign of Hampton hump, which further increased the suspicion of pulmonary infarction rather than pneumonia. 42 To show the need for the X-ray as mentioned in the explanation, we removed the image from the input, and GPT-4V switched the answer to bacterial pneumonia while also acknowledging the possibility of pulmonary infarction. This change in response demonstrated the high quality of the GPT-4V explanation, as its explanation about the X-ray was not fictional and it truly needed the X-ray to answer this question.
The high quality of GPT-4V explanations was also supported by experts' preference voting. When comparing explanations generated by experts with those generated by GPT-4V, experts' preference for the expert explanation over the GPT-4V explanation was minimal (n=4) when GPT-4V correctly answered the question (n=95). 4][45][46] Thus, GPT-4V could enhance the effectiveness and trustworthiness of CDSS by providing high-quality, expert-preferred explanations, encouraging broader adoption and more confident utilization among physicians.
We also found that the accuracy of GPT-4V was related to the comprehensiveness of the explanation. GPT-4V offered a comprehensive explanation in 79.0% of correct responses. In contrast, only 7.2% of the wrong answers had a comprehensive explanation. Thus, the absence of key information may be a cause of inaccurate answers. These observations suggested that model performance could be enhanced by training GPT-4V with more intricate, clinically specific data carrying insights from experienced physicians, such as UpToDate, 47 and with medical research literature such as PubMed. 26
Image misunderstanding was the primary reason why GPT-4V answered incorrectly. Out of 55 wrong responses, 42 (76.3%) were due to misunderstanding of the image. In comparison, only 10 (18.2%) of the mistakes were attributed to text misinterpretation. GPT-4V's proficiency in processing images lagged considerably behind its text-handling capability. To circumvent its image interpretation issue, we additionally prompted GPT-4V with a short hint that described the image. We found that 40.5% (17 out of 42) of responses switched to the correct answer. Corrections from the hint indicated that GPT-4V could be easily persuaded. Within a conversational interface, medical professionals can readily guide and refine GPT-4V's initial outputs. This adaptability could be advantageous for physicians, as it allows for real-time adjustments and ensures that the generated information aligns more closely with the clinical context or the specific details of a patient's case. 47 With customized hints from physicians, GPT-4V becomes more useful and reliable as an auxiliary tool.
Another drawback of GPT-4V involved its tendency to produce factually inaccurate responses, a problem often referred to as the hallucination effect, which is prevalent among many large language models. 48 We found that more than 18% of GPT-4V explanations for incorrectly answered questions contained hallucinations. Thus, when designing clinical support tools for high-risk situations such as patient diagnosis, it is crucial to integrate GPT-4V with a probabilistic model that reports a confidence interval indicating the reliability of the response. 49 This would enhance the reliability of the CDSS response when additional physician review is warranted. 16

Limitations
This study has several limitations. First, our findings are constrained in their applicability due to the modest sample size. We gathered 226 questions that included images, which might not comprehensively represent all medical disciplines. Second, while GPT-4V has demonstrated proficiency on medical licensing examinations, its real-world applicability, especially in dynamic, user-interactive scenarios, remains untested. Therefore, while the results are promising, extrapolating the efficacy of GPT-4V to broader clinical applications requires appropriate benchmarks and further research. 20

Conclusion
While GPT-4V showcased remarkable accuracy across a spectrum of medical disciplines and varying difficulty levels in this study, further refinement is essential, particularly in enhancing its explanatory capabilities, prior to any clinical adoption. Medical students and professionals must be acutely aware of its limitations and consistently cross-verify with authoritative sources. Notably, state-of-the-art LLMs like GPT-4V, even with their sophisticated capabilities, are not on the brink of supplanting physicians. Their performance in specialized examinations, though noteworthy, still exhibits imperfections, which can lead to consequential inaccuracies and uncertainties. Coupled with the well-known ethical concerns, the credibility and readiness of LLMs for clinical settings remain under scrutiny. However, the preliminary results are promising, suggesting that the present technology is poised to influence clinical practices. As research and development persist, we anticipate a more extensive and profound integration of AI in the medical domain.

Question: A 55-year-old woman comes to the physician because of a 4-day history of chest pain and coughing up blood. The chest discomfort is characterized as piercing and aggravated by coughing. She encountered symptoms of a sore throat and nasal congestion 11 days prior. She was diagnosed with multiple sclerosis 15 years ago and utilizes a wheelchair for movement. A 40-year history of smoking a pack of cigarettes daily is noted, but she abstains from alcohol. Her current prescriptions include ocrelizumab and dantrolene. Temperature is 37.9°C (100.2°F), pulse is 105/min, and blood pressure is 110/60 mm Hg. Lung auscultation reveals occasional inspiratory crackles in the right lower lobe, while the cardiac exam shows no abnormalities. Neurologic examination shows stiffness and decreased sensation of the lower extremities; there is diffuse hyperreflexia. An x-ray of the chest is provided. Which of the following is the most likely diagnosis? Think step by step. First, analyze each option one by one, and then give your final prediction and reasons to be sure we have the right answer.

GPT Response Below:
Let's go through the information provided and evaluate each option based on the clinical picture and the chest x-ray:
**Clinical Information**:
- 55-year-old woman
- Smoked for 40 years (20 pack-years)
- Recent upper respiratory tract symptoms
- 3-day history of chest pain and cough with blood-tinged sputum
- Chest pain is sharp, stabbing, and worsened with coughing
- History of multiple sclerosis and uses a wheelchair
- On ocrelizumab and dantrolene
- Vital signs show mild fever and elevated pulse
- Auscultation shows crackles in the right lower lung
**Chest X-ray**: Hampton hump is a radiologic sign which consists of a shallow wedge-shaped opacity in the periphery of the lung.
**Options**:
A: **Sarcoidosis** - Sarcoidosis often presents with bilateral hilar lymphadenopathy on chest x-ray. The given image doesn't seem to show this. Additionally, the acute symptoms don't align well with a diagnosis of sarcoidosis.
B: **Pericarditis** - This would present with chest pain that's positional and relieved by sitting forward. The x-ray provided doesn't suggest pericardial effusion. Unlikely.
C: **Bacterial pneumonia** - Given her symptoms and the crackles in the right lower lung, this is a possibility. However, the chest x-ray would typically show a lobar consolidation or infiltrate which isn't clearly seen here.
D: **Pulmonary embolism** - Given her history of immobility (uses a wheelchair) and the acute onset of sharp, stabbing chest pain exacerbated by coughing, this is a strong contender. The presence of a Hampton Hump should further increase suspicion of a PE.
E: **Pneumothorax** - This would present with sudden onset chest pain and shortness of breath. On x-ray, you'd expect to see a clear line indicating the collapsed lung and no lung markings beyond that line. This is not evident in the provided image.
**Conclusion**: Based on the provided information and the chest x-ray, the most likely diagnosis is: **D: Pulmonary embolism**. The reason for this choice is the combination of her risk factors (immobility, smoking), her clinical presentation (sharp, stabbing chest pain exacerbated by coughing, blood-tinged sputum), and the fact that the x-ray doesn't clearly support the other diagnoses.

Table 3. Healthcare professionals' preferred explanations for 150 AMBOSS questions.

Table 4. Qualitative analysis of GPT-4V response quality for AMBOSS questions. For each exam question, healthcare professionals evaluated GPT-4V explanations against the following criteria.