Abstract
The United States Medical Licensing Examination (USMLE) is a critical step in assessing the competence of future physicians, yet the process of creating exam questions and study materials is both time-consuming and costly. While Large Language Models (LLMs), such as OpenAI’s GPT-4, have demonstrated proficiency in answering medical exam questions, their potential in generating such questions remains underexplored. This study presents QUEST-AI, a novel system that utilizes LLMs to (1) generate USMLE-style questions, (2) identify and flag incorrect questions, and (3) correct errors in the flagged questions. We evaluated this system’s output by constructing a test set of 50 LLM-generated questions mixed with 50 human-generated questions and conducting a two-part assessment with three physicians and two medical students. The assessors attempted to distinguish between LLM-generated and human-generated questions and evaluated the validity of the LLM-generated content. A majority of exam questions generated by QUEST-AI were deemed valid by a panel of three clinicians, with strong correlations between performance on LLM-generated and human-generated questions. This pioneering application of LLMs in medical education could significantly increase the ease and efficiency of developing USMLE-style medical exam content, offering a cost-effective and accessible alternative for exam preparation.
1. Introduction
Every year, over 100,000 medical students take the United States Medical Licensing Examination (USMLE), administered by the National Board of Medical Examiners (NBME).1 This rigorous examination is crucial for ensuring the competence of future physicians. However, generating the exam questions and related preparation materials is a manual process that is both time-consuming and costly. On average, each student spends over $4,000 on USMLE-related study materials.2 The cost and effort of producing these materials are the primary drivers of their price, which presents a clear opportunity for technological intervention.
The adoption of Artificial Intelligence (AI) in healthcare is rapidly increasing, driven by advancements in Generative AI, especially Large Language Models (LLMs) such as OpenAI’s GPT-4.3,4,5 LLMs have been explored for various use cases in medicine, including generating clinical notes, summarizing patient records, and providing decision support.6,7,8 Numerous studies have demonstrated the proficiency of these models in answering USMLE questions, with models achieving over 80% accuracy on the USMLE Step 2 Clinical Knowledge (CK) exam.9 Despite their success in answering exam questions, there is limited research on the use of LLMs for generating medical exam questions, particularly for the USMLE. To address this gap, we introduce QUEST-AI, an autonomous system that (1) generates USMLE-style questions, (2) verifies the system-generated questions, and (3) refines any questions identified as incorrect, all powered by LLMs. The system is evaluated with the assistance of physicians and medical students.
We began by prompting GPT-4 to generate 50 questions inspired by sample questions from the USMLE Step 2 Clinical Knowledge (CK) exam. Then, we used aggregated predictions from an ensemble of diverse LLMs to flag incorrect questions. Finally, we prompted GPT-4 again to correct the flawed questions. In order to evaluate the quality of questions generated using our approach, we constructed a test set containing our 50 system-generated questions randomly interspersed with 50 human-generated sample questions. Three physicians and two medical students engaged in a twofold assessment: (1) they attempted to distinguish between the system-generated and human-generated USMLE-style questions, and (2) they assessed the validity of the system-generated questions and answers.
To our knowledge, ours is the first study to generate, verify, and refine USMLE-style questions using LLMs (Figure 1). This shift from answering questions to generating questions represents a novel application of AI in medical education, with the potential to revolutionize exam content development.
2. Related Work
The use of Large Language Models (LLMs) in healthcare and education has seen considerable growth and innovation.
2.1. LLMs in Healthcare
LLMs have become prominent in healthcare due to their advanced natural language processing capabilities. These models excel in handling vast datasets and generating contextually accurate text through significant technical advancements.10 A systematic review by Bedi et al. highlights the various healthcare applications of LLMs.11 Studies have demonstrated their utility in tasks such as diagnosis12, medical report generation13, treatment recommendations14, and clinical referrals.14,15 For instance, Chung et al. assessed the feasibility and acceptability of ChatGPT-generated radiology report summaries for cancer patients.16 Additionally, Fraser et al. studied the diagnostic accuracy of LLMs, providing further evidence of their potential in clinical settings.17
2.2. LLMs in Education
LLMs have the potential to revolutionize education by providing real-time support, correcting errors, and offering explanations or hints, thereby enhancing student engagement and learning efficiency.18 Unlike traditional algorithms, LLMs generate flexible, context-aware responses, making them effective tools for study assistance.19 They excel in solving questions across diverse subjects such as math20, law21, and medicine4, utilizing advanced techniques like Chain-of-Thought prompting and few-shot demonstration-selection algorithms to enhance problem-solving performance.
In addition to question solving, LLMs are effective in error correction, providing instant feedback that aids early-stage learning.22 They also function as confusion helpers, offering pedagogical guidance and hints to help students solve problems independently, fostering deeper understanding and self-sufficiency.23
Recent research has applied LLMs for automatic question generation, including Laverghetta Jr. and Licato’s work on cognitive assessments.24 Using GPT-3, they created Natural Language Inference (NLI) dataset items with a prompting strategy that selected items with the best and worst properties, resulting in improved psychometric qualities. Another study by Tran et al. explored using GPT-3 and GPT-4 to generate high-quality multiple-choice questions (MCQs) for computing courses. They found that GPT-4 successfully generated correct answers for 78.5% of MCQs, highlighting LLMs’ potential to efficiently craft personalized educational content and improve assessment quality in real-time.25
2.3. LLMs in Medical Education
LLMs have shown some promise in generating multiple-choice questions (MCQs) for medical exams, a task traditionally requiring extensive medical knowledge and effort from educators. A systematic review by Artsi et al. identified 8 studies that used LLMs such as GPT-3.5 and GPT-4 for this purpose.26 They found that while LLMs can produce valid MCQs across various medical disciplines (including neurosurgery, general medicine, physiology, dermatology, internal medicine, surgery, anatomy, and biochemistry), challenges remain, such as inaccuracies and low complexity of the generated questions.
In one study, the authors compared LLM-generated questions to those written by humans, finding human questions superior in quality and validity despite the faster generation time by AI.27 In another, the authors used ChatGPT 3.5 for dermatology board exams and found only 40% of the generated questions applicable.28 Sevgi et al. and Han et al. generated neurosurgery and general MCQs, respectively, but did not independently evaluate their validity.29,30
One study specifically focused on generating USMLE Step 1 questions and concluded that ChatGPT can provide assistance in USMLE exams.31 However, this study lacked a detailed evaluation of the generated content. The capability to generate USMLE-style questions can save time and resources, support the growing demand for medical professionals, and keep up with evolving medical knowledge. However, the accuracy and quality of system-generated content must be rigorously evaluated to ensure it meets the required standards for medical examinations.
To address this gap, our study evaluates GPT-4’s ability to generate USMLE Step 2 CK-style exam questions. We provide insights into the practical applications of AI in medical education and its potential to enhance the accessibility and quality of exam preparation materials. By presenting a fully autonomous system for generating, verifying, and refining USMLE-style questions, we aim to demonstrate the capacity of LLMs to generate high-quality exam content, thereby improving the development and accessibility of medical education resources.
3. Methods
3.1. Data collection and generation
We randomly selected a set of 50 human-generated questions from a bank of 120 publicly available USMLE Step 2 CK test sample questions, ensuring that these questions did not include associated images or abstracts.32 This was done to maintain a controlled and uniform format for comparison purposes.
For system-generated questions, we employed a prompt chaining approach with GPT-4 as shown in Figure 2. We started with a human-generated USMLE CK test question-answer pair, which was included in the initial prompt to GPT-4. The model then generated an explanation of why the given answer was correct and the others were incorrect. This original question, along with the system-generated explanation, was used in a follow-up prompt instructing GPT-4 to generate another USMLE Step 2 CK-style question in a similar format. This method was intended to keep the generated questions close to the format, style, and complexity of the human-generated ones, promoting consistency and reducing deviations from the desired standards.
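The two-step prompt chain described above can be sketched as follows. This is a minimal illustration, not the prompts or code used in the study: the template wording, the function name `generate_question`, and the `llm` callable (any function that takes a prompt string and returns the model's text) are all assumptions introduced here.

```python
from typing import Callable

# Hypothetical prompt templates; the study's exact prompts are not reproduced here.
EXPLAIN_TEMPLATE = (
    "Here is a USMLE Step 2 CK question and its correct answer.\n\n"
    "{question}\n\nCorrect answer: {answer}\n\n"
    "Explain why this answer is correct and why the other options are incorrect."
)
GENERATE_TEMPLATE = (
    "Here is a USMLE Step 2 CK question, its correct answer, and an explanation.\n\n"
    "{question}\n\nCorrect answer: {answer}\n\nExplanation: {explanation}\n\n"
    "Write another USMLE Step 2 CK-style question in a similar format, "
    "and state the correct answer."
)


def generate_question(question: str, answer: str, llm: Callable[[str], str]) -> str:
    """Two-step prompt chain: explain the seed item, then generate a new one."""
    # Step 1: ask the model to explain the seed question-answer pair.
    explanation = llm(EXPLAIN_TEMPLATE.format(question=question, answer=answer))
    # Step 2: feed the seed item plus that explanation into the generation prompt.
    return llm(
        GENERATE_TEMPLATE.format(
            question=question, answer=answer, explanation=explanation
        )
    )
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; in practice `llm` would wrap a call to GPT-4.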
After generating a set of system-generated questions, we compiled these alongside the human-generated ones and randomly shuffled them to create a comprehensive 100-question set. This randomization was crucial to ensure an unbiased evaluation.
3.2. Evaluation by Physicians
A group of three licensed, practicing physicians and two medical students were tasked with evaluating the 100-question set. They were instructed to:
(1) Choose the single best answer to each question without consulting any external reference.
(2) Guess whether each question was generated by humans or GPT-4.
In a separate task, three physicians reviewed the 50 system-generated exam questions to evaluate their correctness, using any available external references. They recorded the type of errors found in the system-generated questions and the time taken to make their determinations. The two phases of the study, marked by the different tasks performed by the medical specialists, are illustrated in Figure 3.
3.3. Evaluation by LLMs
An ensemble of five LLMs from the Hugging Face hub33 (a public repository of models) was selected for evaluation based on the models’ performance in public open LLM leaderboards34 and community support: Meta-Llama-3-70B-Instruct from Meta, Mixtral-8×22B-Instruct-v0.1 from Mistral AI, Qwen2-72B-Instruct from Alibaba, Phi-3-medium-4k-Instruct from Microsoft, and llama-2-70b-chat from Meta. These selected models were presented with system-generated question-answer pairs and asked to identify the best answer. Using the models’ responses, we constructed a simple classifier to discriminate between valid and invalid system-generated questions based on the proportion of LLMs within the model ensemble that agreed with the answer selected by GPT-4 as being the best answer. We hypothesized that system-generated question-answer pairs for which one or more of the LLMs disagreed with the answer marked by GPT-4 as correct were more likely to be flawed. Conversely, if there was unanimous consensus among the LLMs on the answer deemed correct by GPT-4, then the question-answer pair was less likely to be flawed.
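The agreement-based classifier described above can be sketched in a few lines. The function names and the `min_agreement` parameter are illustrative assumptions; the only behavior taken from the text is that the score is the proportion of ensemble models agreeing with GPT-4's answer key, and that any disagreement is treated as a sign of a possibly flawed item.

```python
from typing import Sequence


def agreement_fraction(ensemble_answers: Sequence[str], gpt4_answer: str) -> float:
    """Fraction of ensemble models whose chosen option matches GPT-4's answer key."""
    if not ensemble_answers:
        raise ValueError("need at least one ensemble answer")
    key = gpt4_answer.strip().upper()
    matches = sum(a.strip().upper() == key for a in ensemble_answers)
    return matches / len(ensemble_answers)


def is_flagged(
    ensemble_answers: Sequence[str], gpt4_answer: str, min_agreement: float = 1.0
) -> bool:
    """Flag an item when agreement falls below the threshold.

    With the default of 1.0, a single dissenting model is enough to flag the item.
    """
    return agreement_fraction(ensemble_answers, gpt4_answer) < min_agreement


# Four of five models agree with the key "A": agreement is 0.8, so the item is flagged.
print(is_flagged(["A", "A", "A", "A", "B"], "A"))
# Unanimous agreement: the item is not flagged.
print(is_flagged(["C", "C", "C", "C", "C"], "C"))
```

Lowering `min_agreement` trades recall for precision, which is the threshold sweep that an AUROC summarizes.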
3.4. Categorization and Post-Hoc Editing by GPT-4
GPT-4 was prompted to categorize each question-answer pair in the 100-item set into one of 18 categories outlined in the USMLE content outline.35 This categorization aimed to evaluate whether the system-generated questions fell into the same categories as the original human-generated questions used in the prompts. A physician then reviewed these category assignments for each question to verify their validity. We caveat the evaluation of category assignment validity by noting that the ground-truth category for each question is not made publicly available by the NBME, and the team of physicians and medical students that imputed item categories has no affiliation with the NBME or USMLE.
For the system-generated questions deemed incorrect by the ensemble of LLMs and physicians, we conducted a post-hoc editing stage. During this stage, we asked GPT-4 to first identify why a specific question was wrong and then modify or correct the mistakes present in the flagged questions. This was done to assess GPT-4’s capability to improve its responses based on iterative feedback and refinement. A physician manually evaluated each corrected exam item to assess whether the corrections led to a valid result.
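A rough sketch of this diagnose-then-correct step is shown below. The prompt wording, the function name `diagnose_and_correct`, and the `llm` callable are hypothetical; the three error labels are the ones the physician reviewers used, and offering them to the model as a fixed list is an assumption rather than a documented detail of the study.

```python
from typing import Callable, Tuple

# Error labels used by the physician reviewers; supplying them to the model
# as a closed list is an assumption made for this sketch.
ERROR_TYPES = (
    "Multiple correct answer choices",
    "AI-chosen answer is incorrect",
    "No correct answer choice",
)


def diagnose_and_correct(item: str, llm: Callable[[str], str]) -> Tuple[str, str]:
    """Ask the model why a flagged item is flawed, then ask it for a corrected item."""
    # Step 1: have the model name the most likely reason for the flaw.
    diagnosis = llm(
        "The following exam item has been flagged as flawed.\n\n"
        + item
        + "\n\nChoose the reason that best explains the flaw from this list: "
        + "; ".join(ERROR_TYPES)
        + "."
    )
    # Step 2: pass the item and the diagnosis back and request a corrected version.
    corrected = llm(
        "The following exam item is flawed.\n\n"
        + item
        + "\n\nIdentified problem: "
        + diagnosis
        + "\n\nRewrite the item so that it has exactly one best answer, "
        "and state that answer."
    )
    return diagnosis, corrected
```

The returned diagnosis can be compared against the reviewers' labels, and the corrected item still needs manual review before use.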
The data collection and analysis were conducted in multiple phases from April 2023 to December 2023, and an updated evaluation analysis was completed in July 2024. This evaluation process aimed to provide insights into the practical applications of AI in generating high-quality exam questions and to assess the potential of GPT-4 in contributing to medical education. Data and code for reproducing our results can be found at https://github.com/som-shahlab/gpt4usmle.
4. Results
4.2. Evaluation by Physicians
The average accuracy of the three physicians and two medical students in distinguishing between human-generated and GPT-4-generated questions was 51.8% (95% CI: 46.1% to 57.5%), nearly equivalent to random guessing. This suggests that there were no identifiable systematic differences between the system-generated and human-generated questions. Furthermore, the Pearson correlation between participants’ performance on system-generated and human-generated questions was statistically significant (ρ = 0.952, P-value = 0.012). This strong correlation indicates that system-generated questions are comparable to human-generated questions in their ability to assess the clinical knowledge and skills examined in USMLE-style questions, providing a reliable and consistent measure of performance across both question types.
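As a sanity check on the reported significance, the p-value for a Pearson correlation can be recomputed from ρ alone. The sketch below assumes the correlation was computed over the five assessors as paired observations (n = 5, so 3 degrees of freedom), which the text implies but does not state outright. It uses the standard t-transform of a correlation coefficient and the closed-form CDF of Student's t distribution with 3 degrees of freedom, so only the standard library is needed.

```python
import math
from typing import Tuple


def pearson_p_value_df3(r: float) -> Tuple[float, float]:
    """Return (t statistic, two-sided p-value) for Pearson r with n = 5 pairs."""
    df = 3  # n - 2 with n = 5 assessors (an assumption about the analysis)
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    # Closed-form CDF of Student's t with 3 degrees of freedom:
    # F(t) = 1/2 + (1/pi) * [x / (1 + x^2) + atan(x)], where x = t / sqrt(3).
    x = abs(t) / math.sqrt(3.0)
    cdf = 0.5 + (x / (1.0 + x * x) + math.atan(x)) / math.pi
    return t, 2.0 * (1.0 - cdf)


t_stat, p = pearson_p_value_df3(0.952)
print(f"t = {t_stat:.2f}, two-sided p = {p:.4f}")
```

Under these assumptions the result is t of roughly 5.39 and p of roughly 0.0125, in line with the reported P-value of 0.012 once rounding of ρ is taken into account.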
On a separate task where three physician reviewers were asked to validate the 50 AI generated questions, 32 (64%) questions were deemed “correct” by all reviewers, while 18 (36%) were deemed “incorrect” by at least one reviewer. The reasons for labeling exam items as “incorrect” included “Multiple correct answer choices” (n=9), “AI-chosen answer is incorrect” (n=6), and “No correct answer choice” (n=3). These findings highlight specific areas where the system-generated questions fell short and suggest areas for further refinement in the AI’s question generation capabilities.
Reviewers spent, on average, 3.21 minutes (95% CI 2.73 to 3.69) reviewing each system-generated exam item for correctness. This quick evaluation time highlights a significant potential efficiency advantage, as it is substantially faster than drafting a question from scratch, which typically involves extensive research, drafting, and revision.
4.3. Evaluation by LLMs
All LLMs within our LLM ensemble achieved adequate performance on the human-generated USMLE-style exam questions (see Table 1). Our proposed LLM ensemble classifier was able to discriminate between valid and invalid system-generated questions with an Area under the Receiver Operating Characteristic curve (AUROC) of 0.79. We considered an item to be classified by the model as “flawed” if any one of the 5 LLMs in the ensemble disagreed with GPT-4 on the best answer choice. Of the 18 system-generated question-answer pairs deemed flawed by clinician reviewers, our approach correctly flagged 15 (Recall = 15/18 = 0.83). Overall, our approach flagged 25 system-generated question-answer pairs as flawed (Precision = 15/25 = 0.60). Of the 25 system-generated questions not flagged by our approach, 22 were deemed valid by clinicians. See Table 2.
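The counts in this paragraph determine a full 2×2 confusion matrix: 15 flawed items flagged, 3 flawed items missed, 10 valid items flagged (25 flagged minus 15), and 22 valid items not flagged. The short sketch below recomputes the reported recall and precision from those counts and, as an additional derived illustration, the negative predictive value and specificity; the function name is hypothetical.

```python
from typing import Dict


def flagging_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics, treating 'flawed' as the positive class."""
    return {
        "recall": tp / (tp + fn),        # flawed items that were flagged
        "precision": tp / (tp + fp),     # flagged items that were truly flawed
        "npv": tn / (tn + fn),           # unflagged items that were truly valid
        "specificity": tn / (tn + fp),   # valid items that were left unflagged
    }


m = flagging_metrics(tp=15, fp=10, fn=3, tn=22)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

The negative predictive value of 22/25 = 0.88 is the figure most relevant to autonomous use, since it describes how often an unflagged item is actually valid.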
4.4. Categorization and Post-Hoc Editing by GPT-4
For the categories assigned to each question by GPT-4, 8 questions were assigned invalid content category labels, while the remaining 92 questions were assigned appropriate labels. This outcome shows that GPT-4 generally performed well in classifying question categories, although it occasionally struggled to differentiate between Behavioral Health and Social Sciences. This challenge might be addressed by clarifying that Behavioral Health pertains to psychiatry and mental health topics, whereas Social Sciences covers medical ethics, interpersonal health, and health system quality improvement.
Additionally, 16 out of the 50 questions matched the category of their corresponding sample question. This suggests that GPT-4 introduces a degree of variability and diversity in its generated questions. Rather than merely replicating existing content, GPT-4 demonstrates the ability to create new and varied material. A breakdown of categories can be seen in the Supplementary section (eTable 1).
For post-hoc editing, the questions deemed incorrect by at least one reviewer were passed through GPT-4. The model was asked to classify why a question-answer pair was incorrect and then to provide a corrected version. For 9 out of 18 questions (50%), GPT-4 identified the same reason for incorrectness as the physician reviewers. For 11 of these 18 questions (61%), GPT-4 was able to correct its original mistake, resulting in a valid exam item. This demonstrates that GPT-4 can not only generate questions but also, in many cases, diagnose issues with them and offer corrections.
5. Conclusion
With ever-increasing costs of medical education, medical student debt, and a looming physician shortage37, there is an urgent need for cost-effective and easily accessible medical exam preparation resources. We designed QUEST-AI, a first-of-its-kind system that can improve access to high-quality USMLE-style questions by using LLMs to generate candidate exam questions, flag invalid candidate items, and correct flawed exam items. While performance of the system is not perfect, clinician evaluation suggests that (1) a majority (64%) of exam items generated using our approach are valid; (2) candidate performance on items generated using our approach correlates strongly with performance on human-generated USMLE-style questions; and (3) our system can be used to generate exam items across a variety of content categories. This offers a promising solution for decreasing the cost and time required to generate USMLE-style questions. This in turn could reduce both the costs for exam preparation materials that debt-burdened medical students face and the costs for generating new exam items that non-profit organizations like the National Board of Medical Examiners face.
6. Limitations
There are several important limitations to our system to consider when assessing whether it can be used in medical education.
First, with respect to our evaluation, the medical specialists who attempted to select the best answer on the evaluation set of 50 system-generated and 50 human-generated questions were not MD students (the primary audience that would benefit from such a system); they were practicing MDs who had already passed the USMLE Step 2 CK exam and DO students who would take a different but similar exam as part of their training. This was by design: we wanted to ensure that no assessor would recognize the exam items in the publicly available NBME-provided USMLE-style practice exam. Otherwise, their ability to distinguish between human- and system-generated questions would be overly optimistic. Additional study is needed to understand whether our results translate to the primary population of interest, namely MD students preparing to take the USMLE Step 2 CK exam.
Second, the clinicians who determined whether or not the system-generated exam items were valid were not expert exam writers nor were they affiliated with the NBME. It is quite possible that system-generated exam items deemed valid by our panel of clinicians would be considered invalid by NBME-employed expert exam writers, and vice versa.
Third, there was no threshold at which our LLM ensemble-based flagging system was able to correctly recall all the system-generated exam items deemed invalid (unless we trivially flagged all the items as invalid). There were 3 of 18 items deemed invalid for which all 5 LLMs in the ensemble agreed with GPT-4’s best answer selection (thus the question was not flagged) but where at least one clinician deemed the overall exam item to be invalid. This suggests that, were this system to be used entirely autonomously, it could generate flawed exam items. This has important ethical implications that should be considered and potentially addressed with improved methods before releasing the tool to the broader public.
Data Availability
All data produced in the present study will be made available upon reasonable request to the authors.
7. Funding and Conflicts of Interest
This work is supported by the Mark and Debra Leslie endowment for AI in Healthcare; the Stanford University Department of Medicine; Stanford Healthcare; and the Stanford Medicine Program for AI in Healthcare. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding bodies. N.H.S. is a cofounder of Prealize Health and Atropos Health; reports funding from the Gordon and Betty Moore Foundation; and served on the board of the Coalition for Healthcare AI (CHAI). J.A.J. is the founder of Jindal Neurology, Inc. and paid per diem as a physician with Kaiser Permanente, South San Francisco, CA. The other authors declare no competing financial interests. No proprietary NBME data or information were used in the study.
Supplementary Material
Footnotes
Expanded experiments and more data