medRxiv

Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study

Mingxin Liu, Tsuyoshi Okuhara, Zhehao Dai, Wenbo Huang, Hiroko Okada, Emi Furukawa, Takahiro Kiuchi
doi: https://doi.org/10.1101/2024.07.09.24310129
Mingxin Liu
1Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo
MA
  • For correspondence: liumingxin98{at}g.ecc.u-tokyo.ac.jp
Tsuyoshi Okuhara
2Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
PhD
Zhehao Dai
3Department of Cardiovascular Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
PhD
Wenbo Huang
4Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo, Tokyo, Japan
MS
Hiroko Okada
2Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
PhD
Emi Furukawa
2Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
PhD
Takahiro Kiuchi
2Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
PhD

Abstract

Purpose This study evaluates the accuracy of medical knowledge in the most advanced LLMs as of 2024 (GPT-4o, GPT-4, Gemini 1.5 Pro, and Claude 3 Opus). It is the first to evaluate these LLMs on a non-English medical licensing exam. The findings are intended to guide educators, policymakers, and technical experts in the effective use of AI in medical education and clinical diagnosis.

Method The authors entered 790 questions from the Japanese National Medical Examination into the chat windows of the LLMs to obtain responses. Two authors independently assessed the correctness of each response. The authors analyzed the overall accuracy rates of the LLMs and compared their performance on image and non-image questions, questions of varying difficulty, general and clinical questions, and questions from different medical specialties. They also examined the correlation between the number of publications and the LLMs’ performance in different medical specialties.
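The category-level comparison described in the Method can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the record layout, the model names `model_a` and `model_b`, and the helper `accuracy_by` are hypothetical, and the toy values are not data from the study.

```python
from collections import defaultdict

# Hypothetical graded records: (model, question_id, has_image, is_correct).
# These toy values are illustrative only and are NOT data from the study.
records = [
    ("model_a", 1, False, True),
    ("model_a", 2, True, False),
    ("model_a", 3, False, True),
    ("model_a", 4, True, True),
    ("model_b", 1, False, True),
    ("model_b", 2, True, False),
    ("model_b", 3, False, False),
    ("model_b", 4, True, False),
]


def accuracy_by(records, key):
    """Return {group: fraction correct}, grouping each record by key(record)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = key(rec)
        total[group] += 1
        correct[group] += int(rec[3])  # rec[3] is the is_correct flag
    return {group: correct[group] / total[group] for group in total}


# Overall accuracy per model.
overall = accuracy_by(records, key=lambda r: r[0])

# Accuracy per model, split into image and non-image questions.
by_image = accuracy_by(
    records, key=lambda r: (r[0], "image" if r[2] else "non-image")
)

for group, acc in sorted(by_image.items()):
    print(group, f"{acc:.1%}")
```

The same helper could group by difficulty level, question type, or specialty by changing the `key` function, assuming those fields were added to each record.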

Results GPT-4o achieved the highest accuracy rate, 89.2%, and outperformed the other LLMs both overall and in each specific category. All four LLMs performed better on non-image questions than on image questions, with a 10% accuracy gap. They also performed better on easy questions than on normal and difficult ones. GPT-4o achieved a 95.0% accuracy rate on easy questions, marking it as an effective knowledge source for medical education. All four LLMs performed worst in the “Gastroenterology and Hepatology” specialty. There was a positive correlation between the number of publications and LLM performance across specialties.
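The relationship between publication counts and per-specialty accuracy can be illustrated with a correlation coefficient. The abstract does not say which coefficient the authors used, so the Pearson coefficient below is an assumption. The function `pearson_r` and the toy numbers are hypothetical and are not results from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-specialty values (illustrative only, NOT study data):
# number of publications and a model's accuracy in each specialty.
publications = [100, 200, 300, 400]
accuracy = [0.70, 0.75, 0.85, 0.90]

r = pearson_r(publications, accuracy)
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive correlation
```

If the relationship were monotonic but not linear, a rank-based coefficient such as Spearman's would be the more natural choice. It can be computed by applying the same function to the ranks of each sequence.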

Conclusions GPT-4o achieved an overall accuracy rate close to 90%, with 95.0% on easy questions, significantly outperforming the other LLMs. This indicates GPT-4o’s potential as a knowledge source for easy questions. Image-based questions and question difficulty significantly impact LLM accuracy. “Gastroenterology and Hepatology” is the specialty with the lowest performance. The LLMs’ performance across medical specialties correlates positively with the number of related publications.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by JSPS KAKENHI Grant Number 24KJ0830.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present work are contained in the manuscript.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Posted July 09, 2024.
Subject Area

  • Medical Education