Abstract
Purpose To evaluate the performance and consistency of large language models (LLMs) across brand and generic oncology drug names in various clinical tasks, addressing concerns about potential fluctuations in LLM performance due to subtle phrasing differences that could impact patient care.
Methods This study evaluated three LLMs (GPT-3.5-turbo-0125, GPT-4-turbo, and GPT-4o) using drug names from the HemOnc ontology. The assessment included 367 generic-to-brand and 2,516 brand-to-generic pairs, 1,000 synthetic patient cases for drug-drug interaction (DDI) detection, and 2,438 immune-related adverse event (irAE) cases. LLMs were tested on drug name recognition, word association, DDI detection, and irAE diagnosis using both brand and generic drug names.
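The brand-versus-generic comparison described in the Methods can be illustrated with a minimal sketch. It is not the study's pipeline: the two-entry name map, the `swap_names` and `consistency_rate` helpers, and the `model` callable are all hypothetical stand-ins. The idea is to pose the same prompt twice, once as written and once with brand names swapped for generics, and count how often the answer is unchanged.

```python
import re
from typing import Callable, Dict, List

# Hypothetical brand -> generic map for illustration only; the study draws
# its name pairs from the HemOnc ontology.
BRAND_TO_GENERIC: Dict[str, str] = {
    "Keytruda": "pembrolizumab",
    "Opdivo": "nivolumab",
}


def swap_names(text: str, mapping: Dict[str, str]) -> str:
    """Replace each whole-word key of `mapping` in `text` (case-insensitive)."""
    if not mapping:
        return text
    # Longest names first so a longer name is never shadowed by a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


def consistency_rate(
    prompts: List[str],
    mapping: Dict[str, str],
    model: Callable[[str], str],
) -> float:
    """Fraction of prompts whose answer is identical before and after the swap.

    `model` is any callable mapping a prompt to an answer string; in practice
    it would wrap an LLM API call (an assumption, not shown here).
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    unchanged = sum(
        model(p) == model(swap_names(p, mapping)) for p in prompts
    )
    return unchanged / len(prompts)
```

A rate of 1.0 would mean the model's answers never depend on which name is used; lower values indicate sensitivity to nomenclature alone.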
Results LLMs demonstrated high accuracy in matching brand and generic names (GPT-4o: 97.38% for brand, 94.71% for generic, p < 0.0001). However, they showed significant inconsistencies in word association tasks. GPT-3.5-turbo-0125 exhibited biases favoring brand names for effectiveness (OR 1.43, p < 0.05) and being side-effect-free (OR 1.76, p < 0.05). DDI detection accuracy was poor across all models (<26%), with no significant differences between brand and generic names. Sentiment analysis revealed significant differences, particularly in GPT-3.5-turbo-0125 (brand mean 0.6703, generic mean 0.9482, p < 0.0001). Consistency in irAE diagnosis varied across models.
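For readers unfamiliar with the odds ratios quoted above, the following sketch shows one common way to compute an OR with a Wald 95% confidence interval from a 2x2 table. The abstract does not state how the study estimated its ORs, so this is an assumption, and the counts in the usage note are invented for illustration rather than taken from the study.

```python
import math
from typing import Tuple


def odds_ratio(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Odds ratio and Wald 95% CI for a 2x2 table.

    Hypothetical layout (an assumption, not the study's design):
      a = brand name associated with the term
      b = brand name not associated with the term
      c = generic name associated with the term
      d = generic name not associated with the term
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cell counts must be positive")
    or_value = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - 1.96 * se_log)
    upper = math.exp(math.log(or_value) + 1.96 * se_log)
    return or_value, lower, upper
```

With made-up counts `odds_ratio(60, 40, 50, 50)`, the point estimate is (60 × 50) / (40 × 50) = 1.5. An OR above 1 would indicate that the term is attached to brand names more often than to generics.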
Conclusions and Relevance Despite high proficiency in name-matching, LLMs exhibit inconsistencies when processing brand versus generic drug names in more complex tasks. These findings highlight the need for increased awareness, improved robustness assessment methods, and the development of more consistent systems for handling nomenclature variations in clinical applications of LLMs.
Key objective This study aimed to assess the consistency of large language models (LLMs) in handling brand and generic oncology drug names across various tasks, including drug-drug interaction detection and adverse event identification.
Knowledge generated LLMs demonstrated high accuracy in matching brand and generic names but showed significant inconsistencies in more complex tasks. Notably, models exhibited significant differences in attributing brand versus generic names to positive terms and sentiment.
Competing Interest Statement
JG, SC, and JW are funded by the National Institutes of Health through NIH-USA R01CA294033-01. DSB declares funding from the NIH (NIH-USA U54CA274516-01A1 and R01CA294033-01) and the ASTRO-ACS Clinician Scientist Development Grant ASTRO-CSDG-24-1244514. HJLW declares funding from the NIH (U24 CA265879).
Funding Statement
JG, SC, and JW are funded by the National Institutes of Health through NIH-USA R01CA294033-01. DSB declares funding from the NIH (NIH-USA U54CA274516-01A1 and R01CA294033-01) and the ASTRO-ACS Clinician Scientist Development Grant ASTRO-CSDG-24-1244514. HJLW declares funding from the NIH (U24 CA265879).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The full code and dataset are available in our public repository at BittermanLab/OncoRABBITS.