Abstract
Purpose To investigate the performance of large language models (LLMs) on numerical tasks in radiology and to perform a comprehensive error analysis of their outputs.
Materials and Methods We defined six tasks. Three were extraction tasks: (1) the minimum T-score from a DEXA report, (2) the maximum common bile duct (CBD) diameter from an ultrasound report, and (3) the maximum lung nodule size from a CT report. Three were judgement tasks: (1) whether a highly hypermetabolic region is present on a PET report, (2) whether a patient is osteoporotic based on a DEXA report, and (3) whether a patient has a dilated CBD based on an ultrasound report. Reports were extracted from the MIMIC-III database and our institution’s database, and the ground truths were extracted manually. The models used were Llama 3.1 8B, DeepSeek R1 distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis.
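As an illustration of how an extraction task and its paired judgement task relate, the following is a minimal rule-based sketch for the DEXA case. It is not the method used in this study, which queried LLMs. The regular expression, the function names, and the sample report are hypothetical, and the threshold of T-score ≤ -2.5 follows the common WHO convention, which is an assumption here rather than a definition taken from the study.

```python
import re

# Assumption: WHO convention, T-score <= -2.5 indicates osteoporosis.
# The study's own definition is not stated in the abstract.
OSTEOPOROSIS_T_SCORE = -2.5

# Hypothetical pattern for phrases such as "T-score: -2.7" or "T score of -1.9".
# It does not handle the Unicode minus sign or tabular report layouts.
T_SCORE_PATTERN = re.compile(
    r"T[- ]?score\s*(?:of|is|:|=)?\s*([+-]?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_min_t_score(report):
    """Extraction task: return the lowest T-score mentioned, or None if absent."""
    scores = [float(value) for value in T_SCORE_PATTERN.findall(report)]
    return min(scores) if scores else None


def is_osteoporotic(report):
    """Judgement task: compare the minimum T-score against the assumed threshold."""
    min_score = extract_min_t_score(report)
    if min_score is None:
        return None  # no T-score found, so no judgement is made
    return min_score <= OSTEOPOROSIS_T_SCORE


# Hypothetical report text, written for this example only.
sample_report = "Lumbar spine T-score: -2.7. Femoral neck T-score of -1.9."
min_t_score = extract_min_t_score(sample_report)
osteoporotic = is_osteoporotic(sample_report)
```

The judgement task adds one comparison on top of the extracted value, so an error in extraction carries over directly into the judgement.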
Results In extraction tasks, Llama showed relatively variable accuracy across tasks (ranging from 86% to 98.7%), whereas the other models performed consistently well (accuracies >95%). In judgement tasks, the lowest accuracies of Llama, DeepSeek, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively, and both o1-mini and GPT-5-mini reached 100% accuracy in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini and GPT-5-mini. Restricting the output to an answer-only format significantly reduced performance in Llama and DeepSeek but not in o1-mini or GPT-5-mini.
Conclusion True reasoning models perform consistently well in radiology numerical tasks and show no mathematical errors. Simpler models that are not true reasoning models may also achieve acceptable performance, depending on the task.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Ali Nowroozi, MD, was an NIH T32 postdoctoral fellow for most of this project, under award number 5T32HL007185-47.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Department of Radiology and Biomedical Imaging, University of California, San Francisco (UCSF), San Francisco, California - IRB approval number 17-22317
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present study are available upon reasonable request to the authors.





