Abstract
Objective Fair and safe Large Language Models (LLMs) hold potential for clinical task-shifting, which, if done reliably, can benefit over-burdened healthcare systems, particularly in resource-limited settings and among traditionally overlooked populations. However, this powerful technology remains largely understudied in real-world contexts, particularly in the global South. This study aims to assess whether openly available LLMs can be used equitably and reliably for processing medical notes in real-world settings in South Asia.
Methods We used publicly available medical LLMs to parse clinical notes from a large electronic health records (EHR) database in Pakistan. ChatGPT, GatorTron, BioMegatron, BioBERT, and ClinicalBERT were tested for bias when applied to these data after fine-tuning them on a) the publicly available clinical datasets I2B2 and N2C2 for medical concept extraction (MCE) and emrQA for medical question answering (MQA), and b) the local EHR dataset. For MCE, models were applied to clinical notes in 3-label and 9-label formats; for MQA, they were applied to medical questions. Internal and external validation performance was measured for a) and b) using F1 score, precision, recall, and accuracy for MCE, and BLEU and ROUGE-L for MQA.
Results LLMs not fine-tuned on the local EHR dataset performed poorly when externally validated on it, suggesting bias. Fine-tuning the LLMs on the local EHR data improved model performance. Specifically, the 3-label precision, recall, F1 score, and accuracy for the dataset improved by 21-31%, 11-21%, 16-27%, and 6-10%, respectively, across GatorTron, BioMegatron, BioBERT, and ClinicalBERT. ChatGPT was an exception: it performed better on the local EHR dataset by 10% for precision and 13% for each of recall, F1 score, and accuracy. Performance trends for the 9-label format were similar.
Conclusions Publicly available LLMs, predominantly trained in global North settings, were found to be biased when used in a real-world clinical setting. Fine-tuning them on local data and clinical contexts can help make their use in resource-limited settings more reliable and equitable. Close collaboration between clinical and technical experts can help ensure that this powerful technology is made accessible to resource-limited, overburdened settings responsibly and without bias, and is used in ways that are safe, fair, and beneficial for all.
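For readers unfamiliar with the evaluation metrics named in the Methods, the following is a minimal illustrative sketch of how precision, recall, F1 score, and accuracy can be computed for a labelling task such as MCE, and how a ROUGE-L F-measure can be computed for a generated answer such as in MQA. It is not the study's evaluation code. It assumes token-level, micro-averaged scoring with an "O" tag for tokens outside any concept, and whitespace tokenisation for ROUGE-L; the function names, the label names ("problem", "treatment", "test"), and the example sentences are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def token_metrics(gold: Sequence[str], pred: Sequence[str],
                  outside: str = "O") -> Dict[str, float]:
    """Micro-averaged token-level precision, recall, F1, and accuracy.

    Assumption: one label per token, with `outside` marking tokens that
    belong to no concept. A true positive is a non-outside prediction
    that matches the gold label exactly.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if p != outside and p == g)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = correct / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def rouge_l_f1(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """ROUGE-L F-measure (equal weight on precision and recall),
    based on the longest common subsequence (LCS) of the two token lists."""
    m, n = len(reference), len(candidate)
    if m == 0 or n == 0:
        return 0.0
    # Dynamic-programming table: table[i][j] is the LCS length of
    # reference[:i] and candidate[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == candidate[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[m][n]
    if lcs == 0:
        return 0.0
    precision = lcs / n
    recall = lcs / m
    return 2 * precision * recall / (precision + recall)


# Hypothetical 3-label example (labels are illustrative, not from the study).
gold_labels = ["O", "problem", "problem", "O", "treatment", "O", "test", "O"]
pred_labels = ["O", "problem", "O", "O", "treatment", "treatment", "O", "O"]
mce_scores = token_metrics(gold_labels, pred_labels)

# Hypothetical question-answering example with whitespace tokenisation.
reference_answer = "the patient has stage two breast cancer".split()
model_answer = "patient has breast cancer".split()
mqa_score = rouge_l_f1(reference_answer, model_answer)

print(mce_scores)
print(round(mqa_score, 4))
```

Comparing such scores for a model before and after fine-tuning on local data, on the same held-out local notes, gives the kind of performance gap reported in the Results. Entity-level scoring, per-label averaging, or a different tokeniser would give different numbers, so the scoring convention should always be stated alongside the results.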
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study was funded by the Bill & Melinda Gates Foundation (INV-062576).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The IRB of Shaukat Khanum Memorial Cancer Hospital and Research Centre gave ethical approval for this work (Ethics Review Number EX-17-07-23-01).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Aggregate data will be made freely available on the study website. No patient-level data sharing is permitted under the ethics approval.