Abstract
Antibiograms are essential tools in antimicrobial stewardship programs (ASPs), guiding empirical antibiotic therapy and tracking antimicrobial resistance (AMR) trends. However, manually compiling antibiograms from unstructured microbiology reports is labor-intensive and prone to delays. Here, we present a comparative study of three Natural Language Processing (NLP) approaches for automating data extraction from free-text reports: a rule-based Named Entity Recognition (NER) system, a statistical NER model built with the spaCy library, and a transformer-based question-answering (QA) model based on DistilBERT. We generated a synthetic dataset of 3,000 microbiology reports to evaluate these methods on extraction accuracy (precision, recall, F1-score) and computational efficiency. The rule-based NER achieved perfect accuracy (F1 = 1.00) with minimal computational resources, making it highly suitable for real-time deployment. The spaCy model, after domain-specific fine-tuning, performed equally well (F1 = 1.00) and handled linguistic variations effectively. In contrast, the transformer QA model showed moderate accuracy (F1 = 0.68-0.80): it excelled at extracting organism names but underperformed in detecting contamination status because of contextual ambiguities. Computational efficiency analysis revealed that the rule-based and spaCy NER models could process reports rapidly with limited resources, whereas the transformer QA model required substantial computational power, potentially limiting its clinical utility. Additionally, we developed a prototype expert system in R Shiny, STEWEX (Stewardship Expert System), which uses the rule-based NER to integrate extracted data into a real-time antibiogram dashboard, demonstrating the feasibility of these approaches in practical settings. The prototype can build a fully functional antibiogram in real time from simulated unstructured reports and simulated antibiotic susceptibility results.
In conclusion, our results suggest that while advanced NLP methods offer flexibility, rule-based NER systems provide unparalleled accuracy and efficiency for data extraction from unstructured reports in ASPs, a step that represents a bottleneck in antibiogram development. Future efforts will focus on validating these approaches on real-world clinical data, with the ultimate goal of fully automating antibiogram generation to support data-driven antimicrobial stewardship.
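As an illustration of the kind of rule-based extraction and scoring described above, the following is a minimal Python sketch and not the study's implementation. The organism list, the regular expressions, the function names `extract` and `precision_recall_f1`, and the sample report wording are all assumptions made for this example; a real system would rely on a curated vocabulary and more careful negation handling.

```python
import re

# Hypothetical organism vocabulary (assumption, not the study's list).
ORGANISMS = ["Escherichia coli", "Staphylococcus aureus", "Klebsiella pneumoniae"]
ORGANISM_RE = re.compile("|".join(re.escape(o) for o in ORGANISMS), re.IGNORECASE)

# A contamination mention, and a negated mention within the same sentence.
CONTAM_RE = re.compile(r"\bcontaminat(?:ed|ion|ant)", re.IGNORECASE)
NEGATED_CONTAM_RE = re.compile(
    r"\b(?:no|not|without)\b[^.]*?\bcontaminat", re.IGNORECASE
)


def extract(report):
    """Extract the first known organism and a contamination flag from free text."""
    match = ORGANISM_RE.search(report)
    organism = None
    if match:
        # Map the matched text back to its canonical spelling.
        organism = next(o for o in ORGANISMS if o.lower() == match.group(0).lower())
    contaminated = bool(CONTAM_RE.search(report)) and not NEGATED_CONTAM_RE.search(report)
    return {"organism": organism, "contaminated": contaminated}


def precision_recall_f1(predicted, gold):
    """Score a set of predicted items against a set of gold-standard items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example reports (not taken from the study's dataset).
reports = [
    "Blood culture grew Escherichia coli. No evidence of contamination.",
    "Urine culture: staphylococcus aureus isolated; likely contaminated sample.",
]
results = [extract(r) for r in reports]
```

In this sketch, each extracted field could be turned into a (report index, field, value) triple and compared with a hand-labeled gold set through `precision_recall_f1`, which mirrors the precision, recall, and F1-score evaluation named in the abstract.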
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data used in the study were simulated using R version 4.2. The simulation script is available from the authors upon reasonable request.