medRxiv

Application of Generative Artificial Intelligence to Utilise Unstructured Clinical Data for Acceleration of Inflammatory Bowel Disease Research

Alex Z Kadhim, Zachary Green, Iman Nazari, Jonathan Baker, Michael George, Ashley Heinson, Matt Stammers, Christopher M Kipps, R Mark Beattie, James J Ashton, Sarah Ennis
doi: https://doi.org/10.1101/2025.03.07.25323569
Alex Z Kadhim
1Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
2National Institute for Health Research (NIHR) Southampton Biomedical Research Centre, Southampton, UK
Zachary Green
1Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
3Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
Iman Nazari
1Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
Jonathan Baker
3Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
Michael George
4Clinical Informatics Research Unit, University Hospital Southampton NHS Trust, Southampton, UK
5Southampton Emerging Therapies and Technologies (SETT) Centre, University Hospital Southampton NHS Trust, UK
Ashley Heinson
4Clinical Informatics Research Unit, University Hospital Southampton NHS Trust, Southampton, UK
Matt Stammers
4Clinical Informatics Research Unit, University Hospital Southampton NHS Trust, Southampton, UK
5Southampton Emerging Therapies and Technologies (SETT) Centre, University Hospital Southampton NHS Trust, UK
6Department of Gastroenterology, University Hospital Southampton NHS Trust, UK
Christopher M Kipps
5Southampton Emerging Therapies and Technologies (SETT) Centre, University Hospital Southampton NHS Trust, UK
7Department of Neurology, University Hospital Southampton NHS Trust, Southampton, UK
R Mark Beattie
3Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
James J Ashton
1Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
3Department of Paediatric Gastroenterology, Southampton Children’s Hospital, Southampton, UK
Sarah Ennis
1Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
2National Institute for Health Research (NIHR) Southampton Biomedical Research Centre, Southampton, UK
  • For correspondence: S.Ennis{at}soton.ac.uk

ABSTRACT

Background Inflammatory bowel disease (IBD) research is a dynamic field. However, the growing volume of electronic health records (EHRs) and research data presents significant challenges. Traditional methods for structuring unstructured medical records are labour-intensive and lack scalability. Large language models (LLMs) may present a solution, yet their usefulness in data standardisation in the context of IBD remains unknown.

Objective To evaluate the use of LLMs in structuring free-text histology and radiology reports from IBD patients, compare their performance to manual clinician curation, and assess the usefulness of fine-tuning and retrieval-augmented generation (RAG).

Design We developed an IBD-specialised LLM-based framework utilising structured prompt engineering and fine-tuning. Reports were manually curated by clinicians and processed using various LLMs. Performance was assessed against the manual curation, and RAG was used to enhance model responses with clinical guidelines from the European Crohn’s and Colitis Organisation (ECCO) and the European Society for Paediatric Gastroenterology, Hepatology and Nutrition (ESPGHAN).
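The retrieval step of RAG can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in a simple lexical (Jaccard) overlap ranking where a production system would typically use embedding similarity, and the function names, prompt wording, and passage strings are all hypothetical placeholders rather than real guideline text.

```python
import re

def tokenise(text):
    """Lower-case a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, passages, k=2):
    """Rank passages by Jaccard overlap with the query and return the top k."""
    query_tokens = tokenise(query)

    def score(passage):
        passage_tokens = tokenise(passage)
        union = query_tokens | passage_tokens
        return len(query_tokens & passage_tokens) / len(union) if union else 0.0

    return sorted(passages, key=score, reverse=True)[:k]

def build_prompt(report, passages, k=2):
    """Prepend the retrieved passages to the report (hypothetical prompt layout)."""
    context = "\n".join(f"- {p}" for p in retrieve(report, passages, k))
    return f"Guideline context:\n{context}\n\nReport:\n{report}\n\nExtract findings as JSON."

# Placeholder passages only; these are NOT quotations from ECCO or ESPGHAN.
passages = [
    "PLACEHOLDER guideline passage about histology of the colon",
    "PLACEHOLDER guideline passage about imaging of the small bowel",
    "PLACEHOLDER guideline passage about nutrition",
]
report = "MRI imaging shows thickening of the small bowel"
prompt = build_prompt(report, passages, k=1)
```

Only the best-matching passage is added to the prompt here, which keeps the context short; the choice of `k` and of the similarity measure are assumptions of this sketch.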

Results Overall, Llama 3.3 achieved the highest F1 scores for histology and imaging reports (1 ± 0 and 0.85 ± 0.29, respectively) in extracting findings and anatomical regions, surpassing the other models in structured data generation. Fine-tuning improved the performance of the smaller Llama 3.1 8B model on imaging reports (from 0.7 ± 0.46 to 0.82 ± 0.35), enabling better extraction with reduced computational requirements.
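For readers unfamiliar with the metric, the following is a minimal sketch of how a per-report F1 score and its spread could be computed. It assumes, without confirmation from the abstract, that each report is scored by set overlap of (finding, region) pairs against the clinician curation and that the ± values are a mean and standard deviation across reports; the example data are hypothetical.

```python
from statistics import mean, stdev

def f1_score(predicted, reference):
    """Set-based F1 between predicted and reference (finding, region) pairs."""
    if not predicted and not reference:
        return 1.0  # nothing to extract and nothing extracted
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)

def summarise(scores):
    """Return the mean and sample standard deviation of per-report scores."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# Hypothetical (model output, clinician curation) pairs for two reports.
reports = [
    ({("inflammation", "ileum"), ("ulceration", "colon")},
     {("inflammation", "ileum"), ("ulceration", "colon")}),
    ({("inflammation", "ileum")},
     {("inflammation", "ileum"), ("stricture", "ileum")}),
]
scores = [f1_score(pred, ref) for pred, ref in reports]
average, spread = summarise(scores)
```

In this toy example the first report is extracted perfectly, while the second misses one finding, so its precision is 1 and its recall is 0.5.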

Conclusion Our findings demonstrate the feasibility of LLM-based automated structuring of IBD-related medical records. Unstructured data from free text reports can be reliably converted to standardised ontologies with location, severity, and qualifiers. These advancements enable scalable, privacy-compliant AI-driven solutions for data standardisation.
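As an illustration of what a standardised record with location, severity, and qualifiers might look like, here is a minimal sketch that validates JSON produced by a model. The field names, the `Finding` class, and the example output are hypothetical and are not the schema or ontology used in the study.

```python
import json
from dataclasses import dataclass, field

@dataclass
class Finding:
    """One structured finding; field names are illustrative assumptions."""
    finding: str
    location: str
    severity: str = "unspecified"
    qualifiers: list = field(default_factory=list)

def parse_findings(raw):
    """Parse a JSON array of findings, rejecting items that lack required keys."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of findings")
    findings = []
    for item in items:
        if not isinstance(item, dict) or "finding" not in item or "location" not in item:
            raise ValueError(f"missing required keys: {item!r}")
        findings.append(Finding(
            finding=item["finding"],
            location=item["location"],
            severity=item.get("severity", "unspecified"),
            qualifiers=list(item.get("qualifiers", [])),
        ))
    return findings

# Hypothetical model output for a single histology report.
example_output = (
    '[{"finding": "inflammation", "location": "terminal ileum", '
    '"severity": "moderate", "qualifiers": ["chronic", "active"]}, '
    '{"finding": "granuloma", "location": "caecum"}]'
)
parsed = parse_findings(example_output)
```

Validating the output before storing it means a malformed response is rejected rather than silently entering the research dataset.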

What is already known on this topic Traditional methods for structuring unstructured medical records for research are labour-intensive and lack scalability. IBD patients generate vast quantities of longitudinal medical data due to the chronicity of disease. Large language models (LLMs) are well-positioned for data extraction and standardisation purposes.

What this study adds This study demonstrates that Llama 3.3-70B and fine-tuned smaller models (Llama 3.1 8B) can accurately structure IBD-related histology and radiology reports. Additionally, retrieval-augmented generation (RAG) enhances clinical interpretability by incorporating guideline-based context.

How this study might affect research, practice or policy The use of LLMs in structuring EHR data can significantly accelerate IBD research, improve data standardisation, and facilitate privacy-compliant AI-driven solutions for clinical decision support and policy development.

Competing Interest Statement

JJA is a SAB member for Orchard Therapeutics.

Funding Statement

This study was supported by the Institute for Life Sciences, University of Southampton, the NIHR Southampton Biomedical Research Centre, and EPSRC (EP/Y01720X/1). JJA is funded by an NIHR Advanced Fellowship (NIHR302478). ZG is funded by a CICRA research training fellowship.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics committee/IRB of University Hospital Southampton gave ethical approval for this work (REC 09/H0504/125)

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

Due to patient privacy restrictions, neither the underlying clinical data nor the fine-tuned models can be publicly shared. The models fully fine-tuned on 90 histology and imaging reports are available upon request.

https://github.com/UoS-HGIG/

https://huggingface.co/UoS-HGIG

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted March 10, 2025.
Subject Area

  • Gastroenterology