Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

Privacy Protection for Chinese Electronic Medical Records Using Large Language Models: Effectiveness Evaluation and Application of LLM Models in Medical Data Tasks

View ORCID ProfileGong Mengchun, View ORCID ProfileOuyang Zihao, Ma Dandan, Cai Endi, Liu Chao, Shi Wenzhao, Zhang Bohan, Ma Lian, Wei Yuna, Jiang Huizhen, Zhou Xiang
doi: https://doi.org/10.1101/2025.07.27.25332177
Gong Mengchun
1Digital Health China Technologies Ltd., Beijing, China
2GMC Lab, School of Biomedical Engineering, Guangdong Medical University, Dongguan, China
3National Clinical Research Centre for Aging and Medicine, Huashan Hospital, Fudan University, Shanghai, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Gong Mengchun
Ouyang Zihao
1Digital Health China Technologies Ltd., Beijing, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Ouyang Zihao
Ma Dandan
4Information Center Department, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Sciences, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Cai Endi
1Digital Health China Technologies Ltd., Beijing, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Liu Chao
1Digital Health China Technologies Ltd., Beijing, China
2GMC Lab, School of Biomedical Engineering, Guangdong Medical University, Dongguan, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Shi Wenzhao
1Digital Health China Technologies Ltd., Beijing, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Zhang Bohan
1Digital Health China Technologies Ltd., Beijing, China
2GMC Lab, School of Biomedical Engineering, Guangdong Medical University, Dongguan, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Ma Lian
4Information Center Department, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Sciences, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Wei Yuna
4Information Center Department, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Sciences, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Jiang Huizhen
4Information Center Department, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Sciences, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Zhou Xiang
4Information Center Department, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Sciences, China
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: zx_pumc{at}163.com
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

Abstract

Background The privacy protection of medical patients has remained a critical concern in healthcare information management during the digital era. Conventional approaches have predominantly relied on rule-based protocols and data encryption systems, which typically require substantial involvement of IT professionals for implementation. Recent advancements in Large Language Models (LLMs) have introduced novel approaches for electronic medical records (EMRs) privacy protection, simultaneously enabling clinical practitioners to utilize these tools for specific data tasks.

Objectives This study aims to leverage LLMs through a no-code framework to achieve structured processing of patient privacy data in Chinese EMRs and formulate privacy policies, while evaluating the practical efficacy of LLMs.

Methods This study employs a disease-specific data subset from Peking Union Medical College Hospital (PUMCH), comprising data from approximately 160,000 patients, using a prompt engineering approach to enable LLMs to perform sensitive information annotation in lengthy EMR narratives. Simultaneously, it automates the classification of privacy-level for identified sensitive data and develops targeted protection strategies based on risk tiers, thereby mitigating non-essential exposure of patient privacy during data sharing. The research utilizes the Qwen model, with its entire workflow being exclusively driven by medical natural language prompts and self-evolving knowledge bases, requiring no supplementary programming or code development. These strategies were validated using the hospital’s test text dataset, with primary evaluation metrics focusing on precision rates (including accuracy of information extraction and privacy-level classification) and recall rate assessments for critical sensitive data categories.

Results Utilizing 4 million text entries from PUMCH, we conducted sampled data observation and performed privacy annotation via LLM prompts across seven categories: names, addresses, contact details, national ID numbers, hospital names, sexually transmitted disease (STD) information, and pregnancy-related patient data. Through iterative prompt refinement via error analysis, optimal performance was achieved on the test set, demonstrating an average precision of 97% and recall of 95% across these seven entity types. Furthermore, sensitivity tier classification was implemented for three high-risk categories: addresses, STD information, and pregnancy-related data, attaining average precision of 95% and recall of 90% in sensitivity-level determination.

Discussion We propose a novel codeless privacy protection framework leveraging LLMs, enabling intelligent anonymization of medical data through natural language interaction. This solution employs a three-tiered hierarchical protection mechanism that dynamically adapts privacy strategies to clinical scenario requirements, ensuring data security while maximizing data utility.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was supported by the National Key Research and Development Program of China (2023YFC2706305).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study has been approved by the Ethics Review Committee of Peking Union Medical College Hospital, Chinese Academy of Medical Sciences

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The datasets supporting the findings of this study are available from the corresponding author upon reasonable request. The hospital data cannot be made publicly available due to its sensitive nature.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Back to top
PreviousNext
Posted July 28, 2025.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
Privacy Protection for Chinese Electronic Medical Records Using Large Language Models: Effectiveness Evaluation and Application of LLM Models in Medical Data Tasks
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
Privacy Protection for Chinese Electronic Medical Records Using Large Language Models: Effectiveness Evaluation and Application of LLM Models in Medical Data Tasks
Gong Mengchun, Ouyang Zihao, Ma Dandan, Cai Endi, Liu Chao, Shi Wenzhao, Zhang Bohan, Ma Lian, Wei Yuna, Jiang Huizhen, Zhou Xiang
medRxiv 2025.07.27.25332177; doi: https://doi.org/10.1101/2025.07.27.25332177
Twitter logo Facebook logo LinkedIn logo Mendeley logo
Citation Tools
Privacy Protection for Chinese Electronic Medical Records Using Large Language Models: Effectiveness Evaluation and Application of LLM Models in Medical Data Tasks
Gong Mengchun, Ouyang Zihao, Ma Dandan, Cai Endi, Liu Chao, Shi Wenzhao, Zhang Bohan, Ma Lian, Wei Yuna, Jiang Huizhen, Zhou Xiang
medRxiv 2025.07.27.25332177; doi: https://doi.org/10.1101/2025.07.27.25332177

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Health Informatics
Subject Areas
All Articles
  • Addiction Medicine (576)
  • Allergy and Immunology (867)
  • Anesthesia (306)
  • Cardiovascular Medicine (4480)
  • Dentistry and Oral Medicine (449)
  • Dermatology (385)
  • Emergency Medicine (614)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1528)
  • Epidemiology (15276)
  • Forensic Medicine (31)
  • Gastroenterology (1133)
  • Genetic and Genomic Medicine (6643)
  • Geriatric Medicine (671)
  • Health Economics (1006)
  • Health Informatics (4602)
  • Health Policy (1378)
  • Health Systems and Quality Improvement (1622)
  • Hematology (544)
  • HIV/AIDS (1275)
  • Infectious Diseases (except HIV/AIDS) (15959)
  • Intensive Care and Critical Care Medicine (1110)
  • Medical Education (626)
  • Medical Ethics (147)
  • Nephrology (674)
  • Neurology (6692)
  • Nursing (346)
  • Nutrition (1006)
  • Obstetrics and Gynecology (1152)
  • Occupational and Environmental Health (961)
  • Oncology (3369)
  • Ophthalmology (988)
  • Orthopedics (370)
  • Otolaryngology (421)
  • Pain Medicine (437)
  • Palliative Medicine (131)
  • Pathology (668)
  • Pediatrics (1703)
  • Pharmacology and Therapeutics (699)
  • Primary Care Research (717)
  • Psychiatry and Clinical Psychology (5494)
  • Public and Global Health (9284)
  • Radiology and Imaging (2223)
  • Rehabilitation Medicine and Physical Therapy (1375)
  • Respiratory Medicine (1201)
  • Rheumatology (598)
  • Sexual and Reproductive Health (720)
  • Sports Medicine (535)
  • Surgery (720)
  • Toxicology (100)
  • Transplantation (290)
  • Urology (266)