
Customize Deep Learning-based De-Identification Systems Using Local Clinical Notes - A Study of Sample Size

Xi Yang, Jiang Bian, Yonghui Wu
doi: https://doi.org/10.1101/2020.08.09.20171231
Xi Yang
1Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville FL USA
Jiang Bian
1Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville FL USA
Yonghui Wu
1Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville FL USA
  • For correspondence: yonghui.wu@ufl.edu

ABSTRACT

Electronic health records (EHRs) are a valuable resource for both clinical and translational research. However, much detailed patient information is embedded in clinical narratives, including a large amount of patient-identifiable information. De-identification of clinical notes is a critical technology for protecting patient privacy and confidentiality. Previous studies have presented many automated de-identification systems that capture and remove protected health information from clinical text. However, most were tested in a single-institution setting, where the training and test data came from the same institution. Directly applying these systems at another institution without customization could lead to a dramatic drop in performance. Recent studies have shown that fine-tuning is a promising method for customizing deep learning-based NLP systems across institutions. However, it is still unclear how much local data is required. In this study, we examined the customization of a deep learning-based de-identification system using different numbers of local notes from UF Health. Our results showed that fine-tuning could significantly improve model performance even with a small local dataset. However, once the local data exceeded a threshold (e.g., 700 notes in this study), the performance improvement became marginal.
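
The notion of a sample-size threshold beyond which improvement becomes marginal can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `plateau_size`, the `min_gain` tolerance, and the scores below are hypothetical placeholders chosen only for illustration, and they are not results from this study. The sketch assumes one evaluation score (for example, an F1 score) per local training-set size and returns the smallest size after which no further increase in data improves the score by at least `min_gain`.

```python
from typing import Optional, Sequence


def plateau_size(
    sizes: Sequence[int],
    scores: Sequence[float],
    min_gain: float = 0.01,
) -> Optional[int]:
    """Return the smallest sample size after which every further step
    improves the score by less than `min_gain`, or None if the curve
    is still rising at the largest size measured.

    `min_gain` is an assumed tolerance, not a value taken from the paper.
    """
    if len(sizes) != len(scores):
        raise ValueError("sizes and scores must have the same length")

    # Sort the (size, score) pairs by sample size.
    pairs = sorted(zip(sizes, scores))

    # Only sizes with at least one larger measurement are candidates.
    for i in range(len(pairs) - 1):
        later_gains = [
            pairs[j + 1][1] - pairs[j][1] for j in range(i, len(pairs) - 1)
        ]
        if all(gain < min_gain for gain in later_gains):
            return pairs[i][0]
    return None


# Hypothetical learning curve (made-up numbers, NOT the study's results).
example_sizes = [100, 200, 400, 800, 1600]
example_scores = [0.70, 0.80, 0.85, 0.87, 0.872]
threshold = plateau_size(example_sizes, example_scores)
```

With these placeholder values, the only step that gains less than 0.01 is the last one, so the function reports 800 as the plateau point. A stricter or looser `min_gain` would move that point, which is why any such threshold should be read as dataset- and tolerance-dependent.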

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study was partially supported by a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-2018C3-14754), a grant from the National Cancer Institute (1R01CA246418), a grant from the National Institute on Aging (R21AG062884), the University of Florida (UF) Informatics Institute Junior SEED Program (00129436), and the Cancer Informatics and eHealth core jointly supported by the UF Health Cancer Center and the UF Clinical and Translational Science Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

IRB exemption as a retrospective data analysis project

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

  • alexgre@ufl.edu, bianjiang@ufl.edu

  • The paper was accepted for an oral presentation at the 2020 KDD Workshop on Applied Data Science for Healthcare (https://dshealthkdd.github.io/dshealth-2020/)

Data Availability

The data will not be made publicly available.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted October 26, 2020.
Subject Area

  • Health Informatics