medRxiv

Building a Best-in-Class Automated De-identification Tool for Electronic Health Records Through Ensemble Learning

Karthik Murugadoss, Ajit Rajasekharan, Bradley Malin, Vineet Agarwal, Sairam Bade, Jeff R. Anderson, Jason L. Ross, William A. Faubion Jr., John D. Halamka, Venky Soundararajan, Sankar Ardhanari
doi: https://doi.org/10.1101/2020.12.22.20248270
  • Karthik Murugadoss, nference, Cambridge MA, USA
  • Ajit Rajasekharan, nference, Cambridge MA, USA
  • Bradley Malin, PhD, Vanderbilt University Medical Center, Nashville TN, USA
  • Vineet Agarwal, nference, Cambridge MA, USA
  • Sairam Bade, nference, Cambridge MA, USA
  • Jeff R. Anderson, PhD, Mayo Clinic, Rochester MN, USA
  • Jason L. Ross, nference, Cambridge MA, USA
  • William A. Faubion Jr., MD, Mayo Clinic, Rochester MN, USA
  • John D. Halamka, MD, Mayo Clinic, Rochester MN, USA
  • Venky Soundararajan, PhD, nference, Cambridge MA, USA
  • Sankar Ardhanari, nference, Cambridge MA, USA
  • For correspondence: venky@nference.net, sankar@nference.net

Abstract

The natural language portions of electronic health records (EHRs) communicate critical information about disease and treatment progression. However, the presence of personally identifiable information (PII) in these data constrains their broad reuse. Despite continuous improvements in methods for the automated detection of PII, residual identifiers in clinical notes still require manual validation and correction, and manual intervention does not scale to large EHR datasets. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Upon detection of PII, the system transforms the detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier. We evaluated the system with a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from the Mayo Clinic. Our approach outperforms other existing tools considered best-in-class, with a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and the Mayo Clinic data, respectively. The automated de-identification system presented here can enable the generation of de-identified patient data at the scale required for modern machine learning applications to help accelerate medical discoveries.
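
The flow described in the abstract (several detectors combined into an ensemble, detected identifiers swapped for fictional surrogates, and results scored by precision and recall) can be illustrated with a minimal Python sketch. It is not the authors' system. The regex patterns, the name lookup that stands in for an attention-based model, the union-and-merge ensembling strategy, the fixed surrogate table, and every function name below are assumptions chosen only for illustration.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive

# Hypothetical rule-based patterns; a real system would need far broader rules.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def rule_detector(text: str) -> List[Span]:
    """Flag phone numbers and dates with regular expressions."""
    spans = [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]
    spans += [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    return spans

# Assumption: a simple name lookup stands in for a learned sequence tagger.
KNOWN_NAMES = ("John Smith", "Smith")

def model_detector(text: str) -> List[Span]:
    """Flag every occurrence of a known name (placeholder for a model)."""
    return [
        (m.start(), m.end(), "NAME")
        for name in KNOWN_NAMES
        for m in re.finditer(re.escape(name), text)
    ]

def ensemble_detect(text: str,
                    detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Take the union of all detector outputs and merge overlapping spans.

    A union favors recall: a span is kept if any detector flags it.
    """
    spans = sorted(s for detect in detectors for s in detect(text))
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:  # overlaps the previous span
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    return merged

# Assumption: one fixed surrogate per label; realistic surrogates would vary.
SURROGATES = {"NAME": "Alex Doe", "PHONE": "555-010-0000", "DATE": "01/01/2000"}

def replace_with_surrogates(text: str, spans: List[Span]) -> str:
    """Rebuild the text, substituting a surrogate for each detected span."""
    out, cursor = [], 0
    for start, end, label in spans:
        out.append(text[cursor:start])
        out.append(SURROGATES.get(label, "[REDACTED]"))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

def precision_recall(predicted: List[Span],
                     gold: List[Span]) -> Tuple[float, float]:
    """Exact-match span precision and recall (labels ignored)."""
    pred = {(s, e) for s, e, _ in predicted}
    true = {(s, e) for s, e, _ in gold}
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

if __name__ == "__main__":
    note = "Patient John Smith called 507-555-0199 on 3/14/2019."
    found = ensemble_detect(note, [rule_detector, model_detector])
    print(replace_with_surrogates(note, found))
```

In this sketch, merging overlapping spans keeps the longer match when one detector flags "John Smith" and another flags only "Smith", so the surrogate replaces the full name once. Exact-match span scoring is only one possible evaluation choice; token-level or partial-overlap scoring would give different numbers.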

Competing Interest Statement

Jeff R. Anderson, John D. Halamka, and William A. Faubion Jr. do not have any conflicts of interest in this project. Bradley Malin is a contracted consultant of the Mayo Clinic. Karthik Murugadoss, Ajit Rajasekharan, Vineet Agarwal, Sairam Bade, Jason L. Ross, Venky Soundararajan, and Sankar Ardhanari are employees of and have a financial interest in nference. Mayo Clinic and nference may stand to gain financially from the successful outcome of the research.

Funding Statement

No external funding

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This research was conducted with approval from the Mayo Clinic Institutional Review Board 20-003235 - CDAP: Data De-identification Methods Development. All analysis of EHRs was performed in the privacy-preserving environment secured and controlled by the Mayo Clinic. nference and the Mayo Clinic subscribe to the basic ethical principles underlying the conduct of research involving human subjects as set forth in the Belmont Report and strictly ensure compliance with the Common Rule in the Code of Federal Regulations (45 CFR 46) on Protection of Human Subjects. For more information, please visit https://www.mayo.edu/research/institutional-review-board/overview

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

The 2014 I2B2 data are publicly available, subject to a signed safe-usage agreement and for research use only. The Mayo EHR clinical notes are not publicly available at this time.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted February 23, 2021.

Subject Area

  • Health Informatics