medRxiv

Approach to Machine Learning for Extraction of Real-World Data Variables from Electronic Health Records

Blythe Adamson, Michael Waskom, Auriane Blarre, Jonathan Kelly, Konstantin Krismer, Sheila Nemeth, James Gippetti, John Ritten, Katherine Harrison, George Ho, Robin Linzmayer, Tarun Bansal, Samuel Wilkinson, Guy Amster, Evan Estola, Corey M. Benedum, Erin Fidyk, Melissa Estevez, Will Shapiro, Aaron B. Cohen
doi: https://doi.org/10.1101/2023.03.02.23286522
Blythe Adamson
1Flatiron Health, Inc., New York, NY, United States
2The Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, Department of Pharmacy, University of Washington, Seattle, WA, United States
PhD, MPH
  • For correspondence: badamson@flatiron.com
Michael Waskom
1Flatiron Health, Inc., New York, NY, United States
PhD
Auriane Blarre
1Flatiron Health, Inc., New York, NY, United States
MEng
Jonathan Kelly
1Flatiron Health, Inc., New York, NY, United States
MEng
Konstantin Krismer
1Flatiron Health, Inc., New York, NY, United States
PhD
Sheila Nemeth
1Flatiron Health, Inc., New York, NY, United States
PhD
James Gippetti
1Flatiron Health, Inc., New York, NY, United States
MS
John Ritten
1Flatiron Health, Inc., New York, NY, United States
Katherine Harrison
1Flatiron Health, Inc., New York, NY, United States
George Ho
1Flatiron Health, Inc., New York, NY, United States
Robin Linzmayer
1Flatiron Health, Inc., New York, NY, United States
Tarun Bansal
1Flatiron Health, Inc., New York, NY, United States
MS
Samuel Wilkinson
1Flatiron Health, Inc., New York, NY, United States
Guy Amster
1Flatiron Health, Inc., New York, NY, United States
PhD
Evan Estola
1Flatiron Health, Inc., New York, NY, United States
Corey M. Benedum
1Flatiron Health, Inc., New York, NY, United States
PhD, MPH
Erin Fidyk
1Flatiron Health, Inc., New York, NY, United States
ANP-BC, MBA
Melissa Estevez
1Flatiron Health, Inc., New York, NY, United States
MS
Will Shapiro
1Flatiron Health, Inc., New York, NY, United States
MS
Aaron B. Cohen
1Flatiron Health, Inc., New York, NY, United States
3Department of Medicine, NYU Grossman School of Medicine, New York, NY, United States
MD, MSCE

ABSTRACT

Background Advances in artificial intelligence (AI), including breakthroughs in natural language processing (NLP) and machine learning (ML) such as OpenAI’s ChatGPT, are creating new opportunities to curate electronic health records (EHR) efficiently into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of these industry methods in order to promote transparency and explainability.

Methods We applied NLP with ML techniques to train, validate, and test models that extract information from unstructured documents (eg, clinician notes, radiology reports, and lab reports) and output a set of structured variables required for RWD analysis. This research used a nationwide EHR-derived database. Models were selected based on performance. ML-extracted variables are those whose value is determined solely by an ML model (ie, not confirmed by abstraction) that identifies key information in visit notes and documents. These models do not predict future events or infer missing information.
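
As a rough illustration of the train, validate, and test workflow described in Methods, the sketch below partitions labeled documents by patient so that no patient contributes documents to more than one partition. The patient-level split, the 70/15/15 proportions, the record fields, and the helpers `assign_partition` and `split_documents` are assumptions made for illustration; the abstract does not state how the partitions were constructed.

```python
import hashlib
from collections import defaultdict


def assign_partition(patient_id: str, train: float = 0.70, validation: float = 0.15) -> str:
    """Deterministically map a patient to 'train', 'validation', or 'test'.

    Hypothetical helper: hashing the patient ID, rather than sampling
    documents at random, keeps all documents from one patient together.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # Scale the first 8 hex digits to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 0x100000000
    if fraction < train:
        return "train"
    if fraction < train + validation:
        return "validation"
    return "test"


def split_documents(documents):
    """Group document records by the partition of their patient."""
    partitions = defaultdict(list)
    for doc in documents:
        partitions[assign_partition(doc["patient_id"])].append(doc)
    return dict(partitions)


# Synthetic placeholder records, used only to exercise the code.
documents = [
    {"patient_id": f"p{i % 50}", "text": f"note {i}", "label": i % 2}
    for i in range(200)
]
partitions = split_documents(documents)
```

Splitting on a hash of the patient identifier is one common way to avoid leakage between partitions, and it gives the same assignment every time the pipeline is rerun.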

Results We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manual abstraction. These extraction methods produced research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates.
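
One way to compare an ML-extracted binary variable against a manually abstracted reference is to compute sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The sketch below is illustrative only: the function names and the toy labels are assumptions, and the abstract does not report which metrics or values were used for each variable.

```python
def confusion_counts(reference, extracted):
    """Count TP, FP, FN, TN, treating the abstracted labels as the reference."""
    if len(reference) != len(extracted):
        raise ValueError("reference and extracted must have the same length")
    pairs = list(zip(reference, extracted))
    tp = sum(1 for r, e in pairs if r and e)
    fp = sum(1 for r, e in pairs if not r and e)
    fn = sum(1 for r, e in pairs if r and not e)
    tn = sum(1 for r, e in pairs if not r and not e)
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def performance(reference, extracted):
    """Summarize agreement between ML-extracted and abstracted labels."""
    tp, fp, fn, tn = confusion_counts(reference, extracted)
    return {
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
    }


# Toy labels (1 = variable present), not data from the study.
abstracted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
ml_extracted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = performance(abstracted, ml_extracted)
# Here TP=3, FP=1, FN=1, TN=5, so sensitivity and PPV are 0.75
# and specificity and NPV are 5/6.
```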

Conclusions NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.

Competing Interest Statement

All authors are employees of Flatiron Health, Inc., an independent member of the Roche group, and own stock in Roche.

Funding Statement

This study was sponsored by Flatiron Health, Inc. (Flatiron Health), which is an independent member of the Roche group.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Institutional Review Board (WCG IRB) approval of the study protocol was obtained prior to study conduct, and included a waiver of informed consent.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The data that support the findings of this study originated from Flatiron Health, Inc. Requests for data sharing by license or by permission for the specific purpose of replicating results in this manuscript can be submitted to dataaccess@flatiron.com.

  • Abbreviations

    AI: artificial intelligence
    BERT: bidirectional encoder representations from transformers
    EHR: electronic health records
    LSTM: long short-term memory
    ML: machine learning
    NPV: negative predictive value
    NSCLC: non–small cell lung cancer
    P&Ps: Policies and Procedures
    PPV: positive predictive value
    RWD: real-world data
    RWE: real-world evidence
  • Copyright 
    The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
    Posted March 06, 2023.
    Subject Area

    • Oncology