ABSTRACT
Background As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI’s ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability.
Methods We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (eg, clinician notes, radiology reports, lab reports, etc.) to output a set of structured variables required for RWD analysis. This research used a nationwide electronic health record (EHR)-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely based on an ML model (ie, not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information.
Results We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manually abstracted data. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates.
Conclusions NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.
Competing Interest Statement
All authors are employees of Flatiron Health, Inc., which is an independent member of the Roche group, and own stock in Roche.
Funding Statement
This study was sponsored by Flatiron Health, Inc. (Flatiron Health), which is an independent member of the Roche group.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Institutional Review Board (WCG IRB) approval of the study protocol was obtained prior to study conduct, and included a waiver of informed consent.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The data that support the findings of this study have been originated by Flatiron Health, Inc. Requests for data sharing by license or by permission for the specific purpose of replicating results in this manuscript can be submitted to dataaccess{at}flatiron.com.
Abbreviations
- AI
- artificial intelligence
- BERT
- bidirectional encoder representations from transformers
- EHR
- electronic health records
- LSTM
- long term short memory
- ML
- machine learning
- NPV
- negative predictive value
- NSCLC
- non–small cell lung cancer
- P&Ps
- Policies and Procedures
- PPV
- positive predictive value
- RWD
- real world data
- RWE
- real-world evidence