PT - JOURNAL ARTICLE
AU - Shaan Khurshid
AU - Christopher Reeder
AU - Lia X. Harrington
AU - Pulkit Singh
AU - Gopal Sarma
AU - Samuel F. Friedman
AU - Paolo Di Achille
AU - Nathaniel Diamant
AU - Jonathan W. Cunningham
AU - Ashby C. Turner
AU - Emily S. Lau
AU - Julian S. Haimovich
AU - Mostafa A. Al-Alusi
AU - Xin Wang
AU - Marcus D.R. Klarqvist
AU - Jeffrey M. Ashburner
AU - Christian Diedrich
AU - Mercedeh Ghadessi
AU - Johanna Mielke
AU - Hanna M. Eilken
AU - Alice McElhinney
AU - Andrea Derix
AU - Steven J. Atlas
AU - Patrick T. Ellinor
AU - Anthony A. Philippakis
AU - Christopher D. Anderson
AU - Jennifer E. Ho
AU - Puneet Batra
AU - Steven A. Lubitz
TI - Cohort Design and Natural Language Processing to Reduce Bias in Electronic Health Records Research: The Community Care Cohort Project
AID - 10.1101/2021.05.26.21257872
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.05.26.21257872
4099 - http://medrxiv.org/content/early/2021/05/30/2021.05.26.21257872.short
4100 - http://medrxiv.org/content/early/2021/05/30/2021.05.26.21257872.full
AB - Background: Electronic health records (EHRs) promise to enable broad-ranging discovery with power exceeding that of conventional research cohort studies. However, research using EHR datasets may be subject to selection bias, which can be compounded by missing data, limiting the generalizability of derived insights. Methods: Mass General Brigham (MGB) is a large New England-based healthcare network comprising seven tertiary care and community hospitals with associated outpatient practices. Within an MGB-based EHR warehouse of >3.5 million individuals with at least one ambulatory care visit, we approximated a community-based cohort study by selectively sampling individuals longitudinally attending primary care practices between 2001 and 2018 (n=520,868), which we named the Community Care Cohort Project (C3PO).
We also used pre-trained deep natural language processing (NLP) models to recover vital signs (i.e., height, weight, and blood pressure) from unstructured notes in the EHR. We assessed the validity of C3PO by deploying established risk models, including the Pooled Cohort Equations (PCE) and the Cohorts for Aging and Genomic Epidemiology Atrial Fibrillation (CHARGE-AF) score, and compared model performance in C3PO to that observed within typical EHR Convenience Samples, which included all individuals from the same parent EHR with sufficient data to calculate each score but without a requirement for longitudinal primary care. All analyses were facilitated by the JEDI Extractive Data Infrastructure pipeline, which we designed to efficiently aggregate EHR data within a unified framework conducive to regular updates. Results: C3PO includes 520,868 individuals (mean age 48 years, 61% women, median follow-up 7.2 years, median primary care visits per individual 13). Based on report counts, C3PO contains an estimated 2.9 million or more electrocardiograms, 450,000 echocardiograms, 12,000 cardiac magnetic resonance images, and 75 million narrative notes. Using tabular data alone, 286,009 individuals (54.9%) had all vital signs available at baseline, which increased to 358,411 (68.8%) after NLP recovery (31% reduction in missingness). Among individuals with both NLP and tabular data available, NLP-extracted and tabular vital signs obtained on the same day were highly correlated (e.g., Pearson r range 0.95-0.99, p<0.01 for all). Both the PCE models (c-index range 0.724-0.770) and CHARGE-AF (c-index 0.782, 95% CI 0.777-0.787) demonstrated good discrimination. As compared to the Convenience Samples, AF and MI/stroke incidence rates in C3PO were lower and calibration error was smaller for both PCE (integrated calibration index range 0.012-0.030 vs. 0.028-0.046) and CHARGE-AF (0.028 vs.
0.036). Conclusions: Intentional sampling of individuals receiving regular ambulatory care and use of NLP to recover missing data have the potential to reduce bias in EHR research and maximize generalizability of insights. Competing Interest Statement: Dr. Philippakis receives sponsored research support from Bayer AG, IBM, Intel, and Verily. He has also received consulting fees from Novartis and Rakuten. He is a Venture Partner at GV and is compensated for this work. Dr. Ho receives sponsored research support from Bayer AG and Gilead Sciences. Dr. Ho has received research supplies from EcoNugenics. Dr. Friedman receives sponsored research support from Bayer AG and IBM. Dr. Anderson receives sponsored research support from Bayer AG and has consulted for ApoPharma and Invitae. Dr. Batra receives sponsored research support from Bayer AG and IBM, and consults for Novartis. Dr. Lubitz receives sponsored research support from Bristol Myers Squibb / Pfizer, Bayer AG, Boehringer Ingelheim, and Fitbit, has consulted for Bristol Myers Squibb / Pfizer and Bayer AG, and participates in a research collaboration with IBM. Dr. Ellinor receives sponsored research support from Bayer AG and IBM Research and has consulted for Bayer AG, Novartis, MyoKardia, and Quest Diagnostics. Dr. Atlas receives sponsored research support from Bristol Myers Squibb / Pfizer and has consulted for Bristol Myers Squibb / Pfizer and Fitbit. Dr. Ashburner has received sponsored research support from Bristol Myers Squibb / Pfizer. Dr. Diedrich, Dr. Mielke, Dr. Eilken, Dr. Derix, and Ms. Ghadessi are employees of Bayer AG. Funding Statement: Dr. Khurshid is supported by NIH T32HL007208. Dr. Haimovich is supported by NIH R38HL150212. Dr. Atlas is supported by American Heart Association (AHA) grant 18SFRN34250007. Dr. Ashburner is supported by NIH K01HL148506 and AHA 18SFRN34250007. Dr. Ho is supported by NIH R01HL134893, R01HL140224, and K24HL153669. Dr.
Lubitz is supported by NIH 1R01HL139731 and AHA 18SFRN34250007. Dr. Ellinor is supported by NIH 1R01HL092577, R01HL128914, K24HL105780, AHA 18SFRN34110082, and by the Foundation Leducq 14CVD01. Dr. Anderson is supported by NIH R01NS103924, U01NS069673, AHA 18SFRN34250007, and AHA-Bugher 21SFRN812095. This work was sponsored by Bayer AG. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Study protocols complied with the tenets of the Declaration of Helsinki and were approved by the Mass General Brigham Institutional Review Board. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Data Availability: MGB source data contain potentially identifying information and cannot be shared publicly. The JEDI data processing pipeline underlying C3PO is currently located in a private GitHub repository (https://github.com/broadinstitute/jedi), to which access will be granted upon request to the corresponding author. JEDI is in the process of being open-sourced under a BSD 3-Clause License and will be made publicly available on GitHub upon completion.
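
Worked check of the missingness figures: the Results report that vital-sign completeness rose from 286,009 to 358,411 of 520,868 individuals after NLP recovery, described as a 31% reduction in missingness. The sketch below reproduces that arithmetic from the three reported counts. It assumes "reduction in missingness" means the relative drop in the number of individuals lacking at least one vital sign; the abstract does not define the term, and the function and variable names are illustrative only, not part of the JEDI pipeline.

```python
# Counts taken from the Results of the abstract.
COHORT_SIZE = 520_868
COMPLETE_TABULAR = 286_009   # all vital signs available from tabular data alone
COMPLETE_WITH_NLP = 358_411  # all vital signs available after NLP recovery


def missingness_reduction(total: int, complete_before: int, complete_after: int) -> float:
    """Relative reduction in the number of individuals with incomplete data.

    Assumed definition (not stated in the abstract):
    (missing_before - missing_after) / missing_before.
    """
    missing_before = total - complete_before
    missing_after = total - complete_after
    return (missing_before - missing_after) / missing_before


completeness_before = COMPLETE_TABULAR / COHORT_SIZE
completeness_after = COMPLETE_WITH_NLP / COHORT_SIZE
reduction = missingness_reduction(COHORT_SIZE, COMPLETE_TABULAR, COMPLETE_WITH_NLP)

print(f"complete before NLP: {completeness_before:.1%}")
print(f"complete after NLP:  {completeness_after:.1%}")
print(f"relative reduction in missingness: {reduction:.0%}")
```

Under that assumed definition, the counts are consistent with the reported 54.9%, 68.8%, and 31%.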