Abstract
Ideally, healthcare systems should be able to draw lessons from historical data, including whether common exposures are associated with adverse clinical outcomes. Unfortunately, structured clinical data, such as encounter diagnostic codes in electronic health records, suffer from multiple limitations and biases that hinder effective learning. We hypothesized that a machine learning approach to automate ascertainment of clinical events and disease history from medical notes would improve upon structured data and enable the estimation of real-world risks. We tested this approach on a timely goal: estimating the delayed risk of adverse cardiovascular events (i.e., after the index infection) in patients infected with respiratory viruses. Using 4,151 cardiologist-labeled notes as the gold standard, we trained a series of neural network models to automate event adjudication for heart failure hospitalization, acute coronary syndrome, stroke, and coronary revascularization and to identify past medical history of heart failure. Though performance varied by task, in nearly all cases our models surpassed structured data in sensitivity at a given specificity level and enabled principled evaluation of classification thresholds, which is typically impossible with diagnostic codes. Deploying our models on more than 17 million notes for 267,596 patients across an extensive integrated delivery network, we found that patients infected with respiratory syncytial virus had a 23% increased risk of delayed heart failure hospitalization over a subsequent 4-year period compared with propensity score-matched patients who had the same test but with negative results (p = 0.003, log-rank). In contrast, we found no such increased risk in patients with a positive influenza viral test compared with a negative test (rate ratio 0.98, p = 0.71).
We conclude that convolutional neural network-based models enable accurate clinical labeling at scale, thereby unlocking timely insights from unstructured clinical data.
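The comparison of "sensitivity for a given specificity level" can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names, the toy scores, and the rule of flagging a note when its score is at or above the threshold are all assumptions made for illustration. The point it shows is that a model emitting continuous scores can be tuned to any operating point, whereas a diagnostic code yields a single fixed sensitivity/specificity pair.

```python
from typing import Sequence, Tuple


def binary_sensitivity_specificity(
    flags: Sequence[int], labels: Sequence[int]
) -> Tuple[float, float]:
    """Sensitivity and specificity of a fixed yes/no rule (e.g. a diagnostic code)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    tp = sum(1 for f, y in zip(flags, labels) if y == 1 and f == 1)
    tn = sum(1 for f, y in zip(flags, labels) if y == 0 and f == 0)
    return tp / n_pos, tn / n_neg


def sensitivity_at_specificity(
    scores: Sequence[float], labels: Sequence[int], target_specificity: float
) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) for the most sensitive
    threshold whose specificity is still at least `target_specificity`.

    Assumed convention: a note is flagged when score >= threshold.
    """
    # Start from the trivial rule "flag nothing": sensitivity 0, specificity 1.
    best = (float("inf"), 0.0, 1.0)
    # Lowering the threshold can only raise sensitivity and lower specificity.
    for threshold in sorted(set(scores), reverse=True):
        flags = [1 if s >= threshold else 0 for s in scores]
        sens, spec = binary_sensitivity_specificity(flags, labels)
        if spec >= target_specificity and sens >= best[1]:
            best = (threshold, sens, spec)
    return best


# Toy data, invented for illustration only (not from the study).
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]          # hypothetical gold-standard labels
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]  # model outputs
toy_code_flags = [1, 0, 0, 1, 0, 0, 1, 0]      # hypothetical diagnostic-code hits

code_sens, code_spec = binary_sensitivity_specificity(toy_code_flags, toy_labels)
threshold, model_sens, model_spec = sensitivity_at_specificity(
    toy_scores, toy_labels, target_specificity=code_spec
)
```

On this toy data the code-based rule sits at sensitivity 0.50 and specificity 0.75, while the score-based model reaches sensitivity 0.75 at the same specificity by using a threshold of 0.60. A real evaluation would use held-out labeled notes and report uncertainty around these estimates.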
Competing Interest Statement
R.C.D. is supported by grants from the National Institutes of Health, the American Heart Association (One Brave Idea, Apple Heart and Movement Study) and GE Healthcare, has received consulting fees from Novartis and Pfizer, and is co-founder of Atman Health. C.A.M. is a consultant for Pfizer and co-founder of Atman Health.
Funding Statement
This work was supported by One Brave Idea, co-founded by the American Heart Association and Verily with significant support from AstraZeneca and pillar support from Quest Diagnostics (to CAM and RCD), the SENSHIN Medical Research Foundation (to SG), the Kanae Foundation for the Promotion of Medical Science (to SG), and the Mower fellowship (to SG).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study complies with all ethical regulations and guidelines. The study protocol was approved by local institutional review boards (IRB) of Mass General Brigham (2019P002651).
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
The data that support the findings of this study are available on request from the corresponding author R.C.D. upon approval of the data sharing committees of the respective institutions. The data are not publicly available due to the presence of information that could compromise research participant privacy.