Abstract
When using tree-based methods to develop predictive analytics and early warning systems for preventive healthcare, it is important to use an appropriate imputation method to prevent the model from learning the missingness pattern. To demonstrate this, we developed a novel simulation that generated synthetic electronic health record data using a variational autoencoder with a custom loss function that accounted for the high missing rate of electronic health data. We showed that when tree-based methods learn missingness patterns (correlated with adverse events) in electronic health record data, performance decreases if the system is used in a new setting with different missingness patterns. In this scenario, performance is worst when the difference in missing rate between patients with and without an adverse event is greatest. We found that randomized and Bayesian regression imputation methods mitigate the issue of learning the missingness pattern for tree-based methods. We used this information to build a novel early warning system for predicting patient deterioration in general wards and telemetry units: PICTURE (Predicting Intensive Care Transfers and other UnfoReseen Events). To develop, tune, and test PICTURE, we used labs and vital signs from electronic health records of adult patients over four years (n = 133,089 encounters). We analyzed primary outcomes of unplanned intensive care unit transfer, emergency vasoactive medication administration, cardiac arrest, and death. We compared PICTURE with existing early warning systems and logistic regression at multiple levels of granularity.
When analyzing PICTURE on the testing set using all observations within a hospital encounter (event rate = 3.4%), PICTURE had an area under the receiver operating characteristic curve (AUROC) of 0.83 and an adjusted (event rate = 4%) area under the precision-recall curve (AUPR) of 0.27, while the next best tested method, regularized logistic regression, had an AUROC of 0.80 and an adjusted AUPR of 0.22. To ensure system interpretability, we applied a state-of-the-art prediction explainer that provided a ranked list of the features contributing most to each prediction. Though it is currently difficult to compare machine learning-based early warning systems, a rudimentary comparison with published scores demonstrated that PICTURE is on par with state-of-the-art machine learning systems. To facilitate more robust comparisons and development of early warning systems in the future, we have released our variational autoencoder's code and weights so researchers can (a) test their models on data similar to our institution's and (b) make their own synthetic datasets.
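Because precision depends on the event rate, AUPR values measured on cohorts with different event rates are not directly comparable. The abstract does not state how the adjusted AUPR was computed. The sketch below assumes the standard Bayes' rule rescaling of precision from sensitivity (TPR) and false positive rate (FPR), and it is not necessarily the authors' procedure. The function names and toy scores are hypothetical.

```python
def adjusted_precision(tpr, fpr, event_rate):
    """Precision (PPV) implied by a TPR/FPR pair at a chosen event rate (Bayes' rule)."""
    expected_tp = tpr * event_rate
    expected_fp = fpr * (1.0 - event_rate)
    if expected_tp + expected_fp == 0.0:
        return 1.0  # convention when nothing is predicted positive
    return expected_tp / (expected_tp + expected_fp)


def adjusted_aupr(scores, labels, event_rate):
    """Step-wise area under the precision-recall curve, with precision
    rescaled to `event_rate` instead of the observed event rate.
    Assumes labels are 0/1 and both classes are present."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all observations tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / n_pos
        precision = adjusted_precision(recall, fp / n_neg, event_rate)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy data (hypothetical): the same ranking scored at two event rates.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(adjusted_aupr(toy_scores, toy_labels, event_rate=0.5))
print(adjusted_aupr(toy_scores, toy_labels, event_rate=0.04))
```

The ranking is identical in both calls, so AUROC would be unchanged; only the precision axis is rescaled, which is why a common reference event rate is needed before comparing AUPR across studies.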
Highlights
Novel simulation shows that learning missingness patterns in EHR data decreases early warning system performance if the missingness pattern changes
Simulation generated synthetic EHR data using variational autoencoder with custom loss function to account for high missing rate
Randomized imputation and Bayesian regression imputation prevented tree-based methods from learning missingness patterns
Using appropriate imputation, we developed PICTURE, an early warning system for patient deterioration
PICTURE performance is comparable to currently used systems, and it can explain its predictions via feature ranking
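The point about imputation in the highlights can be illustrated with a toy simulation. This is a minimal sketch and not the released variational autoencoder simulation. It assumes a single lab value that carries no signal, is measured more often in patients with an event, and is imputed either with a constant sentinel or by drawing at random from the observed values. That draw is one simple form of randomized imputation, assumed here for illustration; the authors' exact procedure may differ. All names, rates, and the sentinel value are hypothetical.

```python
import random

SENTINEL = -1.0  # hypothetical out-of-range fill value


def simulate_lab(n, miss_rate_event, miss_rate_no_event, rng):
    """Return (labels, values); the lab value is pure noise, but its
    missingness depends on the outcome (assumed 10% event rate)."""
    labels, values = [], []
    for _ in range(n):
        event = rng.random() < 0.1
        miss_rate = miss_rate_event if event else miss_rate_no_event
        missing = rng.random() < miss_rate
        values.append(None if missing else round(rng.gauss(100.0, 15.0), 1))
        labels.append(int(event))
    return labels, values


def constant_impute(values, fill=SENTINEL):
    """Replace every missing value with the same constant."""
    return [fill if v is None else v for v in values]


def randomized_impute(values, rng):
    """Replace each missing value with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    return [rng.choice(observed) if v is None else v for v in values]


rng = random.Random(0)
labels, lab = simulate_lab(20000, miss_rate_event=0.2, miss_rate_no_event=0.8, rng=rng)

const_filled = constant_impute(lab)
rand_filled = randomized_impute(lab, rng)

# With constant fill, the single split "value == SENTINEL" recovers the
# missingness mask, so a tree can exploit it as a proxy for the outcome.
flagged = [y for v, y in zip(const_filled, labels) if v == SENTINEL]
base_rate = sum(labels) / len(labels)
rate_flagged = sum(flagged) / len(flagged)
print("event rate overall:", round(base_rate, 3))
print("event rate where lab == SENTINEL:", round(rate_flagged, 3))

# With randomized fill, imputed values come from the observed distribution,
# so no single value marks a row as originally missing.
print("rows equal to SENTINEL after randomized fill:", rand_filled.count(SENTINEL))
```

A model that relies on the sentinel split would lose that signal in a setting where the lab is ordered at the same rate for all patients, which is the failure mode the highlights describe.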
Competing Interest Statement
Christopher E. Gillies, Daniel F. Taylor, Fadi Islim, Richard P. Medlin and Kevin R. Ward submitted a patent regarding our machine learning methodologies presented in this paper through the University of Michigan's Office of Technology Transfer.
Funding Statement
The work here was unfunded.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study was approved by the University of Michigan's IRB.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
The code for our autoencoder is available here: https://github.com/MCIRCC/picture-simulation
Acronyms and Abbreviations
- AUROC: Area Under the Receiver Operating Characteristic Curve
- AUPR: Area Under the Precision-Recall Curve
- eCART: Electronic Cardiac Arrest Risk Triage
- EHR: Electronic Health Record
- ICU: Intensive Care Unit
- NEWS: National Early Warning Score
- PICTURE: Predicting Intensive Care Transfers and other UnfoReseen Events
- PPV: Positive Predictive Value
- SOFA: Sequential Organ Failure Assessment Score
- SHAP: SHapley Additive exPlanations
- XGBoost: eXtreme Gradient Boosting
- WDR: Workup-to-Detection Ratio