ABSTRACT
Rationale and objectives Acute kidney injury (AKI) is common among hospitalized patients and has a significant impact on morbidity and mortality. While early prediction of AKI has the potential to reduce adverse patient outcomes, it remains a difficult condition to predict and diagnose. The purpose of this study was to evaluate the ability of a machine learning algorithm to predict for AKI KDIGO Stage 2 or 3 up to 72 hours in advance of onset using convolutional recurrent neural nets (CNN) and patient Electronic Health Record (EHR) data.
Methods A CNN prediction system was developed to continuously and automatically monitor for incipient AKI. 7122 patient encounters were retrospectively analyzed from the Medical Information Mart for Intensive Care III (MIMIC-III) database.
New Predictors and Established Predictors New predictor - CNN machine learning-based AKI prediction model. Established predictors - XGBoost AKI prediction model and the Sequential Organ Failure Assessment (SOFA) scoring system.
Outcomes AKI onset.
Analytical Approach The model was trained on routinely-collected patient EHR data. Measurements included Area Under the Receiver Operating Characteristic (AUROC) curve, positive predictive value (PPV), and a battery of additional performance metrics for 72 hour advance prediction of AKI onset.
Results On a hold-out test set, the algorithm attained an AUROC of 0.85 and PPV of 0.25, relative to a cohort AKI prevalence of 5.21%, for long-horizon AKI prediction at a 72-hour window prior to onset.
Conclusions A CNN machine learning-based AKI prediction model outperforms XGBoost and the SOFA scoring system, demonstrating superior performance in predicting acute kidney injury 72 hours prior to onset, without reliance on changes in serum creatinine.
INTRODUCTION
Acute kidney injury (AKI) is a complex syndrome associated with large clinical and financial burdens [1–12]. Despite its prevalence in hospitalized patients [2,13] and reported incidence as high as 70% in the critically ill [13,14], no treatment has been developed to effectively reverse injury to the kidney and restore kidney function [1]. The reasons for this failure have been attributed to delays in diagnosis and intervention [2, 15–23], the complex nature of the AKI syndrome and the staging of its severity [3, 21], and its multiple etiologies [15,16].
Until recently, studies of incidence and outcomes of AKI have produced inconsistent results due to varying definitions of AKI [24–26]. The Risk, Injury, Failure, Loss, End-stage kidney disease (RIFLE) criteria [27], followed by the Acute Kidney Injury Network (AKIN) [28] and most recently the Kidney Disease: Improving Global Outcomes (KDIGO) criteria [29, 30] have provided consensus on an AKI definition. KDIGO guidelines define acute kidney injury as an absolute increase of serum creatinine (SCr) of >0·3 mg/dL within 48 hours or a relative increase of >50% over no more than 7 days [21, 29]. Doubling of SCr at steady state reflects an approximate 50% decrease in kidney function as assessed by glomerular filtration rate (GFR) [31]. Some studies have suggested that changes in SCr even smaller than 0.3 mg/dL within 48 hours are associated with significant increases in the risk of death, dialysis, and other morbidities [6, 21, 32–38], and other studies are consistent with worsening outcomes with increasing AKI stage [5, 24, 39–43]. However, increases of serum creatinine are known to lag kidney injury by hours to days after the initial kidney insult, and therefore recognition of AKI is delayed by reliance on SCr measurements [44,45].
Early AKI detection is critical to improving patient outcomes [46–49]. Given that the components necessary for defining and staging AKI are routinely available in the electronic health record (EHR) [3], a number of automated alerts have been developed to predict AKI events prior to onset. However, these alerts are generally triggered by detecting changes in SCr and/or urine output [17]. Because a range of kidney injury can exist before the loss of kidney function can be estimated with these standard laboratory tests [45,50], there is great interest in developing methods that could be used to detect AKI in patients at an earlier stage [51–57]. In this paper, we describe our methodology for the development of a convolutional neural net prediction system that continuously and automatically monitors for incipient AKI, using patient data extracted from the EHR, without requiring serum creatinine or urine output values.
MATERIALS AND METHODS
Description of data
This study uses data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC)-III version 1.3 dataset [58], collected at Beth Israel Deaconess Medical Center in Boston, MA from 2001 to 2012. The MIMIC dataset offers a variety of encounter information from more than 40,000 unique patients and includes both structured (e.g. lab results) and unstructured (e.g. clinician notes) data. Due to differences in the storage of patient procedures information, we restrict our study to data collected from 2008 to 2012 using the MetaVision (iMDSoft) EHR system, and do not include data collected from 2001 to 2008 using the CareVue (Philips) system [59]. Because the collection of the MIMIC data did not affect patient safety and because all data were anonymized in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, the Institutional Review Boards of Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology have waived the requirement for patient consent.
From the MetaVision EHR MIMIC encounters, we selected for inclusion those stays involving adult patients (i.e. age 18 years or older) with at least one measurement of diastolic blood pressure, systolic blood pressure, temperature, respiratory rate, heart rate, SpO2, and Glasgow Coma Scale. These measurements were selected because they are frequently available and easily collected at the patient bedside, even before clinical suspicion of AKI is present. These were the only direct variables used during training and testing of the algorithm. Serum creatinine was used to determine the gold standard of AKI true positive patients, but was not used as an input in testing. To facilitate the analysis of 72-hour advance prediction of AKI onset with a five-hour window of measurements upon which to base such a prediction, we required patient stay duration to be at least 77 hours in length. For convenience and with minimal restriction, we required that patient encounters lasted no more than 1000 hours. Inclusion criteria are listed in Figure 1 for 24, 28, and 72 hour prediction windows, and the demographic characteristics of encounters meeting the inclusion criteria are reported in Table 1.
Inclusion diagram.
Demographic characteristics of MIMIC III ICU encounters meeting the inclusion criteria of Figure 1. We note that the determination of KDIGO positive or negative was made after the data preprocessing steps described in the Methods section.
Overview of preprocessing, training, and testing
Patient encounters satisfying the inclusion criteria were immediately allocated to training and testing sets. Roughly 90% and 10% of all encounters were randomly allocated to the training (n = 6410 patient encounters) and testing sets (n = 710 patient encounters), respectively, stratifying by positive and negative class to ensure equal representation of classes in both sets. We binned the data by the hour, imputed missing measurements, and standardized measurements on a variable-by-variable basis. KDIGO Stage 2 or Stage 3 classifications were determined for each encounter, along with the corresponding times of KDIGO “onset” where appropriate. Stage 2 AKI is defined in the KDIGO staging system as an increase in SCr to more than 200% to 300% (>2- to 3-fold) from baseline or urine output <0.5 ml/kg per hour for more than 12 hours [29]. Stage 3 AKI is defined as an increase in SCr to more than 300% (>3-fold) from baseline, or ≥ 4.0 mg/dl (≥ 354 mmol/l), or kidney replacement therapy (KRT), or a decrease in estimated glomerular filtration rate (eGFR) to < 35 ml/min per 1.73m2 (if <18 years of age), or urine output < 0.5 mL/kg/hr for ≥ 24 hours or anuria for ≥ 12 hours [29].
A Doc2Vec embedding network was created to vectorize clinical text data. The embedding network was prepared on a large collection of mid-stay clinical notes, ranging from primary complaint to radiology notes, including everything up to, but not including, the discharge summary, from encounters allocated to the training set. The network embedded texts into 250-dimensional numeric vectors, which served as inputs to the classifiers, alongside the structured data associated with the stays. Training data were passed to a convolutional neural network (CNN) structure, with hyperparameters optimized on the training set using the Python-based optimization package Talos. After the end of training on each fold, network performance was evaluated using the hold-out test set. Results were reported as the average test set performance across cross-validation folds.
Structured data preprocessing
Structured data were binned by the hour, with multiple intra-hour measurements of the same variable replaced by their average. Missing measurements were handled separately for training and testing sets using last observation carried forward imputation.
Document vector encoding network and unstructured data preprocessing
To facilitate the use of unstructured text data alongside the structured inputs, we trained a Doc2Vec [60] embedding network with 250 nodes, trained on 238,468 mid-stay clinical notes. Document vectors were produced for the text data available from each encounter, using 125 epochs -- to better ensure the stability of inferred document vector -- and an initial learning rate of 0.01.
Training of neural network classifier
We constructed a classifier using the Python deep learning library, Keras, that uses variants of multi-channel, multi-headed attention together with convolutions to extract information from the quantitative time series data. A separate network for handling the document vector produced by the Doc2Vec network was combined downstream through concatenation in a fully-connected output layer. Model hyperparameters were optimized using the Nadam optimizer [61] as implemented in the Keras library with learning rate of 0.0009 and binary cross-entropy loss. A diagram of this neural network architecture is available as Supplementary Figure 1.
To fit the weights of the network with 10-fold cross-validation, we split the training data into 10 subsets of roughly equal size, and iteratively used 9 subsets for intra-fold training and the final subset for intra-fold testing.
Model parameters were fit over the course of 50 epochs on the 9 intra-fold training subsets, with evaluation on the final subset. For each iterate, we obtained a receiver operating characteristic (ROC) curve, as well as a battery of performance metrics. We then randomly reset the model parameters before performing another iterate. From cross-validation, we obtained an average ROC curve and average performance metrics, along with standard deviation for the performance metrics. Lastly, we evaluated the performance of the trained network on the original, 10% hold-out test set. These results are presented in comparison with an XGBoost [62] classifier and the Sequential Organ Failure Assessment (SOFA) score [63]. Although the SOFA score was not developed for the purpose of long-horizon AKI prediction, it has previously been shown to independently predict AKI risk and outcomes [64,65], and therefore serves as a validated comparison measure for AKI prediction. The XGBoost classifier was trained on the same training sets -- 5-hour windows of quantitative, clinical EHR data -- and evaluated on the same testing set. XGBoost hyperparameters were tuned using a cross-validated grid search on the training data.
RESULTS
The demographic characteristics associated with MIMIC III ICU encounters meeting the inclusion criteria of Figure 1 are provided in Table 1. The study population was 56.11% male, with few (1.74%) patients younger than 30 years of age and a substantial percentage of patients aged 70 years or more (21.03%). More than half (52.50%) of patients had stays lasting between 3 and 5 days, with a substantial percentage of patients experiencing stays of 12 days or longer (17.83%). The overall mortality rate was 34.84%, with 5.21% of encounters meeting the criteria for KDIGO Stage 2 or Stage 3 at some point, and 7.16% of stays meeting some stage of the KDIGO criteria at any point during the stay.
The results from 10-fold cross-validation on the 90% training set are reported in Table 2. The CNN model with the use of the Doc2Vec embeddings of encounter text data outperformed the XGBoost comparator model and the SOFA score for 72-hour advance prediction of KDIGO Stage 2 or Stage 3 onset. We note that, in order to provide non-summative performance metrics (i.e., the metrics other than area under the receiver operating characteristic (AUROC) curve), we selected an operating point for each model or score which provided a sensitivity nearest 0.80. The CNN model performed better (AUROC of 0.85) when text data were made available through Doc2Vec than when these data were unavailable (AUROC of 0.75). In addition, the quality of prediction was higher for KDIGO Stage 2 or Stage 3 onset, as compared with the prediction of onset for any of KDIGO Stages 1-3 (AUROC of 0.81). For corresponding CNN and XGBoost results without oversampling of the minority class, see Supplementary Table 1. The results from 10-fold cross-validation for prediction 48-hours (CNN AUROC of 0.835; PPV 0.236) and 24-hours (CNN AUROC of 0.856; PPV 0.221) prior to onset are reported in Supplementary Table 2 and Supplementary Table 3. Permutation feature importance methods were implemented to provide information on the relative importance of each input variable. Feature importance data are presented in Supplementary Table 4 and Supplementary Figure 2.
Results from 10-fold cross-validation on the MIMIC III data set. The convolutional neural network (CNN) model is compared with an XGBoost classifier, and the Sequential Organ Failure Assessment (SOFA) score. Additional comparison is made to the CNN model without the use of the Doc2Vec network (i.e., without unstructured text data) and for the prediction of KDIGO criteria of any stage. Abbreviations: area under the receiver operating characteristic (AUROC) curve; diagnostic odds ratio (DOR); positive and negative likelihood ratios (LR+ and LR-, respectively); positive and negative predictive value (PPV and NPV, respectively); standard deviation (SD)
The CNN model averaged a positive predictive value (PPV) of 0.25 over cross-validation folds for the 72-hour prediction of KDIGO Stages 2 and 3, compared to average PPVs of 0.13 and 0.11 for XGBoost and the SOFA score, respectively (Table 2). The advantage of the CNN mostly vanished (PPV of 0.16) in the absence of text data through Doc2Vec input. The average PPV was highest when the CNN classifier was given access to Doc2Vec input and tasked with 72-hour prediction of KDIGO Stages 1-3 (PPV of 0.30). Relative to the 5.21% prevalence of KDIGO Stages 2 and 3, positive predictions made by the CNN model enriched for KDIGO Stage 2 or 3 encounters by a factor of 4.80, whereas XGBoost and the SOFA score enriched these encounters by factors of 2.50 and 2.11, respectively.
The ROC curve comparison of 72-hour prediction on the 10% hold-out test set is shown in Figure 2. The CNN model, which was provided text data through Doc2Vec input, performed substantially better than the XGBoost model and the SOFA score. The XGBoost model and SOFA had similar performance on the test set.
ROC curve comparison of prediction performance using a convolutional neural net (CNN) classifier, an XGBoost (XGB) classifier, and the SOFA score, 72 hours prior to AKI onset on the MIMIC III ICU hold out data set. AUROC, Area Under the Receiver Operating Characteristic curve; SOFA, Sequential Organ Failure Assessment score.
DISCUSSION
These experiments demonstrate that a convolutional neural network can predict AKI up to 72 hours in advance of KDIGO Stage 2 or Stage 3 AKI onset, with AUROC performance superior to that of an XGBoost classifier and the SOFA scoring system (Table 2, Figure 2). Because of the ubiquity of the SOFA score, and previous usage in AKI prediction, it serves as a validated comparator for our current approach [64,65]. The XGBoost comparator is similarly important, primarily due to its broad and successful use in applications to other clinical prediction tasks (e.g., the 2019 Physionet Computing in Cardiology Challenge [66]).
The superiority of the CNN classifier to the XGBoost classifier and the commonly-used SOFA score is evidenced by key performance metrics, such as AUROC and PPV (Table 2). The PPV performance improvement is of particular importance. Romero-Brufau et al. have argued that AUROC performance may be misleading for clinicians interested in evaluating the clinical impact of a diagnostic tool, as AUROC does not incorporate information about the prevalence of a condition [67]. In fact, for the same reason, AUROC is useful for comparing the performance of tools retrospectively validated on different datasets. This concern regarding PPV and prevalence is relevant to our study, as we found that the prevalence of KDIGO Stages 2 or 3 is roughly 5% in the cohort. The AUROC is a summative metric which may include ranges of operating points which are irrelevant to a given task, whereas PPV can be focused on a clinically relevant operating point. To produce the metrics in Table 2, we chose operating points for the CNN and comparators which fixed their sensitivities near 0.80. Zeiberg et al. have recently proposed that tools for the prediction of low-prevalence diseases should have PPVs which are at least four times the prevalence, in order to be clinically useful [68]. Of the CNN, XGBoost, and SOFA score, only the CNN would qualify as being clinically useful on the basis of its PPV.
Beyond the text data input through Doc2Vec, CNN predictions were made using only age and 7 routinely collected patient measurements (diastolic blood pressure, systolic blood pressure, temperature, respiratory rate, heart rate, SpO2, and Glasgow Coma Scale) as inputs. Importantly, the CNN model did not rely on SCr to make predictions, distinguishing it from other AKI prediction tools. Creatinine levels can take hours or days to rise to AKI thresholds as defined in the KDIGO staging system [69]; therefore, changes in SCr may reflect pre-existing kidney damage. An AKI prediction tool which does not depend on SCr measurements may better afford clinicians the opportunity to intervene early, to prevent AKI development or progression, or to limit further kidney damage. Additionally, using only commonly collected variables in the EHR for AKI prediction allows automatic screening of a general patient population for impending AKI without requiring specialized evaluation.
This study contributes to the growing body of retrospective machine learning literature for the prediction of AKI [70]. Chiofolo et al. (2019) developed a model for AKI prediction and surveillance in ICU patients at a 6-hr prediction window with an AUROC of 0.88 [71]. Fletchet et al. (2017) developed the AKIpredictor, a prognostic calculator for prediction of AKI in ICU patients during the first week of stay [72]. Their KDIGO Stage 2 and 3 model produced AUROCs between 0.77 and 0.84. The AUROC of 0.84 corresponds to a prediction of KDIGO Stage 2 and 3 after gathering 24 hours of data. As a point of comparison, the CNN model used only 5 hours of data before making a prediction. Recent work by Tomasev et al. pursued a deep learning approach for continuous risk prediction of deterioration in acute kidney injury patients, and evaluated their tool on a Veteran’s Health Administration dataset of 703,782 adult patients. Algorithm performance at a 48-hour prediction window corresponded to a sensitivity of 55.8% and a specificity of 82.7% [73]. This performance is reported to be “in range” required for regulatory approval [74]. While these studies make important contributions to the domain of AKI research, they depend on the use of SCr to make predictions, which is a lagging marker of kidney function, and most make predictions at shorter prediction horizons than the 72-hour window described in this work.
While the MLA described in this study offers substantial lead time in AKI identification (up to 72 hours), and offers improved predictive performance over our previous work [75], it still requires prospective validation. Additionally, we cannot determine from this retrospective study what impact the algorithm might have on clinicians and their provision of care in clinical settings. Algorithm performance is assessed only on US patients older than age 18, with stays in the ICU, which limits the generalizability of our results to other patient populations and levels of care. Lastly, because there have been several proposed consensus definitions for AKI, the algorithm we described may have different results when compared against non-KDIGO definitions, or in settings which utilize a different standard in their diagnostic procedures.
CONCLUSION
A convolutional neural network for AKI prediction outperforms XGBoost and the traditional SOFA scoring system, demonstrating superior performance in predicting acute kidney injury up to 72 hours prior to onset, without reliance on measurements of changes in serum creatinine. The use of clinical text data through a Doc2Vec network substantially strengthens prediction performance. Such a tool may improve prediction and early detection of AKI in clinical settings, thereby allowing for earlier intervention.
Data Availability
The data that support the findings of this study are publicly available from http://www.nature.com/articles/sdata201635.
Disclosures
Author’s Contributions
R.D., S.L., and A.A. conceived and designed this study; S.L. and A.A. performed the modeling and statistical analysis; all authors contributed to acquisition, analysis, or interpretation of data; S.L., J.H., A.S., and E.P. drafted the article; all authors revised the article for important intellectual content; and R.D. obtained funding.
Support
This work was supported by the National Institute on Alcohol Abuse and Alcoholism (NIAAA) [grant ID: 1R43AA02767401]
Financial Disclosures
All authors who have affiliations listed with Dascena (Oakland, California, USA) are employees or contractors of Dascena.
Data Sharing Plan
The data that support the findings of this study are publicly available from http://www.nature.com/articles/sdata201635.