Abstract
Introduction Delirium is a cerebral dysfunction seen commonly in the acute care setting. Delirium is associated with increased mortality and morbidity and is frequently missed in the emergency department (ED) by clinical gestalt alone. Identifying those at risk of delirium may help prioritize screening and interventions.
Objective Our objective was to identify clinically valuable predictive models for prevalent delirium within the first 24 hours of hospitalization based on the available data by assessing the performance of logistic regression and a variety of machine learning models.
Methods This was a retrospective cohort study to develop and validate a predictive risk model to detect delirium using patient data obtained around an ED encounter. Data from electronic health records for patients hospitalized from the ED between January 1, 2014, and December 31, 2019, were extracted. Eligible patients were aged 65 or older, admitted to an inpatient unit from the emergency department, and had at least one DOSS assessment or CAM-ICU recorded while hospitalized. The outcome measure of this study was delirium within one day of hospitalization determined by a positive DOSS or CAM assessment. We developed the model with and without the Barthel index for activity of daily living, since this was measured after hospital admission.
Results The area under the ROC curves for delirium ranged from .69 to .77 without the Barthel index. Random forest and gradient-boosted machine showed the highest AUC of .77. At the 90% sensitivity threshold, gradient-boosted machine, random forest, and logistic regression achieved a specificity of 35%. After the Barthel index was included, random forest, gradient-boosted machine, and logistic regression models demonstrated the best predictive ability with respective AUCs of .85 to .86.
Conclusion This study demonstrated the use of machine learning algorithms to identify the combination of variables that are predictive of delirium within 24 hours of hospitalization from the ED.
INTRODUCTION
Delirium is a global cerebral dysfunction seen in 8% up to 64% of patients in the acute care setting. The prevalence of delirium in the emergency department (ED) and inpatient populations is surprisingly high.1-3 The presence of delirium is associated with a prolonged hospital stay, a higher likelihood of skilled nursing facility placement, and a 2-to 4-fold increase in mortality.4 Despite the mortality rate being comparable to myocardial infarction, the fluctuating nature of symptoms, uncertainty of baseline cognitive function, and limited diagnostic modality lead to diagnostic dilemmas. By clinical gestalt alone, providers miss up to 80% of patients experiencing delirium upon presentation to the ED.5
Unfortunately, delirium continues to be underdiagnosed and undertreated.5 Although several cognitive assessment tools exist, they require additional training and dissemination. 6-8 Early screening and interventional options are emerging and seem promising, as reported by several recent studies.9-11 There is a need to identify an optimal screening strategy for delirium beyond cognitive assessment because, until we have it, delirium will likely remain an elusive diagnosis. An accurate prediction model derived from variables assessed around the time of the ED visit could be a solution to identifying patients who are at risk for delirium and may benefit the most from screening and preventive measures.
Our objective was to identify a clinically valuable predictive models for prevalent delirium within the first 24 hours of hospitalization based on the available data by assessing the performance of logistic regression and a variety of machine learning models.
METHODS
Study Design
This was a retrospective cohort study using data from a tertiary care medical center. Data from electronic health records for patients hospitalized from the ED between January 1, 2014, and December 31, 2019, were extracted from Institute for Clinical Translational Science Data Warehouse. The study was approved by the local institutional review board (IRB), and we adhered to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD).12
Participants
The study population comprised patients hospitalized from an academic Level 1 trauma center ED with approximately 60,000 visits a year. At this institution, hospitalized patients aged 65 and older were screened for delirium twice daily from the time of hospitalization until discharge. The nursing protocol was in place to assess patients with the Delirium Observation Screening Scale (DOSS), a 13-item screen observation scale for non-intubated patients. If the patient was ventilated, the Confusion Assessment Method for the ICU (CAM-ICU) was used.
Eligible patients were aged 65 or older, admitted to an inpatient unit from the ED, and had at least one DOSS assessment or CAM-ICU recorded within the first 24 hours of hospitalization. If a patient was hospitalized more than once between January 2015 and December 2019, we randomly selected an encounter defined by medical record number.
Outcome
The outcome measure of this study was a positive delirium diagnosis within one day of hospitalization determined by a positive DOSS or CAM assessment. Twenty-four hours was chosen as the cut-off length for study inclusion to optimize the model’s ability to identify delirium cases most related to factors observed around the time of the ED visit. A positive outcome was defined as one or more positive screenings within 24 hours even if a prior assessment was negative.
Predictors
We collected data on patient demographics, medical histories, physiological measurements, medications administered, and lab results. The following variables were used to generate and test a model: age at the time of hospitalization, sex, gender, history of stroke, dementia, severe illness defined by meeting two or more Systemic Inflammatory Response Syndrome (SIRS) criteria, transient ischemic attack (TIA), diagnosis of intracranial hemorrhage in the ED, tachypnea, and visual or hearing impairment. The physiological variables collected at the time of ED evaluation were heart rate, respiratory rate, Body Mass Index (BMI), and temperature. Medications ordered were obtained with a drug flag for opioids and benzodiazepines. We defined the anticholinergic variable as receipt of drugs classified as level 2 or 3 on an updated version of the Anticholinergic Drug Scale (Supplemental table 1 displays anticholinergics received by the sample).13,14 Although there is not an established Activity of Daily Living (ADL) assessment in the ED at this institution, nursing staff in inpatient units recorded a Barthel index, a measurement of the degree of assistance required by a patient determined by 10 variables describing ADL and mobility.15 The Barthel index was included as a continuous variable where a higher Barthel index is indicative of a higher level of independence. ADL was an important predictor for delirium in the literature but the Barthel index is not available at the time of the ED visit, so we examined the models without the Barthel index as the primary analysis and without as the secondary analysis.16
Analysis
We reported summary statistics for the population by positive delirium screening. Means and Standard Deviation (SD) summarized continuous variables; frequency counts and proportions summarized categorical variables.
Missing values were imputed using KNN-imputation which has been shown to outperform other widely used imputation methods.17 Possible outliers were identified with the IQR Extreme Value analysis. To prevent the loss of information about variability in the study, clinical reasoning was used to determine if an outlier reflects the study population. Variance inflation factors were observed to detect multicollinearity, and continuous variables were checked for linearity by examining plots of the continuous independent variables versus the logit of the outcome.
We compared the predictive performance of five machine learning models using the Python Machine Learning library Scikit-Learn.18 The algorithms included Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boosting Machine (GBM), and Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and K Nearest Neighbor (KNN) with an intention to identify an interpretable model. Cross-validation was implemented for both hyperparameter tuning and model evaluation with AUC as the evaluation metric. This re-sampling method was selected over repeated sub-sampling to prevent any loss of information about the positive class by ensuring every observation appears in both the training and test data.
To avoid an optimistic bias that can result from using the same cross-validation procedure for both hyperparameter tuning and model evaluation, nested cross-validation was employed. In nested cross-validation, k-fold cross-validation for hyperparameter tuning is nested inside the k-fold cross-validation for model evaluation. Using tenfold nested cross-validation, the data was randomly divided into 10 equally-sized subsets. Out of the 10 sets, 9 were used to train the classifier, and the 10th was used for testing. The training set was further partitioned into 5 folds for an inner cross-validation grid search to optimize hyperparameters. This process was repeated until each of the 10 subsets had served as the test set. Similar to a regular cross-validation procedure, evaluation metrics are obtained by averaging the test set scores of the 10 runs. By conducting model selection independently in each trial of the model fitting procedure, the risk of overfitting during hyperparameter tuning is reduced. The final models were selected using a 10-fold cross-validation grid search on all available data.
Sample Size
To examine the importance of sample size on the predictive power of a model, the performance of a gradient boosted model was tested at various sample sizes. Subsets of the data were created with stratified random sampling. At each sample size, 10-fold cross validation with 10 repeats was employed to obtain an AUC estimate and standard error. AUC estimates and confidence intervals were plotted against sample size. Increasing the sample size from 4,500 to 22,500 resulted in a .011 AUC increase. This suggests a sample size larger than the sample used in the analysis would not result in a significant increase in predictive power. However, as the sample size increased, the standard error decreased which creates more confidence in the AUC estimate. (Supplemental figure 1)
RESULTS
Participants
We identified a total of 60,790 unique ED encounters that resulted in hospitalization during the study period. Out of these, 24,132 of them did not have at least one DOSS or CAM-ICU recorded in their health records within the first 24 hours of a hospitalization, reducing the study population to 37,609. An additional 4,056 encounters were excluded for not having a DOSS or CAM-ICU within one day of hospitalization. After removing those with multiple encounters, the remaining 22,269 ED patients met all study criteria, and 4,955 patients had a positive delirium screening within one day of hospitalization (figure 1 and table 1).
The mean age was 76.4 (SD 8.2), and 50.7% were male. A total of 21,036 (94.5%) were white. The mean BMI was 28.5 (SD 6.7). Overall, 3,532 (15.9%) of them had a past history of dementia, 3,349 (15.0%) had a past history of stroke, and 1,889 (8.5%) had a diagnosis of intracranial hemorrhage in the ED. The rest of the variables, vital signs, and medications that were given in the ED are shown in Table 1. Predictors with missing values were respiratory rate (0.3%), heart rate (0.3%), BMI (2.4%), and the Barthel index (2.7%). Lastly, a total of 766 (3.4%) died during hospitalization.
Model Specification
We reported the top 3 of the full prediction models to allow predictions for individuals. Tables 2, 3, and 4 have a list of variables used in the logistic regression, random forest, and gradient boosting machine models. These tables list coefficients for logistic regression and variable importance for the two other models (Tables 2, 3, and 4).
Table 2 outlines the coefficients for the logistic regression model with an inverse regularization parameter of .5. Relative feature importance scores were similar for the gradient-boosted machine and random forest models (Tables 2, 3, and 4). Highly ranked variables included the history of dementia, age, and BMI. The Barthel Index score also highly ranked when it was included. (Supplemental table 2, 3, and 4)
These prediction models were developed to put variables into a model-based probability calculator. This practice can be implemented in any electronic medical record using an equation or an online calculator. For example, we estimated the probability of a positive delirium screening based on the combination of demographic and clinical data (Table 5).
Model Performance
Figures 2 and 3 illustrate the predictive performance of each model with receiver operating characteristic (ROC) curves (Figure 2, 3). The area under the ROC curves ranged from .69 to .77. In this analysis, RF and GBM followed by LR demonstrated the best predictive ability with respective AUCs of .772 (95% CI .77, .774), .77 (95% CI .768, .772), and .767 (95%CI .764, .769). At the 90% sensitivity threshold, RF, GBM and LR models achieved a specificity of 33-35%. When the Barthel index was added to the model, we noticed improved accuracy and AUC (Table 6).
DISCUSSION
This study used several variables from the ED to develop and validate the delirium prediction model within the first 24 hours of hospitalization. The rate of delirium based on the screening was up to 21% within the first 24 hours, which underscores the need to screen for delirium early in the hospital stay, preferably in the ED, and prevent or treat delirium as early as possible. Our study identified a combination of variables that included demographic information, medical history, and medications by using a machine-learning approach to predict delirium. We also explored use of the Barthel index, a measure of functional status, which was measured after hospital admission, since that could inform delirium risk after hospitalization.
We identified that the rate of delirium based on the screening was up to 21% within the first 24 hours. The prevalence of delirium in the emergency department (ED) is estimated to be 8%–17%, but it is difficult to identify without a screening process.1-3 The lack of an effective screening process leads to underdiagnosed and undertreated delirium.5 The prevalence of delirium in the inpatient unit, including intensive care units, is 18%–64% in the literature, which underscores the need for prevention and treatment strategy prior to hospitalization.19 Although several cognitive assessment tools exist, they require training and dissemination, and compliance can be limited without strong merit and support from leadership and stakeholders.6-8 Early screening and interventional options are emerging and seem promising, as reported by several recent studies, but these programs depend on effective screening.9-11 Our study may help to narrow down the high-risk patient population who need further delirium screening.
Our study identified delirium using DOSS and CAM-ICU as the reference standards in the clinical setting. The fluctuating nature of delirium poses a challenge to clinicians, as evidenced by Lewis et al., who reported that the estimated misdiagnosis rate of delirium in the ED was up to 80% in the 1990s, and the rate of misdiagnosis has remained high.5,16 The study’s findings highlight the limitation of clinical gestalt without any additional diagnostic modality or assessment tools. Our inpatient unit routinely uses the DOSS and CAM-ICU for ventilated patients, and validation studies showed that sensitivity and specificity were above 90%.20,21 Because of their superior accuracy over clinical gestalt in the ED, we used DOSS and CAM-ICU as an approximation of the delirium outcome in this study. This approach likely included both ones who presented with delirium in the ED and ones who developed delirium after ED visit. A future study should investigate both the prevalence and the incidence of delirium by screening for delirium in the ED and inpatient unit to explore the distinctive features of those who come to the ED with delirium and those who develop delirium during a hospital stay.
The Area Under Curve (AUC), sensitivity, specificity for the RF, GBM, and LR were 0.76-0.77, 0.90, 0.33-0.35, respectively. These improved significantly after the Barthel index was included (Table 6). The probability of developing delirium based on the three scenarios ranged from 7% to 96% (Table 5). We previously analyzed the diagnostic characteristics of three delirium prediction models that could be applied in the past: delirium risk score, risk prediction model, and susceptibility score.3,16,22 These models were examined in our retrospective hospital-wide data, and AUC ranged from 0.71 to 0.8.23 The use of two delirium screenings, the combination of variables, and the machine learning approach improved our model’s predictive ability only after the Barthel index was included. Our model prioritized higher sensitivity as it can be used as a screening tool, possibly triggered by electronic health records. It requires confirmatory tests for delirium.
Our study reported the importance of variables, such as age and dementia which cannot be modified, heart rate, respiratory rate, and severe illness, which can be modified but may reflect underlying illness which may or may not be modifiable. The Barthel index is an important variable that is potentially modifiable depending on the cause of functional deficits, for example with feeding assistance and early mobilization, but may also reflect the effects of delirium. The importance of benzodiazepines and anticholinergics was not high, but these drugs can be decreased with a deprescribing program.24,25 These findings highlighted several modifiable variables that we can approach in the ED and hospital setting.
Our prediction models were developed by machine learning algorithms. We selected random forest, gradient boosting machine, logistic regression, KNN, decision tree, Gaussian Naïve Bayes, and SVC. The use of GBM, RF demonstrated the highest AUC, and this was similar to the logistic regression-based model. The recent study of the prediction of postoperative delirium by Wong et al. used penalized logistic regression, GBM, artificial neural network with a single hidden layer, and linear support vector machine, and the random forest and the GBM model reported an AUC of 0.855.26 The comparison of the machine learning-driven models added a significant impact on how to identify patients who have a high risk of developing delirium while minimizing bias.
Strengths and Limitations
The strength of our study was that we used either DOSS or CAM-ICU on all hospitalized older adults, so the study was able to provide a large dataset to derive and validate the delirium prediction model with minimal variance. We used the outcome of positive delirium screening in the first 24 hours as the best available proxy for delirium in the ED. Since delirium is missed in the ED and our prediction model will help to identify a high-risk group, which fits with using the model to identify prevalent and impending delirium. Another strength is that the use of interpretable machine learning algorithms with cross-validation enabled us to identify the best model with minimal bias and avoid overfitting. Lastly, the list of probability with the presence or absence of predictors will be informative to clinicians utilizing the tool in the clinical setting.
There are several limitations that are important to recognize. First, our institution is unique in that routine delirium screening is available, but almost half of the admitted older adults did not have delirium screening within the first 24 hours of hospitalization, so it is prone to measurement bias. Second, we took the first Barthel index in the inpatient unit, so it may have measured the functional status after they developed delirium. We thought that Barthel index was unlikely to change within the 24 hours of hospitalization following ED arrival, but reverse causality from the effects of delirium on functional status is possible. The model without the Barthel index still demonstrated accuracy for delirium, but we explored use of the Barthel index in the model since the loss of activity of daily living (ADL) is an important predictor for delirium, and ADL can be estimated in the ED.16 Third, a list of anticholinergics is vast, and the drug flag may not capture all anticholinergics. We reviewed drug names under anticholinergics several times and updated data to overcome this challenge. Fourth, this study was conducted in a primarily white population and would benefit from further validation in a more diverse population.
CONCLUSION
This study demonstrated the use of machine learning algorithms to identify the combination of variables that are predictive of delirium within the 24 hours of hospitalization from the ED. The discovery of a predictive model that clinicians can use as a clinical decision aid could lead to improved detection of delirium and identification of a high-risk group. This contribution is significant because the findings will introduce a clinical decision aid that either clinicians use actively or receive passively from machine learning algorithms, overcoming the limitation of misdiagnosis or under diagnosis by clinical gestalt alone to detect delirium. Our future objective will be to develop a clinical decision aid integrated into electronic medical record to predict delirium in real-time, so ED providers and the inpatient team can focus on delirium screening for high-risk individuals and implement a delirium prevention program.
Data Availability
Data files are available in GitHub as listed below.
Acknowledgement
This study was supported by an award to Sangil Lee by the NIDUS delirium network (No. NIA R24AG054259). Research reported in this publication was supported by the National Center For Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002537. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health