1 Abstract
Rationale The multidisciplinary mortality and morbidity conference is the core of programs that aim to improve the quality of trauma care and is used to identify and address opportunities for improvement based on reviewing patient cases. Current systems rely on audit filters for review selection, a process that is hampered by high frequencies of false positives.
Objectives To develop, validate, and compare the performance of different machine learning models for predicting opportunities for improvement.
Methods We conducted a registry-based study using all patients from the Karolinska University Hospital who had been reviewed regarding the presence of opportunity for improvement, a binary consensus decision from the mortality and morbidity conference. We developed eight binary classification models using 45 predictors. Training used an 80%-20% train-test split, and 1000 resamples without replacement were used to estimate confidence intervals. Performance (sensitivity, specificity, integrated calibration index, area under the receiver operating characteristics curve) was also compared to current audit filters.
Measurements and Main Results The dataset included 6310 patients, of whom 431 (7%) had opportunities for improvement. The audit filters (area under the receiver operating characteristics curve: 0.624) were outperformed by all machine learning models. The best performing model was LightGBM (area under the receiver operating characteristics curve: 0.789).
Conclusions Machine learning models outperform the currently used audit filters and could prove to be valuable additions in the screening for opportunities for improvement. More research is needed on how to increase model performance and how to incorporate these models into trauma quality improvement programs.
Impact Our research explores a novel system that predicts opportunities for improvement in trauma patients using machine learning and outperforms the established approach using audit filters. This could allow for reallocation and optimization of review resources as well as a means of possibly identifying new types of opportunities for improvement. The methodology uses state-of-the-art machine learning largely unexplored in a medical context adding to a broader general scientific knowledge reaching outside this project.
2 Introduction
Trauma is a leading cause of death and disease burden globally for people aged 10 to 49 years (1, 2). Multidisciplinary mortality and morbidity conferences are at the core of programs that aim to improve the quality of trauma care and thereby patient outcomes (3, 4). The mortality and morbidity conferences aim to identify and address opportunities for improvement (OFI) by reviewing patient cases. Conducting this review is associated with providing high-quality trauma care (5), including reduced preventable death rates, complication frequencies, and hospitalization duration (6).
A mortality and morbidity conference is conducted through meetings in which representatives from all disciplines and professions involved in trauma care participate (3). During these meetings the care provided in specific patient cases is discussed and compared with optimal trauma care under optimal treatment conditions (7, 8). OFIs have been associated with failures in initial care (9), specifically airway management, fluid resuscitation, hemorrhage control and chest injury management (10–12).
The review process uses audit filters and/or individual review to select trauma patients from a local registry (13). Trauma audit filters “represent sentinel events in patient care which are associated with poor outcomes and/or sub-optimal care”, such as delays in performance of key tests or treatments, or unexpected deaths (3, 14). When such a “fall out” event occurs, it should ideally trigger review and, if appropriate, correction of systematic errors and/or individual practitioner feedback (14). The use of trauma audit filters has been associated with high frequencies of false positives, ranging from 24% to 80% in different contexts (10, 11, 15). To improve the precision of the selection process, trauma mortality prediction models have been proposed. However, the performance of these models has been poor (15–17), most likely because the models were developed to predict mortality and not morbidity or failures in trauma care.
Different machine learning (ML) based prediction models have been used for several outcomes in trauma care (18), but the use of ML for OFI prediction has never been investigated. Therefore, our aim was to develop, validate, and compare the performance of different machine learning models for predicting OFI.
Some of the results of this study were presented orally at the London Trauma Conference 2022.
3 Methods
We conducted a registry-based study using all trauma patients included in both the Karolinska University Hospital trauma registry and trauma care quality database between 2014 and 2021 to compare the performance of supervised ML models and audit filters in their ability to predict OFI. The study was approved by the Stockholm Research Ethics Review Board, approval numbers 2021-02541 and 2021-03531.
3.1 Study Setting and Population
The Karolinska University Hospital in Solna, Stockholm, Sweden, is equivalent to a level 1 trauma center and manages approximately 1500 acute trauma patients each year (19).
The Karolinska University Hospital trauma registry reports to the Swedish Trauma Registry (19) and includes all patients admitted to the Karolinska University Hospital with trauma team activation, regardless of injury severity score (ISS), as well as patients admitted without trauma team activation but found to have an ISS of more than 9. The registry includes data on vital signs, times, injuries and interventions as well as patient demographics according to the European consensus statement, the Utstein template (20). The Karolinska University Hospital trauma care quality database includes data relevant to the mortality and morbidity conferences, including audit filters, identified OFIs, and proposed corrective actions.
The mortality and morbidity conferences at Karolinska University Hospital include all professions involved in trauma care: surgery, neurosurgery, orthopaedics, anaesthesia and intensive care, nursing, and radiology. The presence of an OFI is a consensus decision from the conference, which also decides on appropriate corrective actions. The review is a multistage process with escalating levels of review. Mortality is directly escalated to the multidisciplinary conference, which in addition to OFI also decides whether the death was preventable or possibly preventable, both of which are classified as an OFI. To identify non-mortality poor outcomes, the review process was improved and formalized during the study period. Between 2014 and 2017, trauma patients were individually reviewed by a specialized trauma nurse to identify possible OFIs to be escalated to the multidisciplinary conference, but the process was not formalized. From 2017, the process was formalized; a brief individual first review by a specialized nurse was performed when data were registered in the trauma registry and the trauma quality database. Audit filters (supplement E1) were applied to the trauma quality database. All “fall outs” from audit filters and all trauma patients with a possible failure of care identified by the nurse during the first review were reviewed again in a second review by two specialized nurses. If the second review identified a possible OFI, the trauma patient’s process of care was reviewed in the multidisciplinary morbidity conference to decide whether or not there was an OFI in trauma care.
3.2 Eligibility Criteria
We included all patients screened for OFI from the trauma registry and trauma care quality database between 2014 and 2021. Patients below the age of 15 were excluded due to their clinical pathway differing compared to adults.
3.3 Outcome
The models’ outcome is the presence of an OFI, as decided by the mortality and morbidity conference, and defined as a binary variable with the levels “Yes - At least one OFI identified” and “No - No OFI identified.” Preventable or possibly preventable deaths were also considered an OFI. The data for the outcome were extracted from the trauma care quality database.
3.4 Predictors
All variables from the trauma registry, and included in the revised Utstein template, were considered as potential predictors, including information from the pre-hospital setting, initial care, and subsequent in-hospital care. This information includes, but is not limited to, initial vital signs, times to and types of procedures and interventions, length and level of care, injuries, injury mechanisms and type as well as standard demographics. Both continuous and categorical predictors existed. The final models were built with 45 predictors. For a complete list of predictors, see Table E1 in the online data supplement.
While recommendations regarding the relationship between sample size and the number of predictors for logistic regression exist (21, 22), the optimal number of predictors for other learners is not well researched. To compare all models on equal terms, we chose to use all predictors regardless of learner, and used a pragmatic approach where we included all available data, leaving a sample size of 6310 patients.
3.5 Statistical Analysis Methods
All statistical analyses were done using R (23). We ran all analyses on an 80%-20% train-test split 1000 times using random resampling without replacement.
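As an illustration of the resampling scheme described above, the following is a minimal Python sketch, not the authors' R code. It assumes that each resample is an independent random 80%-20% partition of the rows; the function name `repeated_splits`, the seed, and the use of only three resamples in the example are choices made here for brevity.

```python
import random

def repeated_splits(n_rows, n_resamples=1000, test_fraction=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs; within a split no row is drawn twice."""
    rng = random.Random(seed)
    n_test = round(n_rows * test_fraction)
    indices = list(range(n_rows))
    for _ in range(n_resamples):
        # rng.sample with k == n_rows returns a random permutation (no replacement)
        shuffled = rng.sample(indices, k=n_rows)
        yield shuffled[n_test:], shuffled[:n_test]

# Three resamples of a 6310-row dataset (the study used 1000)
splits = list(repeated_splits(n_rows=6310, n_resamples=3))
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

With 6310 rows this gives 5048 training rows and 1262 test rows per resample.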
Data preprocessing and imputation
We developed a pre-processor to be used on each resample’s data by rescaling continuous features using Yeo-Johnson’s (24) power transformation and recoding categorical features using dummy variables via “one-hot encoding.” Predictors with near-zero variance were removed. Missing continuous predictors were imputed using the mean of the predictor, while categorical predictors were imputed by introducing an unknown category. If the continuous values for blood pressure or respiratory rate were missing but the corresponding categorical “revised trauma score” values were present, imputation was instead done using the mean of all patients in that corresponding “revised trauma score” category. A missing indicator feature was created for each predictor.
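The imputation rules in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the study's pre-processor: the helper names, the example values, and the fallback to the overall mean when a group has no observed values are all assumptions made here.

```python
from statistics import mean

def impute_continuous(values, groups=None):
    """Mean-impute None entries and return (imputed values, missing indicators).

    If `groups` is given (e.g. a revised trauma score category per patient),
    a missing value is filled with the mean of its own group when that group
    has observed values, and with the overall mean otherwise (an assumption).
    """
    overall = mean(v for v in values if v is not None)
    group_means = {}
    if groups is not None:
        for g in set(groups):
            in_group = [v for v, gv in zip(values, groups) if gv == g and v is not None]
            if in_group:
                group_means[g] = mean(in_group)
    imputed = []
    for i, v in enumerate(values):
        if v is None:
            key = groups[i] if groups is not None else None
            v = group_means.get(key, overall)
        imputed.append(v)
    indicators = [int(v is None) for v in values]
    return imputed, indicators

def impute_categorical(values, unknown="Unknown"):
    """Replace None with an explicit unknown level and return missing indicators."""
    imputed = [unknown if v is None else v for v in values]
    indicators = [int(v is None) for v in values]
    return imputed, indicators

# Hypothetical systolic blood pressures with their categorical counterparts
sbp = [120.0, None, 80.0, 90.0, None]
rts_sbp = ["4", "4", "3", "3", "3"]
sbp_imputed, sbp_missing = impute_continuous(sbp, groups=rts_sbp)

intubated = ["No", None, "Yes"]
intubated_imputed, intubated_missing = impute_categorical(intubated)
```

In the example, the second patient receives the mean of category "4" and the last patient the mean of category "3", while the indicator columns record which values were originally missing.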
The developed pre-processor was then run separately on the training and test sets for each resample. To balance the training sets, we used the Adaptive Synthetic (ADASYN) (25) algorithm, a method to generate synthetic data and thus upsample the OFI outcomes and obtain a 1:1 ratio between the outcome classes.
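To make the upsampling step more concrete, the sketch below shows the core idea of ADASYN in plain Python: minority points surrounded by many majority neighbours receive more synthetic samples, and each synthetic point is an interpolation between a minority point and one of its minority neighbours. It is a simplified illustration that assumes purely continuous features and Euclidean distance; it is not the implementation used in the study, and the toy data are invented.

```python
import math
import random

def adasyn(minority, majority, k=5, seed=0):
    """Return synthetic minority points aiming at a roughly 1:1 class ratio."""
    rng = random.Random(seed)
    n_needed = len(majority) - len(minority)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    minority_pool = [(p, 1) for p in minority]

    def neighbours(point, pool):
        # k nearest items in `pool`, excluding the point itself (by identity)
        others = [item for item in pool if item[0] is not point]
        return sorted(others, key=lambda item: math.dist(point, item[0]))[:k]

    # Difficulty of each minority point: share of majority points among its neighbours
    ratios = [sum(1 for _, y in neighbours(p, labelled) if y == 0) / k for p in minority]
    total = sum(ratios)
    if total == 0:
        weights = [1 / len(minority)] * len(minority)  # no overlap: spread evenly
    else:
        weights = [r / total for r in ratios]

    synthetic = []
    for point, weight in zip(minority, weights):
        for _ in range(round(weight * n_needed)):
            partner = rng.choice(neighbours(point, minority_pool))[0]
            gap = rng.random()  # position along the line from point to partner
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(point, partner)))
    return synthetic

# Invented two-feature toy data: 40 majority and 8 minority points
data_rng = random.Random(1)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
minority = [(data_rng.gauss(1.5, 1), data_rng.gauss(1.5, 1)) for _ in range(8)]
synthetic = adasyn(minority, majority)
print(len(minority) + len(synthetic), len(majority))
```

Because the per-point counts are rounded, the resulting class sizes are approximately rather than exactly equal.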
Model development
Eight ML models were built using the Tidymodels framework (26). We used logistic regression (LR), random forest (RF) (27), decision tree (DT) (28), support vector machine with a radial basis kernel (SVM) (29), XGBoost (30), LightGBM (31), CatBoost (32), and k-nearest neighbor (k-NN) (33). A short description of each learner is included in the online supplement under section E2. The hyperparameters of all models were optimized on the first resample’s training data and then reused for the following resamples. Hyperparameter optimization was done using 5-fold cross-validation and a simple grid search of size 30 over all parameters made available by the Tidymodels framework.
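The tuning procedure can be illustrated with a small, self-contained Python sketch of grid search with 5-fold cross-validation. It is not the Tidymodels code used in the study: the toy one-dimensional data, the k-NN scorer, and the three candidate values (standing in for a grid of 30) are all assumptions chosen to keep the example short.

```python
import random
from statistics import mean

def kfold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, validation_idx) pairs that partition the rows into folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

def grid_search(candidates, fit_and_score, n_rows, n_folds=5):
    """Return the candidate with the best mean validation score, plus all scores."""
    results = {}
    for params in candidates:
        scores = [fit_and_score(params, tr, va) for tr, va in kfold_indices(n_rows, n_folds)]
        results[params] = mean(scores)
    return max(results, key=results.get), results

# Invented toy data: one feature, a threshold rule, and some flipped labels
xs = [i / 100 for i in range(100)]
ys = [1 if x > 0.5 else 0 for x in xs]
for i in range(0, 100, 9):
    ys[i] = 1 - ys[i]

def knn_accuracy(k, train_idx, val_idx):
    """Fit-free k-NN: classify each validation row by majority vote of its k nearest training rows."""
    correct = 0
    for v in val_idx:
        nearest = sorted(train_idx, key=lambda t: abs(xs[t] - xs[v]))[:k]
        vote = 1 if sum(ys[t] for t in nearest) * 2 > k else 0
        correct += vote == ys[v]
    return correct / len(val_idx)

best_k, cv_scores = grid_search([1, 5, 15], knn_accuracy, n_rows=len(xs))
print(best_k, cv_scores)
```

The selected value would then be stored and reused for later resamples, mirroring the approach of tuning once on the first resample's training data. Candidates must be hashable; a combination of several hyperparameters can be passed as a tuple.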
Performance measurements
The performance of the different models, including the audit filters, was assessed and compared in terms of sensitivity, specificity, discrimination, and calibration in the test set. Discrimination was measured using the area under the receiver operating characteristics curve (AUC) and calibration was measured using the integrated calibration index (ICI) (34). ICI was not calculated for the audit filter system because it cannot output class probabilities. For each model, we chose the class probability cutoff that produced the sensitivity closest to that of the audit filters, using each resample’s test data. We compared performance between models by calculating differences for each performance metric and model combination. We estimated 95% confidence intervals for all performance metrics, as well as for the differences, using the 1000 resamples.
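The cutoff selection and the discrimination measure can be sketched as follows. This Python example is illustrative only: the predicted probabilities and labels are invented, and breaking ties between equally close cutoffs in favour of higher specificity is an assumption, since the text does not state how ties were handled.

```python
def sensitivity_specificity(probs, labels, cutoff):
    """Classify as positive when the predicted probability is >= cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def cutoff_matching_sensitivity(probs, labels, target_sensitivity):
    """Pick the cutoff whose sensitivity is closest to the target.

    Ties are broken in favour of higher specificity (an assumption).
    """
    def distance(cutoff):
        sens, spec = sensitivity_specificity(probs, labels, cutoff)
        return (abs(sens - target_sensitivity), -spec)
    return min(sorted(set(probs)), key=distance)

def auc(probs, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented predictions for eight patients (1 = OFI present)
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

cutoff = cutoff_matching_sensitivity(probs, labels, target_sensitivity=0.75)
print(cutoff, sensitivity_specificity(probs, labels, cutoff), auc(probs, labels))
# -> 0.6 (0.75, 0.75) 0.8125
```

In the study the target sensitivity was that of the audit filters, so the models are compared at a similar sensitivity and differ mainly in specificity.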
Feature importance
We calculated the feature importance for all models on the first resample by using permutation feature importance (35) on the test set. Four predictors that only indicated that another predictor had not been recorded were merged into that predictor. The importance of a feature was calculated by shuffling the feature’s data 5 times, taking the average AUC across the shuffles, and comparing it to the model’s AUC on non-shuffled data.
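A minimal Python sketch of permutation feature importance, as described above, is shown below. The toy model, the feature names `iss` and `noise`, and the data are hypothetical and serve only to show the mechanics; importance is reported here as the baseline AUC minus the mean AUC over five shuffles.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, rows, labels, feature, n_repeats=5, seed=0):
    """Drop in AUC when one feature's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = auc([predict(r) for r in rows], labels)
    shuffled_aucs = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [{**r, feature: v} for r, v in zip(rows, column)]
        shuffled_aucs.append(auc([predict(r) for r in permuted], labels))
    return baseline - mean(shuffled_aucs)

# Hypothetical test set and a toy model that only looks at `iss`
rows = [
    {"iss": 4, "noise": 3}, {"iss": 9, "noise": 1}, {"iss": 16, "noise": 2},
    {"iss": 25, "noise": 3}, {"iss": 34, "noise": 1}, {"iss": 41, "noise": 2},
]
labels = [0, 0, 0, 1, 1, 1]

def toy_model(row):
    return row["iss"] / 75

iss_importance = permutation_importance(toy_model, rows, labels, "iss")
noise_importance = permutation_importance(toy_model, rows, labels, "noise")
print(iss_importance, noise_importance)
```

A feature the model never uses, such as `noise` here, has an importance of exactly zero because shuffling it leaves every prediction unchanged.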
Code availability
The code used in this study is publicly available online at https://github.com/JonteAtter/Predicting-OFI-in-trauma under the MIT License.
4 Results
4.1 Participants
Out of the 11864 patients included in the Karolinska trauma registry, 6310 patients had been reviewed regarding the presence of OFI using a combination of audit filters and individual review. This system selected 1316 patients to the morbidity conference and 596 patients to the mortality conference (Figure 1). A total of 35 out of the 596 deaths reviewed at the mortality conference were considered preventable (n=4) or possibly preventable (n=31), rendering OFI, whereas 561 were non-preventable and without any identified OFI. Out of the 5714 alive patients, 1316 were selected for inclusion in a morbidity conference via the use of audit filters, individual review and an unknown process (n=2); OFI was identified in 396 (7%) patient cases.
Figure 1. A flowchart describing the exclusions made and the process of trauma patient cases from arrival until opportunity for improvement decision. OFI = Opportunity for improvement. Survival/death defined as death by day 30.
Of the 6310 included patients, most were male (n=4383, 69%); the mean age was 45 years (SD 21) and overall mortality was 599 (9%). The most common highest level of care was a general ward (36%).
Among the 431 patients with OFI compared to the 5879 patients without, the mean age was slightly higher (mean 48 vs 45 years old) and most patients were treated at an intensive care unit (35%) compared to a general ward (37%). The mean ISS was also higher (mean: 19, SD: 11 vs mean: 12, SD: 13), as was the frequency of in-hospital intubation (n=74 [17%] vs n=460 [8%]). Patients with OFI had longer times to definitive treatment compared to patients without OFI (median: 143 vs 99 minutes from hospital arrival). The definitive treatment also differed: patients with OFI had more interventions than those without OFI, with the biggest difference being radiological interventions (7% vs 1%). See Table 1 for details of selected patient characteristics.
Table 1. Demographic and clinical characteristics of patients screened for OFI.
The variable “Time to normal BE” had the highest frequency of missing data (n=6107) followed by “Pre-hospital Intubation type” (n=5812) and “Emergency department intubation type” (n=5749). For complete details see Table E1 in the online supplement.
The 6310 patients were randomly split into a training set (n=5048, 80%) and a test set (n=1262, 20%) 1000 times. The first resample’s training set had 364 (7%) OFI compared to its test set, which had 67 (5%) OFI. The mortality was 10% in the training set compared to 10% in the test set and the percentage of patients treated in the intensive care unit was 21% in the training set compared to 24% in the test set. This first resample was used for hyperparameter optimization and feature importance calculations. For further characteristics regarding the training and test set for the first resample see Table E2 in the online data supplement.
4.2 Model Specification
Figure 2 shows the calculated, model agnostic, permuted feature importance for the first resample. Overall, “Emergency procedure” was the most important predictor followed by “ISS.” The highest feature importance for a single model was “ISS,” accounting for 7% of model LR performance.
Figure 2. The calculated, model agnostic, permuted feature importance for the first resample using AUC as the scoring metric. Overall, “Emergency procedure” was the most important predictor followed by “Injury Severity Score.” The highest feature importance for a single model was “Injury Severity Score,” accounting for 7% of model logistic regression performance. PH = Pre-hospital; ED = Emergency department; GCS = Glasgow Coma Scale; GOS = Glasgow Outcome Scale; RF = random forest; SVM = Support vector machine; LR = logistic regression; DT = decision tree; KNN = k-nearest neighbors; XGB = XGBoost; LGB = LightGBM; CAT = CatBoost.
4.3 Model Performance
We used AUC as an overall performance measure, and all ML models outperformed the audit filters (AUC: 0.624, sens: 0.932, spec: 0.316). The highest AUC was found in LightGBM (AUC: 0.789, sens: 0.932, spec: 0.42, ICI: 0.036) followed by RF (AUC: 0.788, sens: 0.932, spec: 0.422, ICI: 0.028). For details regarding each model’s receiver operating characteristic curve see Figure 3. Specific sensitivity and specificity values based on the class probability cutoff with the closest sensitivity to the audit filters were also calculated. Equal or greater sensitivity could be achieved among all models except k-NN (AUC: 0.726, sens: 0.862, spec: 0.482, ICI: 0.241) and superior specificity was found in all ML models. See Table 2 for all performance measures including corresponding false negative and false positive rates for all models as well as their calculated confidence intervals.
Table 2. ML models’ and audit filters’ predictive performance for OFI using AUC, accuracy, ICI, and sensitivity.
Figure 3. Receiver operating characteristic curves for tested models on predicting OFI. Calculated over all resamples. OFI = Opportunity for Improvement; RF = random forest; SVM = Support vector machine; LR = logistic regression; DT = decision tree; k-NN = k-nearest neighbors.
The calculated differences (delta values) in performance between each model are shown in Tables E3-E6 in the online data supplement. The largest point estimate differences were calculated between LightGBM and the audit filters for AUC: 0.165 (0.164 to 0.167), XGBoost and LR for ICI: -0.267 (-0.268 to -0.266), SVM and k-NN for sens: NA, and between k-NN and audit filters for spec: -0.069 (-0.072 to -0.067). All CIs were narrow, with widths of around one to a few per mille.
5 Discussion
Our results highlight the possibility of incorporating machine learning into modern trauma care quality programs in the hope of identifying and addressing OFI. All developed models outperformed the audit filters in terms of AUC. No single ML model performed best on all metrics, but LightGBM showed the highest AUC, followed by random forest. All ML models except k-NN could achieve equal or greater sensitivity, and all ML models showed greater specificity. The methodology uses state-of-the-art machine learning that is largely unexplored in a medical context, adding to a broader general scientific knowledge reaching outside this project.
In our study all ML learners showed relatively similar results; however, newer models showed a small, but significant, performance boost compared to more established learners such as k-NN, SVM or LR. A systematic review from 2022 on the use of ML for other trauma related predictions found a similar trend (18). It identified 25 studies comparing the performance of LR to other learners, in which performance was similar in twelve studies, better for non-LR learners in ten, and worse in three.
5.1 Limitations
Opportunity for improvement, while defined as a binary variable, includes a diverse set of outcomes ranging from preventable deaths to bad documentation. The heterogeneity of these outcomes represents a range of clinical events, each probably correlating to different predictors. In addition, machine learning models struggle with rare events, and despite being an aggregate of all previously identified errors, the OFI frequency is only 6.83%. Hence, OFI is a considerable predictive challenge. Future research should focus on identifying potential subgroups of OFI that prove to be relatively hard to predict and adjust accordingly.
The current screening system for OFI might also introduce bias since the filters would favor the identification of some, but not all, errors. This could skew the training data and subsequent models, creating a self-reinforcing cycle in future identifications of OFIs. Fortunately, the current system also allows for patient selection without triggering audit filters, reducing this issue. This leakage of OFIs is necessary in order to identify a heterogeneous outcome such as OFI. It is important to note that these models are highly dependent on the training data presented to them. If they were implemented in practice, without individual review and concurrent conference, another form of “leakage” would be needed in order to predict future unknown OFIs.
The selection system at the Karolinska University Hospital is also more likely to identify severe clinical outcomes, including preventable death, and it is also probable that complex patient cases with multiple interventions or longer hospital stays are selected for review at a higher rate. This theory is supported by the fact that “ISS” and “Days in hospital” rank in the higher range of overall feature importance. This creates a possible selection bias where the measured OFI may not be representative of all errors or a true indicator of overall trauma care quality, but rather a combined measure of injury severity, previous patient morbidity, and adverse clinical outcomes. It is also likely that the nature of OFI changes over time, partly from general advances in trauma care and partly as a consequence of corrective actions originating from the peer review process itself.
5.2 Interpretation and Implications
While previous research has successfully applied some of the learners used in our study to predict mortality and other more homogeneous outcomes (18), the heterogeneous nature of OFI presents a more difficult prediction challenge. It is possible that higher quality data with fewer missing values and higher resolution (e.g., vital sign series instead of single values), additional predictors, and complex machine learning systems such as ensembles are needed for more precise and better performing models. An ever-changing outcome is also a problem: if future OFIs, at least in theory, differ from past ones, training on older data becomes problematic. However, these models are built around basic and easy-to-record predictors that all exist in registries following the Utstein template (Uniform Reporting of Data following Major Trauma), allowing for easy application in other settings and systems. Demanding an unreasonable amount and quality of data would therefore come at the cost of reduced external validity and feasibility in broader settings.
It is important to note that a far from perfect performance is to be expected with a heterogeneous outcome such as OFI, and comparing these models to entire systems using a combination of quantitative screening and several human reviews, including a multidisciplinary review, is unfair and was never the goal. Instead, we strive to facilitate human efforts and this is the first time a viable alternative to audit filters is presented. Applying these models could possibly automate part of the resource intensive individual review that is needed when using audit filters allowing for improved resource allocation and optimization. The use of permuted feature importance also visualizes and highlights potentially new clinical patterns correlated to OFI. The models are also based on novel methodology and state-of-the-art machine learning adding to a broader increasing need reaching outside trauma care quality improvement. Hence, the positives of these models should not be understated.
While future research is needed on how to optimally implement and increase performance of these models, our results highlight the possibility of incorporating machine learning into modern trauma care quality programs in hope of identifying and addressing opportunities for improvement among trauma patients.
Data Availability
The data that support the findings of this study are available following the approval of a project suggesting to use the data by the Swedish Ethical Review Authority and the appropriate bodies at the Karolinska University Hospital. More information is available on request from the corresponding author, J. Attergrim.
6 Acknowledgments
The authors thank Liselott Västerbo for her part in collecting and recording data and screening for OFI. They also thank all professionals taking part in the monthly mortality and morbidity conference.
Footnotes
“This article has an online data supplement, which is accessible from this issue’s table of content online at www.atsjournals.org”
Sources: Supported by the Swedish Society of Medicine, grant number SLS-973387, and by “The Swedish Carnegie Hero Fund.” Parts of the results were presented orally and as an abstract at the London Trauma Conference.
Descriptor number: 4.6 ICU Management/Outcome