Abstract
Antimicrobial resistance is a growing threat to global health, leading to ineffective treatment of infection and increasing treatment failure, mortality, and healthcare costs. Inappropriate antibiotic therapy is often administered in the Intensive Care Unit (ICU) due to the urgency of treatment, but can lead to poor patient outcomes. In this study, we developed a machine learning model that predicts the appropriateness of antibiotic treatments for ICU inpatients with ICU-acquired blood infection. We analyzed data from electronic medical records (EMRs), including demographics, administered drugs, previous microbiological cultures, invasive procedures, lab measurements and vital signs. Since EMRs have high rates of missing values and since our cohort is relatively small and imbalanced, we introduced novel computational methods to address these issues. The final model achieved an AUROC of 82.8% and an AUPR of 60.6% on the training set and an AUROC score of 77.3% and an AUPR score of 40.4% on the validation set. Our study shows the potential of machine learning models for inappropriate antibiotic treatment prediction.
Introduction
Infectious diseases are considered one of the major health risks worldwide1. Although the development of antimicrobial drugs has transformed the treatment of bacterial infections, the massive increase in antibiotic consumption has led to the emergence of bacterial resistance, thus reducing antibiotics efficacy2–4. Consequently, both the Centers for Disease Control and Prevention and World Health Organization declared antibiotic resistance as a threat to human health5,6, and have created guidelines for appropriate antibiotic administration3,7,8.
Nowadays, culture incubation is the golden standard for bacterial pathogen assessment. The process takes 48 to 72 hours. Typically, a gram stain is completed after 24 hours, organism identification is obtained after another 24 to 48 hours, and antimicrobial susceptibility testing (AST) profile is received after 72 hours9,10. However, since early antibiotic intervention is a critical determinant of patients’ survival, patients are often treated with empiric antibiotic therapy, where antibiotics are administered prior to the receipt of blood culture and AST results. This treatment is based on the clinician’s preliminary evaluation of the patient’s health state, infection history, and local bacterial resistance patterns11,12. Nevertheless, such treatment might be inappropriate, as the antibiotic administered might not be suitable to the pathogen. In particular, ICU-acquired infections are more likely to be resistant to a broad spectrum of antibiotics13.
Recently, it has been shown that inappropriate antibiotic therapy (IAT) is associated with higher incidence of treatment failure, higher mortality rate, and a prolonged hospital stay, which can also result in higher healthcare cost14,15. Moreover, in severe cases of bloodstream infections and in cases of septic shock, IAT was found to be the most important factor in ICU patients’ outcome16. Therefore, it is essential to develop methods for rapid identification of treatment appropriateness in ICU patients. However, to the best of our knowledge, no machine learning model has been developed for prediction of IAT in ICU patients with hospital acquired infection.
In this study, we developed a machine-learning model that predicts the appropriateness of antibiotic empirical treatments based on electronic medical records (EMRs) of ICU patients with hospital acquired infection. Our prediction is made 24 hours after the blood culture was taken and thus approximately 24 hours after the empiric antibiotic has already been administered17. Unlike previous models that tried to make the prediction at the time of culture collection, we assume that at the 24h point the patient’s measurements such as lab measurements and vital signs are already affected by the antibiotic intervention and can give indication whether the antibiotic treatment was appropriate.
In the process of method development, we also devised novel computational methods and a flexible pipeline to deal with challenges that often arise when dealing with EMR data, such as missing values and imbalanced data. The methods are described in detail and can be adopted for other models that use EMRs.
Results
Cohort Description
We used MIMIC-III, an open-access, anonymized database of EMRs of ICU patients, to develop, validate, and test our model. Data from 53,423 distinct ICU stays of adult patients admitted to Beth Israel Deaconess Medical Center (Boston, MA, USA) between 2001 and 2012 are included in the database.18 The dataset contains for each patient stay time-independent (static) features, such as age, gender, ethnicity, weight, height, and a large variety of time-dependent (dynamic) features that are measured during hospitalization, including vital signs, lab measurements, and drug administrations. We used 55 continuous features (Table 1), 7 drug features (Supplementary Table 1) that were created by aggregating 242 drugs into 11 drug categories (Supplementary Table 2), and 39 categorical features (Supplementary Table 3). For all features, only values recorded before the prediction time (henceforth abbreviated as PT), set to 24 hours after the time the blood culture was taken (abbreviated as BCT), were considered. We considered for our cohort only patients with suspected hospital acquired infection (Figure 1A-B). Overall, the training set consisted of 105 patients divided into two classes. The inappropriate treatment group, defined as those who received antibiotic treatment to which the pathogen was resistant, consisted of 22 patients. The remaining 83 patients received antibiotic treatment to which the pathogen was sensitive and therefore were included in the appropriate treatment group.
Administered Antibiotics Analysis
Analysis of blood culture results and the drugs administered to the patients in our cohort revealed that Coagulase-positive Staphylococcus aureus was the most common pathogen, detected in 51% of the patients (69/135). The most common antibiotics administered to patients with that organism were vancomycin (50.7%, 35/69) and levofloxacin (23%, 16/69). Overall, the most common antibiotic administered to patients was vancomycin (57%, 77/135, Supplementary Figure 1).
Furthermore, the AST results of those blood cultures revealed the most common pairing of an organism and the antibiotic tested on it. Of the antibiotics tested on Coagulase-positive Staphylococcus aureus, Gentamicin had 43 cultures (1/43 had a resistant outcome), Oxacillin had 43 (23/43 resistant), and Levofloxacin had 42 (25/42 resistant). The most resistant bacteria was Enterococcus faecium, which was resistant for at least one type of antibiotic 73% of the times it was observed (38/52), and the antibiotic that had the highest incidence of resistance was erythromycin, which experienced resistance 70.8% of the times (34/48) (Supplementary Figure 2).
Model’s Pipeline
In order to develop a robust model that will address the characteristics of our prediction objective, we constructed an extensive pipeline comprised of several steps (Figure 1C), and in each step we evaluated a few alternative techniques. We tested each combination of techniques using five iterations of stratified 5-fold cross-validation over the training set and chose the combination that yielded the highest mean AUPR. Below we describe each step briefly. Full details are provided in the Methods section.
The first step in the pipeline is the removal of values that were deemed outliers. We first excluded values that were not in the human range and then removed values based on two different metrics. Afterwards, we filtered out features with missing rate ≥ 30% and removed features with variance ≤ 0.005.
In the next step, we created time-series features utilizing all data points available before PT. We calculated these features using two sets of timeframes, d and d + 2 days before PT. Missing values in each timeframe were imputed using a linear regression model that was fitted per subject using all the feature values recorded within a larger time-frame, see Data Imputation in Methods. For these features, we evaluated different thresholds for the minimum number of values that are required for the fitting of a linear regression model (n) and we evaluated several timeframes (d = 2 and 3).
The next step was the normalization of the features. We evaluated two approaches: Min-Max scaling and standardization. Then we added a second imputation step to handle missing values that could not be imputed by the linear regression. We used K-Nearest Neighbors (KNN) algorithm19 with k = 5 and tested several distance measures such as Sklearn’s distance method (an Euclidean distance that accounts for missing coordinates), and two new distance measures.
As many of the features were highly correlated, particularly after the addition of the time-series features, we applied two steps of detecting and filtering correlated features. First, we kept only a small number of features derived from the same raw measurement by selecting those with the most significant p-value according to a t-test between the two classes. We tried several numbers of features. Following this step, we filtered highly correlated features based on hierarchical clustering.
After the removal of correlated features, we still had a high-dimensional feature space. Hence, we examined several feature selection methods: (a) Recursive Feature Elimination 20, (b) Taking the features with an importance score higher than the model’s mean feature score (e.g., in the logistic regression model, taking the mean beta coefficient), (c) Taking the K features with the highest mutual information score21 and (d) Taking the K features with highest SHAP values22. We also tested combinations of these four methods and several possible values of K.
Additionally, since our data was imbalanced (roughly 3/4 appropriate and 1/4 inappropriate) we tested the following approaches for oversampling: ADASYN23, SMOTENC24, and BorderlineSMOTE25 with different balancing ratios, and also developed a novel ensemble method for data balancing which we named ‘DataEnsemble’.
The last step was the prediction model selection. Here we evaluated eight different machine learning models: Random Forest26, AdaBoost27, Logistic Regression28, SVM29, SGDclassifier21, LightGBM30, Sklearn’s Gradient Boosting Classifier21,31 and Xgboost32.
Every combination of techniques applied in each step above was tested in iterated cross validation. The final prediction model chosen was Random Forest DataEnsemble. See Methods for the combination and the parameter values chosen.
Appropriate Antibiotic Treatment Model
Our model aimed to predict the risk of administering an inappropriate antibiotic treatment to an ICU inpatient. In order to select the optimal model, we used five iterations of stratified 5-fold cross-validation over the training set. In each iteration, we evaluated the model using the mean area under the receiver-operator characteristics curve (AUROC) and area under the precision-recall curve (AUPR) over all five folds. We then averaged these metrics over the five iterations. Each model was evaluated with and without a novel training approach using balanced cohort (‘DataEnsemble’, see Methods). The Random Forest DataEnsemble model had the best performance (Figure 2 Error! Reference source not found.) with an AUROC of 82.76±1.46% and an AUPR of 60.61±3.76% on the training set (Figure 3). Notably, for seven out of the eight models the DataEnsemble received better median AUPR scores compared to the original model. Thus, Random Forest DataEnsemble was chosen as the final prediction model.
Validation
We retrained the Random-Forest model with the selected parameters on the entire training set and applied it on the validation set (Table 2, Figure 4A-B). A good balance was achieved when using a classification threshold of 0.45 (i.e., classifying all samples with risk score ≥ 0.45 as positive). For that threshold the model achieved a positive predictive value (PPV) of 50%, negative predictive value (NPV) of 86%, sensitivity of 62% and specificity of 82%. In addition, the model achieved an AUROC score of 77.3% and an AUPR score of 40.4%. Those values were lower than those obtained on the training set, however, it is to be expected that a predictor’s performance will be reduced when validated against new samples.
Next, we assessed the significance of utilizing data obtained after administration of antibiotics to the patient (i.e, the time between BCT to PT). To accomplish this, the same Random Forest model DataEnsemble was applied using only data obtained prior to BCT. The resulting model exhibited poor performance, achieving AUROC 58% and AUPR 29.8%, which mirrors the proportion of positive samples in the validation set at 26.67%. Subsequently, the pipeline parameters were optimized for the best mean AUPR on the training set and the model was evaluated on the validation set. The results were similar, yielding AUROC 55.1% and AUPR 27.7%. These findings show the importance of using data from the period following the drug administration to the patient. It is evident that training the model solely on pre-culture data without also using the data after the drug intervention results in near-random predictions, as the model lacks sufficient informative values.
We also wished to assess the contribution of data obtained before antibiotic administration to the prediction. For this goal, we applied the same Random Forest DataEnsemble model using only data obtained after the blood culture was taken (i.e., from BCT to PT). Again, the resulting model exhibited poor performance, achieving AUROC 61.4% and AUPR 31%. Optimizing the pipeline parameters for the best mean AUPR on the training set and evaluating the model on the validation set resulted in similar performance, yielding AUROC 60.2% and AUPR 30.1%. Hence, relying solely on post-culture or pre-culture data for training the model leads to poor predictions, and incorporating information both from before and after drug administration greatly improves prediction quality.
Feature Importance
Analysis of the features created for our model showed that none of the raw lab measurements and vital signs measurement were significant discriminators. However, previous cultures and especially resistant cultures were significantly associated with the inappropriate class (Supplementary Table 3). Moreover, the existence of any Ascites lab test is also associated with the inappropriate classification.
The contribution of each feature to the model’s risk score is estimated using SHAP values22 (Figure 4C). Most of the features that had a substantial impact on the model were time-series features of vital signs and lab measurements. The most important features of the model were the difference between maximum and minimum white blood cell count (WBC) measured during the hospitalization, the standard deviation of INR values measured during the entire hospitalization, R2 of a regression model of arterial pH values in the 5-day timeframe before PT, and the mean corpuscular hemoglobin (MCH) measured 12 hours before PT. Although total WBC count is a common laboratory marker for identifying patients with high risk for bacterial infection (BI), studies have shown that WBC count had only minor discriminatory power in identifying patients with BI33–35.
Discussion
Approximately 70% of patients admitted to the ICU receive antibiotic treatment36. However, the percentage of patients who do not receive adequate therapy within the first 24 hours of a bloodstream infection (BSI) is alarmingly high, reaching 47%37. On the other hand, ill-advised and excessive antibiotic use can contribute to the global antibiotic resistance problem4. In this study, we propose a machine learning algorithm to predict inappropriate empiric antibiotic treatment in patients with ICU-acquired bacteremia. Previous research has focused on utilization of machine learning models for early prediction of ICU-acquired BSI38, outcomes of BSI39, and antibiotic resistance in BSI40,41 and urinary tract infections42,43. Studies39,40 predicted antibiotic susceptibility by creating a specific model for each antibiotic type. In contrast, the problem of antibiotic treatment appropriateness is not concerned with the resistance to each type of antibiotic, but evaluates whether the treatment administered was effective by assessing the patient’s response to it. Due to the limited size of the available cohort, the model described in this study was not specifically trained for individual antibiotic types. Consequently, we developed one general model for predicting the appropriateness of the antibiotic treatment, without the need to specify which antibiotic was administered to the patient. The purpose of this model is to discern the physiological response to an appropriate antibiotic treatment from that of an inappropriate treatment. To the best of our knowledge, no prior studies have addressed the problem of determining the appropriateness of antibiotic treatment.
Our algorithm demonstrated promising performance both in cross-validation and in validation of an independent sample of patients (with AUROC scores of 82.76% and 77.27%, and AUPR scores of 60.61%, and 40.44%, respectively). These results suggest that with the use of readily accessible EMR data, it is possible to predict the appropriateness of an antibiotic treatment 48 hours before the full antibiogram results are available and assist in the clinical assessment of the patient. The substantial reduction in mismatched treatment facilitated by machine learning-based recommendations that take into account the patient’s medical history and records can pave the way for a future framework in which clinicians will routinely consult such algorithms and adjust the antibiotic treatment of patients accordingly. Adoption of the model in clinical practice could lead to a machine learning-guided personalized antibiotic prescription and help reduce treatment failure and overall use of antibiotics, contributing to the global effort to combat antibiotic resistance.
The models’ prediction was primarily driven by the patterns in the time-series features, such as the difference in the median measurement of WBC collected in the 5-day and 3-day time-frames prior to PT (Figure 4C). Some of these features were previously studied in relation to BI and were recognized as significantly associated with it. However, the temporal behavior of most of these features was not checked in relation to BI. Moreover, to the best of our knowledge, no study identified clinical measurements that are most relevant to predicting antibiotic treatment appropriateness. Notably, no raw measurement by itself was statistically significant for discriminating between appropriate and inappropriate treatments. Therefore, examining the features selected by our machine learning model can provide valuable insight into such discrimination. By identifying the predictors with the highest impact on the model’s outcome, doctors can focus on those lab measurements and vital signs.
In addition to time series features, our model utilized known risk factors for antibiotic resistant infections as features, such as previous antibiotic resistant infections, antibiotic that were previously administered, invasive procedures and culture sample sites44–46. Many of these features were also shown as predictive for antibiotic resistance in machine learning models that used EMR data42,47,48.
In this study, we chose to set PT to 24 hours after the blood culture was taken, as the results of the gram-staining are typically retrieved at this time9, and thus at that time clinicians could make adjustments to the patient’s antibiotic treatment. Providing additional information at this time can improve decision-making by the doctors, potentially affecting the patient’s outcome. Moreover, our cohort included only ICU inpatients with microbiological confirmation of a bacterial infection. However, studies have shown that only about 19.5% of inpatients with bacteremia have a positive blood culture49. Therefore, it is plausible that many of the patients with bacteremia will potentially not be considered for our model. Developing a model where PT is instead set to when gram stain results are already available, and the microbiological confirmation of the bacteria exists, can increase the percentage of relevant inpatients considered by the model. Furthermore, our findings demonstrate that data collected during the additional 24 hours lead to a significantly better prediction in comparison to a model trained solely on data obtained prior to the blood culture.
Our study has several limitations. First, in our study we filtered out contaminants, while they might be considered eligible cultures for our prediction since no information is provided to classify them as contaminants at the time of gram stain. Additionally, our model was trained and validated on a relatively small dataset from one medical center, and should be tested on data from other medical centers. Finally, conducting a prospective evaluation is necessary to assess the model performance in practical scenarios.
It is also important to note that the data collected per patient only pertained to the hospitalization during which the blood culture was taken. Future studies could benefit from incorporating the complete medical history and previous hospitalization records of a patient42. In particular, the use of data on previous cultures can enhance the model’s predictive ability, as previous instances of recurrent infections are associated with a higher risk of resistant infection in subsequent hospitalizations50.
In addition to the contribution to predicting antibiotic resistance, this study also proposes a new pipeline for medical decision support. It outlines techniques to address challenges commonly encountered in EMRs, such as limited and imbalanced datasets and high rates of missing values. The key methods described here can serve as starting points for such an approach, but the specific model, parameters, and feature extraction process should be tailored to the medical question and the data.
Methods
Inclusion and Exclusion Criteria
All patients admitted directly to the emergency department or ICU who had blood cultures that were not contaminated (i.e., blood culture results of Coagulase-negative Staphylococcus, Diphtheroids, Bacillus, Aerococcus viridans, Aerococcus, Propionibacterium, Viridans streptococci, Lactobacillus, and Staphylococcus epidermidis) or were not canceled during their hospitalizations were considered for this study’s cohort. Out of those, to identify patients with hospital acquired infection we included only patients who satisfied at least one of the following conditions: (a) they were hospitalized for at least 48 hours in the ICU and had their first blood culture in the ICU collected there after that time. Only the first culture collected in the ICU was used for labeling. (b) their first culture in the entire hospitalization was collected in the ICU and at least 24 hours after hospital admission (Error! Reference source not found.).
The Cohort
We used the MIMIC-III database, containing data of 38,597 distinct adult patients18. Our exclusion criteria resulted in a total of 135 patients, who were split into training and validation sets. Our training set included EMRs of 105 inpatients, of whom 83 received appropriate antibiotic treatment and 22 received inappropriate antibiotic treatment. The validation set included 30 inpatients of whom 22 received appropriate treatment and 8 received inappropriate treatment.
Outcome Definition
Microbiological cultures are routinely drawn in ICU. We defined the blood culture time (BCT) as the time of the culture sampling, and PT as 24 hours after culture time. Only records charted before PT were used by the model.
Antibiogram results are usually available within 72 hours of culture sampling9, so prediction after 24 hours may allow the physician to reconsider the antibiotic empirical treatment 48 hours before the antibiogram results. The 24-hour window enables one to obtain features that help assess the response of different clinical measures to the empiric antibiotic treatment (for example, the ratio between white blood cells levels before and after the empiric antibiotic treatment).
Patient treatments were designated as appropriate (negative class) or inappropriate (positive class) treatment based on the results of the culture, AST and the empirical antibiotic that was administered. The inappropriate class was defined as an antibiotic treatment where the pathogen was either not affected by the antibiotic or resistant to it. Appropriateness was decided by an internal medicine specialist and an infectious disease specialist who reviewed together the antibiotics administered and the antibiogram results for each patient.
Outlier Removal
Inhuman Values
To eliminate measurements that were grossly incorrect due to manual typos or technical errors, we manually defined with clinicians a range of possible values per each feature (including pathological values), and excluded values outside this range. A total of 711 values (0.4% of the values of all features) were excluded in this step, see Supplementary Table 4.
Extreme Values
For the remaining values, we checked two approaches to removing extreme measurements. Both of these methods were calculated on the training set, and were later applied on the validation set. The first method used the IQR. Denote by q0.75 (q0.25) the value at the 75th (25th) percentile and set qdiff = 1.5 × (q0.75 − q0.25). Then only values in the range q0.25 − qdiff < x < q0.75 + qdiff were kept.
The second approach used Z-scores, filtering out values that are more than two standard deviations from the mean of the feature.
We analyzed the percentage of values that were removed after applying both methods. Z-score discarded a mean of 4.12% of the feature values and a median of 3.69%, while IQR removed a mean of 5.44% and a median of 3.67% (Supplementary Table 4). Following these results, we used the Z-score approach.
Normalization
We evaluated two approaches for feature normalization. The first is the normalization of all features to values between 0 and 1 according to the maximum (Xmax) and minimum (Xmin) values of each feature in the training data set.
The second was standardization to a normal distribution with a mean of zero and a standard deviation equal to one.
Both normalizations were fitted on the training set data, and later applied on the validation set as well. The normalization method chosen was standardization as it yielded better results.
Feature Engineering
The features created for the model are composed of six main categories: (1) patient demographics, (2) lab measurements, (3) vital signs, (4) drug administration, (5) previous lab cultures, (6) medical procedures. We tested removal of features with high missing rates, for rates 20%, 30%, 40%, and 50%, and chose to exclude features with missing rate > 30%. Moreover, after the feature engineering process (see below), features with variance < 0.005 were excluded as well.
Demographics
The demographic features included, among others, age, gender, and ethnicity, as well as time since admission to the hospital and to the ICU, and measurements such as weight and BMI.
Lab measurements and vital signs
We used as features the median, standard deviation, minimal value (min), maximal value (max), and their difference (min-max diff) per each time-frame described above. See Supplementary text for more details.
Drugs
We mapped all the drugs into 11 clinically relevant groups (Supplementary Table 2) with the help of a general physician. For each drug group, and each of the time-frames described above, we collected the total number of drugs from the group that the patient received.
Cultures
We extracted binary features indicating the properties of previous culture taken from the patient, when available. See Supplementary text for more details.
Medical procedures
Finally, we added binary features for four categories of invasive procedures that frequently cause infection: Arterial Line, Catheter, Ventilation, and Tubes (Supplementary Table 5), and indicated if the patient has undergone a procedure from each category.
The mean time from the first lab or vital sign measurement to PT was 7.15±4.65 days (Supplementary Figure 3). Therefore, we generated time-series features for lab measurements, vital signs, and drugs for two time-frames: d and d + 2 days before PT. We tested d = 3 and 4, and 3 yielded better results. Additionally, for lab measurements and vital signs, we also used a time-frame of the entire hospitalization period up to PT. See Supplementary text for more details.
Data Imputation
Missing values were observed mainly in lab measurements and vital signs. For repeatedly measured values, a linear regression model was fitted (see ‘Feature engineering’). We imputed the missing values of features in a certain timeframe based on those linear regression models. This strategy assumed that missing values are more accurately imputed using patient-specific measurements rather than values of all patients. Regression was performed per 3 or 5-day time-frame. If a patient was missing max, min, median, or min-max-diff time features in a certain time-frame, we extended the time-frame used to impute these values to 5 and 10 days, respectively. Moreover, the feature value 12 hours before PT was imputed using the 3-day linear regression, and if a regression model was not available for this time-frame, 5-day linear regression was used. Since large regression coefficients can lead to extreme imputed values, all the values produced by this extrapolation method underwent extreme and non-human values removal (Figure 5).
The rest of the time-series features, other continuous features (e.g. last lab measurement recorded), and instances where there were not enough values for the fitting of a linear regression model (see ‘Feature Engineering‘), were imputed based on the KNN algorithm19 with k = 5. In order to prevent vectors with high missing rate from being considered “closer” to all the other vectors, we developed two distance methods in addition to Sklearn’s weighted distance metric and evaluated them to choose the best one (see Supplementary Information).
Removal of Correlated Features
The creation of multiple time-series features in different time frames, as well as the collection of a variety of lab measurements and vital signs that reflect the same trends in patients’ medical condition, created feature redundancy. Two different methods were developed to deal with this problem, using clustering.
In the first method we clustered features based on the type of original measurement they were derived from (e.g., all time-series features derived from heart rate measurements) and filtered only nkeep features from each cluster that had the best p-value for association with the target (nkeep ∈ [1,2,3,4,5]).
In the second method we filtered out features with high correlation to other features. A correlation matrix C of all the features was created and transformed into a distance matrix Mij = 1 − Cij. This matrix M was then used for hierarchical clustering in which the final clusters were formed such that no two features in the cluster had a cophenetic distance greater than 1 minus a correlation threshold. The correlation thresholds 0.55, 0.6, 0.65, 0.7, 0.75, 0.8 were tested and 0.7 was chosen. Out of each cluster, only the feature with the best p-value for association with the target was kept. After comparing the effect of those parameters on the model’s performance, we kept only one feature per each of the raw features (nkeep = 1).
Feature Selection
Four methods of feature selection were evaluated. The first method is Recursive Feature Elimination with Cross Validation (RFECV) 20,21. The second method utilizes the model’s default feature importance method and selects only features with importance higher than the mean importance of all features. The third method is filtration of K features with the best Shap values22. The fourth method selects the K features with the highest mutual information score with the target. The K values 20, 25, 30, 35, 40, 45, 50 were evaluated for those two latter methods, and the best value K = 45 was selected.
In order to increase the robustness of feature selection, we summarized the results of the four feature selection methods tested in two ways. First, we checked the union of all the features selected by the methods. Second, we chose only the features that were selected by at least two of the four methods. The union of the features selected by all four methods using yielded the best results.
Model Development
Data Balancing
Three different methods for oversampling were tested. The first two methods, ADASYN23, and BorderlineSMOTE25 generate synthetic data mostly based on the “most difficult” samples for learning, and they both assume that all features are continuous. The third method, SMOTENC24 distinguishes between continuous and categorical features and samples those features accordingly. Moreover, for each method, we tested different balancing ratios for the inappropriate treatment class, which was the smaller class, taking ratios of 0.3, 0.35, 0.4, 0.45 and 0.5.
Moreover, we developed an ensemble model (DataEnsemble) that is composed of two instances of the same model trained on all positive samples and a different, disjoint subset of negative samples. Therefore, each model in the ensemble is trained on a proportion of 1:2 for positive compared to negative patients. The risk score of this model is the average score of the two models in the ensemble (Figure 6 Error! Reference source not found.). After evaluating all those methods, BorderlineSMOTE with a balancing ratio of 0.3 and utilization of DataEnsemble model were chosen for data balancing.
Another possible way to handle class imbalance is using class weights. Class weights adjust the loss function of the model to penalize the misclassification of the minority more heavily than those of the majority class, thus improving the model’s learning process on the minority class. We evaluated different forms to allocate a high weight to the positive class (Supplementary Table 7), but did not obtain any substantial enhancement in the performance of the model.
Model Selection
In order to choose the best model possible for our data, eight different binary classification models were compared - Random Forest26, AdaBoost27, Logistic Regression28, SVM29, SGDclassifier21, LightGBM30, sklearn’s Gradient Boosting Classifier21,31 and Xgboost32. For each model we created a DataEnsemble model as described above.
Hyperparameter Optimization
After choosing the best model and pipeline parameters using an exhaustive search over the parameter combinations (e.g., data normalization method, see Model’s Pipeline), we used grid search to evaluate the effect of different model hyperparameters (e.g., Random Forest’s max depth) on the results of the model trained on each of the five iterations of 5-fold cross validation on the training set. We tried different parameter combinations (Supplementary Table 6) and chose the combination that yielded the best mean AUPR results.
Data Availability
The MIMIC-III database analyzed in this study is available on PhysioNet repository51.
Code Availability
The code used for data processing and model development is available at https://github.com/Shamir-Lab/ABXAppropriatenessML.
Author Information
These authors contributed equally: Ella Goldschmidt and Ella Rannon.
These authors contributed equally: Dan Coster and Ron Shamir.
Contributions
E.G., E.R., D.C., R.S. conceived and designed the analysis; E.G., E.R. performed the data analysis, model development and model evaluation; E.G., E.R., D.C., R.S. contributed to the study design; A.W., D.B. assisted in the evaluation of the clinical aspects and data interpretation; E.G., E.R., D.C., R.S. wrote the manuscript.
Supplementary Information
Supplementary text
Feature Engineering
For lab measurements and vital signs, if the patient had more than n values of the feature in the relevant time frame, we fitted a linear regression model of the feature’s values over the relevant timeframe and extracted the R2 and coefficient of that model. After evaluating the impact of n = 3,4 and 5 on the model performance, n = 5 was chosen. We also created two features to compare between the values in the time-frames of d + 2 and d days: (1) the ratio between the linear regression coefficient fitted on each time-frame, (2) the difference between the median values in each time-frame. Furthermore, in order to observe the effect of the antibiotic administered to the patient, we added as a feature the ratio of the patient’s first measurement after and before the blood culture is taken.
In addition, for each lab measurement and vital sign, we created a feature of the last value recorded before PT. Since this value can be recorded at any point after the patient was admitted to the hospital, we also created a feature called “12 hours before PT”. This feature is extracted from the time window of 11-13 hours before PT. Values that were not measured between 11 to 13 hours before PT were imputed (see ‘Data Imputation Section’).
Moreover, for each lab measurements and vital sign extracted, we collected the number of measurements recorded divided by the patient’s length of stay until PT. Overall, 24 features were created for each lab test and vital sign. Moreover, an additional feature based on the temperature measurements was created, referring to the proportion of fever measurements (≥ 37.5?) out of all temperature measurements a patient had.
Furthermore, we created a binary feature for each lab measurements and vital signs feature that represents whether the patient’s feature value was imputed (i.e., masking features). In addition, for continuous lab measurements and vital signs, we added a feature that estimates how much a patient’s continuous measurements are irregular by applying Isolation Forest52, a method for anomaly detection.
For culture features we created features for different culture outcomes. The properties included Gram-negative, Gram-positive, the detected organism, specimen site (e.g., blood culture, sputum sample), a pair of specimen site and organism, a pair of antibiotic tested and resistance result (R for resistance, or S for sensitive), and combination of an organism, antibiotic tested and resistance result (e.g. existence of Vancomycin-R-Enterococcus sp.). We also added features for the total number of previous cultures and for total number of cultures that were found to be resistant to any antibiotic. It is important to note that our data contains the time the culture was taken, but not the time the results were retrieved. Therefore, to avoid leakage, we used only information on cultures that were taken three days or more before PT, as it takes up to three days to receive culture results. Since these features were sparse, we filtered out features with < 4% of existing values, keeping 25 features out of the original 197. The list of these 25 features appears below.
List of previous culture features used by our model
‘AMPICILLIN - R (E)’, ‘Culture from MRSA Screen (E)’, ‘Culture from Sputum (E)’, ‘Culture from Swab (E)’, ‘Culture from Urine (E)’, ‘Enterococcus sp. - AMPICILLIN - R (E)’, ‘Enterococcus sp. - PENICILLIN - R (E)’, ‘Enterococcus sp. - VANCOMYCIN - R (E)’, ‘Enterococcus sp. from Swab (E)’, ‘Enterococcus sp. (E)’, ‘GENTAMICIN - S (E)’, ‘LEVOFLOXACIN - R (E)’, ‘OXACILLIN - R (E)’, ‘PENICILLIN - R (E)’, ‘Staphylococcus aureus - OXACILLIN - R (E)’, ‘Staphylococcus aureus from MRSA Screen (E)’, ‘Staphylococcus aureus (E)’, ‘VANCOMYCIN - R (E)’, ‘VANCOMYCIN - S (E)’, ‘Gram Negative (E)’, ‘Gram Positive (E)’, ‘Yeast from Urine (E)’, ‘Yeast (E)’.
R – Resistant culture, S – Sensitive culture, E – existence feature, i.e., whether the patient had that specific result.
Data Imputation
For data imputation, we used KNN and compared three different distance methods. The first metric we used was Sklearn’s weighted distance metric53,54 defined as
The imputed value of the feature l in a patient with feature vector X is , where is the feature vector of its jth nearest neighbor.
In the second distance metric, “Mean Distance Penalty”, we added a penalty to the distance calculation for each feature that is missing in either vector. Define penaltyf as the mean square distance calculated between non missing values of feature f. For efficiency, we used in the computation 10% of the non-missing values of the feature, sampled from evenly-spaced quantiles of the feature. Then define In the third method, named “Normalization by Count of Shared Features”, we normalized the default distance method by the number of not-null feature values shared by the two vectors instead of normalizing by the squared root of this number, as follows: This gives more weight to the number of not-null values than the default method.
We then evaluated the effect on the model performance and the running time of each distance method (Supplementary Table 8). Based on these results, we chose the third distance function.
Acknowledgements
We thank Sarah Amar, MD for helpful inputs. Study supported in part by the Israel Science Foundation (grant No. 3165/19, within the Israel Precision Medicine Partnership program, and grant No. 2206/22) and by the Tel Aviv University Center for AI and Data Science (TAD). EG, ER and DC are supported in part by fellowships from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University. This work was carried out in partial fulfillment of the requirements for the Ph.D. degree of D.C. at the Blavatnik School of Computer Science, Tel Aviv University. All icons used in this paper are designed by Freepik and are available at https://www.flaticon.com/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.