Development of an ensemble machine learning prognostic model to predict 60-day risk of major adverse cardiac events in adults with chest pain

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important to improve resource allocation and reduce over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients that are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS). 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling, and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance to key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients to be below a 0.5% 60-day MACE risk threshold. 
The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: Use of ML algorithms, combined with broad predictor sets, improved MACE risk prediction compared to simpler alternatives, while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information available in other characteristics and combined in nuanced ways via ML.

Effective risk scores will stratify patients across risk levels such that the qualitative "low risk" group will have sufficiently low risk of short-term MACE that those patients can be discharged without additional workup. An ineffective or ill-calibrated risk score would underestimate the risk in the "low risk" group and lead to an overly optimistic early discharge policy that results in increased future MACE. But given multiple risk scores that are well-calibrated, scores with improved discrimination could theoretically result in a larger percentage of low-risk patients.

It remains debated whether machine learning methods can exhibit statistically and substantively significant benefits for risk prediction compared to logistic regression, decision trees, or additive risk scores (Goldstein, Navar, and Carter 2016; Goldstein, Navar, Pencina, et al. 2016). A recent meta-analysis, for example, did not find systematic benefit from machine learning in comparison to logistic regression (Christodoulou et al. 2019). Yet there is also optimism about the potential for artificial intelligence methods in medicine in general (He et al. 2019), as well as in cardiology specifically (Johnson et al. 2018).

Building on Mark et al. 2018, we sought to assess the performance of machine learning (ML) methods at predicting MACE among emergency department patients with chest pain. Could ML improve upon existing validated risk scores through a more complex integration of predictors that can better estimate MACE risk? To what extent is hyperparameter optimization necessary to achieve strong ML performance?
Our clinical objective was to maximize the pool of low-risk patients that are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. The primary threshold of 0.5% risk has previously been identified as an acceptable risk by a majority of emergency physicians for early discharge (Than, Herbert, et al. 2013). Using a risk of 0.5% as the test threshold will inherently lead to a negative predictive value of greater than 99.5%, provided that the risk prediction is well-calibrated in the target population. We also examined secondary thresholds of 1.0% and 2.0%.
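The link between the 0.5% threshold and a negative predictive value above 99.5% can be illustrated with a small sketch. The function name and the toy data below are hypothetical and are not taken from the study; the sketch simply labels patients as low risk when their predicted risk is below the threshold and then measures the share of patients in that group and the observed event-free proportion.

```python
def low_risk_summary(predicted_risk, observed_mace, threshold=0.005):
    """Return (share of patients classified low risk, NPV in that group).

    predicted_risk: predicted probabilities of MACE (0 to 1).
    observed_mace:  observed outcomes coded 1 (event) or 0 (no event).
    """
    low = [y for p, y in zip(predicted_risk, observed_mace) if p < threshold]
    if not low:
        return 0.0, float("nan")
    share_low = len(low) / len(predicted_risk)
    # NPV: proportion of "low risk" patients who did not have an event.
    npv = 1.0 - sum(low) / len(low)
    return share_low, npv


# Toy data (assumed for illustration): 1,000 patients predicted at 0.4% risk
# with 4 observed events, and 1,000 patients predicted at 3% with 30 events.
risks = [0.004] * 1000 + [0.03] * 1000
events = [1] * 4 + [0] * 996 + [1] * 30 + [0] * 970
share, npv = low_risk_summary(risks, events)
```

In this toy case the predictions are calibrated within each group, so the low-risk group's event rate sits below the threshold and the NPV exceeds 99.5%. If the predictions were miscalibrated, that guarantee would no longer hold.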

A reasonable assessment of ML performance could only be made in comparison to realistic alternative options. We compared ML performance to simpler indicators of risk: key biomarkers (troponin, electrocardiogram), validated clinical risk scores (History, ECG, Age, Risk factors and Troponin [HEART] and Emergency Department Assessment of Chest pain Score [EDACS]), decision trees, and logistic regression.

If machine learning can demonstrate improved discriminative performance compared to logistic regression and related methods, along with appropriate calibration, its next hurdle for adoption is to provide interpretability. Clinicians may be willing to forgo maximum predictive accuracy for the sake of understanding how individual predictors influence the output of the algorithm. With analytical effort it may be possible to provide sufficient interpretability for clinicians to accept the complication of machine learning and the benefit of the (potentially) improved predictive accuracy. To facilitate interpretation, we explained the models through prediction-based variable importance ranking and accumulated local effect visualization. If simpler algorithms remain preferred, the ML results can at least approximate the best achievable performance, and so serve as benchmark standards when considering more restrictive algorithms.

Certain analytical characteristics would be important to arrange in order for ML to potentially improve upon simpler options. First, it was important to extract a broader set of granular predictor variables than were used by existing scores. Extensive predictor sets give ML the potential to capture interactions and nonlinear relationships that are missed by linear or additive approaches, perhaps relevant only to certain subgroups of patients.
Further, ML may statistically identify novel predictors that have been missed by existing scores or the broader literature, or whose predictive impact was too small, in too complex a form, or underrepresented in terms of sample size to be detected by non-ML methods. The expansion of electronic health records (EHRs) also makes broader covariate collection more feasible and relevant than was possible prior to EHRs, while also facilitating more granular measurement of variables (E. H. Kennedy …).

It is also important for variables to be measured on a fine-grained scale, which gives ML the opportunity to detect novel cut-points or thresholds that improve performance. Variables should be kept as their original continuous measurements rather than dichotomized or discretized into qualitative levels (Senn 2005). For example, a predictor such as body mass index (BMI) loses substantial information when it is dichotomized into an indicator of high BMI or the absence of high BMI. A single threshold chosen for that dichotomization may not be optimal for certain subgroups or regions of risk.

All adult patients were retrospectively included if they had received cardiac troponin testing in the emergency department and either presented with a chief complaint of chest pain or chest discomfort, or had been assigned a primary or secondary ICD-coded diagnosis of chest pain by their ED physician. The latter inclusion criterion is important because patients may complain of "anginal equivalents" (such as shortness of breath) in lieu of overt chest pain (Amsterdam et al. 2010). The initial inclusion pool had a 60-day MACE rate of 8.0%. Patients were excluded if they had a MACE diagnosis in the ED or within 30 days prior to the ED visit, had alternative non-ACS diagnoses at the index visit (e.g. pneumonia, pneumothorax, or traumatic injury), could not be tracked due to lack of active health plan membership during the study (except in cases of death), or had a troponin I > 99th percentile upper limit of normal, given the dominant predictive value of elevated troponin values for adverse outcomes both in patients with acute coronary syndromes and in the general population (PMID 20447535, PMID 21139111). Patients were also excluded if their smoking status was unknown, which was viewed as a key marker of low-quality data. The final study cohort consisted of 116,764 patients with a 60-day MACE incidence of 1.88%. A fourth-generation troponin assay was used during the study period (AccuTnI+3, Beckman Coulter, Brea, CA, USA).

Our primary outcome was cumulative MACE incidence within 60 days of the index visit. We defined MACE as myocardial infarction, cardiogenic shock, cardiac arrest, or death.

We used a total of 74 predictors sourced from the electronic health record, including vitals, labs, history, qualitative interpretation of ECG imaging, regular expression-based extraction of features from clinical notes, demographics, and missingness indicators (20). These predictors are detailed in Table ??. We created missingness indicators for each predictor, which marked the observations that were missing a value. Inclusion of missingness indicators often improves predictive performance (Agor et al. 2019). That matrix of missingness indicators was analyzed for perfect collinearity, and duplicate indicators were dropped.
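The construction of missingness indicators and the removal of duplicates can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the function name, the `miss_` prefix, and the toy records are hypothetical, and dropping indicators that are constant (never missing) is an added assumption that the text does not state explicitly.

```python
def missingness_indicators(rows, predictors):
    """Build 0/1 missingness indicators and drop redundant ones.

    rows:       list of dicts, with None marking a missing value.
    predictors: names of the predictor columns to inspect.
    Returns a dict mapping indicator name to a tuple of 0/1 flags.
    """
    kept = {}
    seen = set()
    for name in predictors:
        flags = tuple(1 if row.get(name) is None else 0 for row in rows)
        # Assumption: an indicator with no variation carries no information.
        if len(set(flags)) < 2:
            continue
        # Two identical indicators are perfectly collinear; keep the first.
        if flags in seen:
            continue
        seen.add(flags)
        kept["miss_" + name] = flags
    return kept


# Toy records (hypothetical): LDL and HDL are always missing together.
rows = [
    {"age": 61, "bmi": 27.1, "ldl": 110, "hdl": 45},
    {"age": 48, "bmi": None, "ldl": None, "hdl": None},
    {"age": 73, "bmi": 31.0, "ldl": None, "hdl": None},
    {"age": 55, "bmi": None, "ldl": 95, "hdl": 50},
]
indicators = missingness_indicators(rows, ["age", "bmi", "ldl", "hdl"])
```

Here the age indicator is dropped because age is never missing, and the HDL indicator is dropped because it duplicates the LDL indicator.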

Missing predictor values were imputed by factorizing the raw data matrix with generalized low-rank models (GLRM) (Schuler et al. 2016; Udell et al. 2016). GLRM is a generalization of principal component analysis and matrix completion methods and is designed for mixed-type data frames that include continuous, categorical, ordinal, and binary variables. GLRM decomposes (factorizes) the original data frame into an X matrix of reduced components and a Y matrix of archetypes, including possible penalty terms that can induce sparsity (L1) or simply denoise (L2 or quadratic). Multiplying these two factor matrices reconstructs the original data frame, imputing any data entries with missing values. The method used for missingness provides few constraints on the resulting fit and also permits prediction from future data with missing values.

medRxiv preprint doi: https://doi.org/10.1101/2021.03.08.21252615; this version posted March 8, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

The GLRM hyperparameter settings were chosen through a grid search in which each model was trained on 75% of the data and evaluated on the remaining 25% for accuracy at reconstructing the original observed data matrix. Missingness indicators were not included in the GLRM imputation analysis. Our final GLRM settings were: 50 components, quadratic regularization on X with weight 4, and L1 regularization on Y with weight 24. Cells with missing data were then replaced with the corresponding values from the data matrix reconstructed by GLRM using the optimal settings (footnote 1).

GLRM imputation greatly increased the number of unique values (cardinality) for continuous variables, which would have a negative performance impact on tree-based algorithms that test every unique value for a potential split. To avoid that performance drop, we used penalized histogram binning to bin imputed predictors with high cardinality into up to 200 unique values (Rozenholc et al. 2010).

Multiple imputation was not necessary because our scientific goal was to characterize predictive performance for the unimputed outcome variable, rather than to estimate statistical parameters for covariates that were imputed, such as linear regression coefficients (Steyerberg 2009; Wang et al. 1992).

Dozens if not hundreds of other prediction algorithms would be possible to evaluate, but computational time limitations forced us to choose a finite set with reasonable performance expectations. We chose well-known prediction algorithms that have shown strong performance in prior research, including both linear and decision tree-based estimation. The tree-based prediction algorithms were random forest (Breiman 2001), extreme gradient boosting (XGBoost) (Chen et al. 2016), and Bayesian additive regression trees (Chipman et al. 2010). The linear prediction algorithms were generalized additive models (T. J. Hastie et al. 1990) using thin plate splines (Wood 2003), and lasso (Tibshirani 1996).

Splines have shown competitive performance with tree-based algorithms in prior clinical prediction work due to their ability to identify non-linear but smooth patterns (Austin 2007). The lasso algorithm (or its generalization, the elastic net) is a helpful test of sparsity in the covariates, and a faster and more nuanced variable selection method than best subset or stepwise selection (T. Hastie et al. 2017). Better performance for lasso compared to logistic regression would indicate that feature selection could be helpful for other algorithms, while equal performance could indicate that the extraction of predictors from the EHR was overly restrictive and should be broadened.

When evaluating complex algorithms it is important to contextualize their performance by comparing to simpler alternative approaches or benchmarks. If the benchmark algorithms can achieve similar performance then the extra complexity of the statistical machine learning algorithms may not be worthwhile. The improvement of a novel prediction method over standard benchmarks is known as the skill of the prediction method (Brier 1950; Murphy et al. 1977; F. Sanders 1963). In clinical prediction the primary alternatives to statistical machine learning are relatively inflexible fits, which include logistic regression, ordinary least squares, individual decision trees, and stratification on key clinical covariates. We tested each of these options, where key covariates were defined as peak troponin, qualitative ECG reading, EDACS score, and HEART score. As a complement to stratification on different subsets of key covariates, we also evaluated logistic regression and decision trees when restricted to these key covariates.

Footnote 1: Here X refers to the reduced components after GLRM transformation, and Y refers to the complementary matrix that transforms those components back to the original covariate space. It does not refer to the outcome variable.

When comparing a variety of algorithms an initial choice is to use cross-validation to select the algorithm with the best out-of-sample performance. A more nuanced decision would be to consider a weighted average of multiple algorithms, creating a team of algorithms whose contribution to the prediction is based on optimizing out-of-sample performance on a certain statistic. That is the nature of stacked ensembles (Breiman 1996; Wolpert 1992), sometimes referred to as the Super Learner algorithm (van der Laan et al. 2007). Rather than restrict our prediction machine to a single algorithm, we create a weighted average across all tested algorithms, and select weights based on an optimization goal so that they minimize a chosen performance statistic on test data. We chose to optimize the Brier score (i.e. mean-squared error) in our ensemble, using convex weights based on a non-negative least squares meta-learner. Optimizing on Brier score includes a focus on both discrimination and calibration for the ensemble (Murphy et al. 1977). A convex combination of algorithm weights ensures that predictions fall …

Prediction algorithms often have multiple hyperparameter settings that adjust the estimation procedure in different ways. Those hyperparameters are not estimated from the data, but rather must be specified a priori by the analyst. While software implementations will typically provide a default value for each hyperparameter, there is no reason to believe that the default values are effective for the current dataset.
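The convex weighting used by the stacked ensemble described earlier can be sketched in a few lines. This is an illustrative sketch only, not the study's implementation: it assumes that each learner's out-of-fold predictions are already available as columns of a matrix, solves the non-negative least squares problem with a simple coordinate descent (a solver chosen here for brevity), and then rescales the weights to sum to one, which is one common way to obtain a convex combination. All function names and the toy data are hypothetical.

```python
def nnls_coordinate_descent(Z, y, n_iter=200):
    """Minimize sum_i (y_i - sum_j w_j * Z[i][j])**2 subject to w_j >= 0."""
    n, k = len(Z), len(Z[0])
    w = [0.0] * k
    resid = list(y)  # residual y - Zw, starting from w = 0
    for _ in range(n_iter):
        for j in range(k):
            col = [Z[i][j] for i in range(n)]
            denom = sum(c * c for c in col)
            if denom == 0.0:
                continue
            # Best update for w_j with the other weights held fixed,
            # truncated at zero to respect the non-negativity constraint.
            step = sum(c * r for c, r in zip(col, resid)) / denom
            new_wj = max(0.0, w[j] + step)
            delta = new_wj - w[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                w[j] = new_wj
    return w


def convex_weights(Z, y):
    """Non-negative least squares weights rescaled to sum to one."""
    w = nnls_coordinate_descent(Z, y)
    total = sum(w)
    if total == 0.0:
        return [1.0 / len(w)] * len(w)
    return [wj / total for wj in w]


def brier(pred, y):
    """Brier score: mean squared error of predicted probabilities."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


# Toy out-of-fold predictions (hypothetical): learner A is informative,
# learner B predicts a constant 0.5 for everyone.
y = [0, 0, 0, 1, 1, 0, 1, 0]
learner_a = [0.1, 0.2, 0.1, 0.8, 0.7, 0.2, 0.9, 0.1]
learner_b = [0.5] * 8
Z = [[a, b] for a, b in zip(learner_a, learner_b)]
weights = convex_weights(Z, y)
ensemble = [sum(w * z for w, z in zip(weights, row)) for row in Z]
```

Because the weights are non-negative and sum to one, each ensemble prediction lies between the smallest and largest prediction of the component learners for that patient.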

Customizing the hyperparameter configuration to the current dataset can allow the algorithms to adapt to the available sample size, number of predictor variables, measurement error in the predictors, sparsity in predictor relevance, and correlation structure of the predictors. Hyperparameters are often chosen by fitting the algorithm with different configurations and selecting the configuration that maximizes accuracy on held-out data, such as through cross-validation. The benefit of hyperparameter tuning is believed to vary by algorithm, which is referred to as the tunability of the algorithm. Random forest, for example, is believed to work well with default hyperparameters but also can benefit from hyperparameter tuning, particularly to reduce overfitting (Probst, Wright, et al. 2019; Segal et al. 2011).

Hyperparameter tuning is inherently a computationally intensive process, as it involves fitting the algorithms many different times, and its cost varies based on the number of hyperparameters (dimensionality) as well as the number of unique values tested for each hyperparameter (resolution). Further complexity is involved if one considers that some hyperparameters may be more important than others for a given algorithm. Given the role of hyperparameters in modifying the performance of prediction algorithms, caution is warranted when generalizing algorithm performance characteristics from individual studies (e.g. algorithm X outperforms algorithm Y), particularly when hyperparameters are left at their default values and therefore are not customized to the given dataset.
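The way grid size grows with dimensionality and resolution can be made concrete with a short sketch. The hyperparameter names and values below are hypothetical placeholders and are not the grid used in the study; the number of folds is likewise an assumption for illustration.

```python
from itertools import product


def build_grid(spec):
    """Expand {hyperparameter: [values]} into a list of configurations."""
    names = list(spec)
    return [dict(zip(names, combo)) for combo in product(*spec.values())]


def count_fits(spec, n_folds):
    """Number of model fits needed to cross-validate every configuration."""
    return len(build_grid(spec)) * n_folds


# Hypothetical two-dimensional grid with a resolution of 3 values each.
grid_spec = {
    "mtry": [5, 10, 20],           # hypothetical values
    "min_node_size": [1, 5, 20],   # hypothetical values
}
grid = build_grid(grid_spec)
fits = count_fits(grid_spec, n_folds=10)
```

Adding a third hyperparameter at the same resolution would triple the number of configurations, which is why small grids are attractive when each fit is expensive.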

For this work we adopted a hyperparameter tuning approach using nested ensembling. Much as using a weighted ensemble of different algorithms may be preferable to selecting the single best-performing algorithm, using a weighted ensemble of hyperparameter settings for a given algorithm may yield improved performance compared to selecting a single set of hyperparameters. With that concept in mind we created small grids of hyperparameter configurations and estimated a SuperLearner ensemble for a given algorithm in which the ensemble weights selected the hyperparameter settings that maximized out-of-sample performance. This ensemble of hyperparameter settings could potentially rely on a single configuration due to the sparsity induced by the convex combination, or the optimization could distribute the weighting across multiple configurations if such a weighting improved performance over a single selected configuration. Another benefit of the nested ensembling is that it limits the number of learners that are analyzed in the outer SuperLearner ensemble, which can conserve power and mitigate overfitting in the meta-learning process (i.e. allocation of weights in the convex combination).

We used the ensemble hyperparameter tuning approach for random forests, XGBoost, and individual decision trees. The random forest grid consisted of 9 configurations: …

We chose area under the precision-recall curve (PR-AUC, also known as average precision) as our primary performance metric for evaluating discrimination, because it highlights performance differences that may be missed by ROC-AUC with imbalanced data (Cook 2007; Saito et al. 2015). We included area under the receiver operating characteristic curve (ROC-AUC or the concordance statistic) as our secondary performance metric, which remains highly popular and interpretable (Janssens et al. 2020). As an exploratory metric we also estimated the adjusted Brier score (index of prediction accuracy), which integrates discrimination and calibration into a single metric (Kattan et al. 2018). We visualized improvements in discriminative performance using density plots of the calibration slope (Steyerberg et al. 2010). We did not conduct a reclassification analysis due to recognized limitations (Hilden et al. 2014; Kerr et al. 2014; Leening et al. 2014; Pepe et al. 2015).

Our clinical use case was centered on a risk threshold of 0.5% to classify patients as "low risk" in order to qualify for early discharge. Because of that scientific goal, it was especially important to compare the model's predicted risks to the observed risks, i.e. its calibration (Lichtenstein et al. 1981), also known as reliability (Brier 1950; Murphy et al. 1977) or external correspondence (Yates 1982). We assessed the calibration of predicted probabilities in two ways: 1) calibration curve visualization, 2) calculation of the index of prediction accuracy (IPA), a transformation of the Brier score (Kattan et al. 2018). We did not conduct a Hosmer-Lemeshow group-based calibration test due to its recognized limitations and recommendations against its use (Kramer et al. 2007; Van …).

The planned clinical use of the prediction model was first to assess eligibility for early discharge among low-risk patients. Accurately estimating the risk of MACE for patients would allow those low-risk patients to be discharged and avoid additional unnecessary workup, freeing up resources (clinical attention, testing capacity, etc.) for higher risk patients. Low risk was generally defined as being below a 0.5% well-calibrated probability of MACE within 60 days, with less conservative thresholds of 1% and 2% as additional options.
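The performance metrics named in this section (PR-AUC as average precision, ROC-AUC, and the index of prediction accuracy) can be written out directly. The following is a plain-Python sketch for small inputs, assuming binary outcomes coded 0/1 and no tied scores for the average precision calculation. It takes the IPA to be one minus the ratio of the model's Brier score to the Brier score of a null model that predicts the overall event rate. The function names and toy data are illustrative and are not the study's code.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # Ties count as half a win; this pairwise form is O(n_pos * n_neg).
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def index_of_prediction_accuracy(y_true, scores):
    """1 - Brier(model) / Brier(null model predicting the event rate)."""
    n = len(y_true)
    prevalence = sum(y_true) / n
    brier_model = sum((s - y) ** 2 for s, y in zip(scores, y_true)) / n
    brier_null = sum((prevalence - y) ** 2 for y in y_true) / n
    return 1.0 - brier_model / brier_null


# Small toy example (hypothetical data).
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_toy, s_toy)
auc = roc_auc(y_toy, s_toy)
ipa = index_of_prediction_accuracy(y_toy, s_toy)
```

With a 1.9% event rate, a model that ranks patients at random would have an average precision near 0.019 but a ROC-AUC near 0.5, which is one reason the two metrics can tell different stories on imbalanced data.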

Our model needed to balance two trade-offs: 1) false "negatives" in which a patient was identified as low-risk but whose true risk of MACE within 60 days was above the threshold, and 2) false "positives" in which patients were believed to be above the given threshold but whose true risk was less than the threshold. Errors in the first category have a greater cost than those in the second category, because there is a greater potential detriment to those patients who were discharged early but whose true risk exceeded the threshold. Patients incorrectly estimated to be above the risk threshold, but who are truly low risk, have comparatively minor costs of additional workup, use of clinical resources, and potential to be overtreated. Yet these possible errors are not quite the same as the false negatives or false positives typically used to assess predictive models: we care about the true, but unknown, risk rather than the observed outcome. Under this decision-making calculus a patient whose true risk is correctly predicted to be below the clinical threshold, and is therefore discharged without additional workup, but who ends up having a MACE would still have been managed appropriately.

This suggests that the absolute or squared error of the patient's predicted risk versus true risk, particularly near the clinical threshold, would be reasonable loss functions to translate into clinical utility. Miscalibration near the clinical threshold needs to be avoided, whereas miscalibration away from the threshold does not affect the decision.

As the expected value of that loss approaches zero we would see that the number of false negatives and false positives (in terms of risk above or below a threshold rather than the observed outcome) also approaches zero. We could target a specific threshold by focusing on patients on the incorrect side of the threshold and averaging the error in their risk prediction, possibly including differential weights for each side of the threshold to account for different costs to the patient. Such a "miscalibration-around-a-threshold" loss function would be defined in terms of the following quantities:

• Y is the observed outcome and X is the set of predictors,

• i indexes each patient in the sample,

• P_0(Y_i | X_i) is the true risk of patient i,

• f̂(X_i) is the predicted risk of patient i from a given estimator f̂,

• τ is the clinical threshold (e.g. 0.5%),

• g is a function such as the identity, squared value, or absolute value function,

• ω_1 is the differential cost for low-risk patients who are kept for further workup,

• ω_2 is the differential cost for high-risk patients who are incorrectly discharged early.

We do not know the true risk for any patients, but we can estimate it within our sample by fitting a semi-parametric smooth function (e.g. lowess) to estimate the true probability of the outcome given the estimated predicted probability, equivalent to what is done during calibration analysis.

If multiple decisions were to be made based on the estimated risk, we might sum this loss over each decision. Alternatively we might use a threshold-free loss function, such as the average over all patients of g applied to the difference between predicted and true risk. In this work we focus on the threshold-free loss with absolute value as the transformation function g.
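One way to write the two losses described above, using only the quantities defined in the list, is sketched below. This is a reconstruction under assumptions rather than the authors' exact formula: the symbols L_τ and L, the indicator notation, the choice to classify a patient as low risk when the predicted risk is strictly below τ, and the orientation of the differences inside g (chosen so that each term is non-negative even when g is the identity) are all assumptions.

```latex
% Threshold-specific loss: only patients on the wrong side of tau contribute.
L_{\tau}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \Big[
    \omega_1 \, \mathbb{1}\{ P_0(Y_i \mid X_i) < \tau \le \hat{f}(X_i) \} \,
        g\big( \hat{f}(X_i) - P_0(Y_i \mid X_i) \big)
  + \omega_2 \, \mathbb{1}\{ \hat{f}(X_i) < \tau \le P_0(Y_i \mid X_i) \} \,
        g\big( P_0(Y_i \mid X_i) - \hat{f}(X_i) \big)
\Big]

% Threshold-free loss: every patient contributes.
L(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n}
    g\big( \hat{f}(X_i) - P_0(Y_i \mid X_i) \big)
```

With g taken as the absolute value, the threshold-free form reduces to the mean absolute difference between predicted risk and (estimated) true risk.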

Beyond the statistical performance of a clinical prediction, it can be important to provide an explanation or overview of how a model generates its predictions. Interpretation is desirable first because it can provide evidence that the model is working as expected, which can improve the trustworthiness of its predictions for clinicians, patients, or collaborators. Interpretation may also lead to scientific insights about how predictors are related to the outcome, which could be conceptualized as causal pathways, data generating processes, or biological mechanisms. Interpretation can further inform the data export and cleaning processes, such as identifying extreme values, data entry errors, or outliers, or suggesting additional predictor variables to incorporate into the model.

Methods of interpretation can be model-specific or model-agnostic. For models within the family of linear regression, one might provide the estimated beta coefficients for each predictor, along with their associated confidence intervals and p-values. Interpretation becomes less straightforward as models become more complex, such as with interaction or polynomial terms in a regression, random forest or boosted tree models with hundreds or thousands of non-linear decision trees, or splines in which ranges of a given predictor might have different coefficients.

In this work we focus on two complementary forms of model interpretability: variable importance ranking and accumulated local effect plots, as described below.

… which could formally identify predictors that differed from their expected importance.

… flawed results because they make a key unrealistic assumption that features are statistically independent of each other (Molnar 2020, sec. 5.1.3). Accumulated local effect (ALE) plots are a recently developed method that avoids that limitation of partial dependence plots (PDPs), by counterfactually modifying observations that lie within a nearby kernel neighborhood of the current predictor's value of interest (Apley et al. 2019). Following the variable importance ranking, we visualize the contribution of high-importance continuous variables using accumulated local effect plots.

… and is displayed in Figure 3.

Figure 3. Comparison of cross-validated discriminative performance using ROC-AUC metric, with 95% confidence intervals. The simple mean had a standard AUC of 0.5 and is omitted from the plot.

The median predicted risk was 0.64%, with a first quartile of 0.2% and a third quartile of 2%. Our primary threshold of scientific interest was 0.5% for possible early discharge. Given those low risk levels, it would be best to "zoom in" our visual calibration review to that region. We show a zoomed calibration plot as Figure 5.

Figure 5. Zoomed calibration plot comparing predicted risk to observed risk. Clinical thresholds of 0.5%, 1%, or 2% risk are noted by blue vertical lines.

Finally, we include an exponential-scale calibration plot (Figure 6) with calibration confidence intervals after grouping patients into 10 groups based on predicted risk, consistent with TRIPOD guideline recommendations (Collins et al. 2015). Due to the substantial class imbalance, the exponential scaling of axes allows easier comparison across the probability range, although it may be less intuitive due to the shifting of scales. For example, the width of confidence intervals is counterintuitive for visual comparison due to the dynamic scaling, but the amount of information provided is visually consistent throughout the plot.

Figure 6. Exponential-scale calibration plot comparing predicted risk to observed risk with grouped 95% confidence intervals. Clinical thresholds of 0.5%, 1%, or 2% risk are noted by blue vertical lines.
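The grouped calibration summary behind a plot like Figure 6 can be sketched as follows. This is a minimal illustration with hypothetical names and toy data: patients are sorted by predicted risk and split into equal-sized groups, and each group's mean predicted risk is compared with its observed event rate. The normal-approximation (Wald) interval used here is an assumption for brevity, since the text does not state which interval method was used, and it behaves poorly when a group has very few events.

```python
def grouped_calibration(predicted, observed, n_groups=10):
    """Return one (mean predicted, observed rate, ci_low, ci_high) per group."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    table = []
    for g in range(n_groups):
        # Integer arithmetic splits the sorted patients into near-equal groups.
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        m = len(chunk)
        mean_pred = sum(p for p, _ in chunk) / m
        obs_rate = sum(y for _, y in chunk) / m
        # Assumed Wald 95% interval, clipped to the [0, 1] range.
        half = 1.96 * (obs_rate * (1.0 - obs_rate) / m) ** 0.5
        table.append((mean_pred, obs_rate,
                      max(0.0, obs_rate - half), min(1.0, obs_rate + half)))
    return table


# Toy data (hypothetical): 50 patients at 0.5% predicted risk with no events
# and 50 patients at 2% predicted risk with one event.
pred = [0.005] * 50 + [0.02] * 50
obs = [0] * 50 + [0] * 49 + [1]
calib = grouped_calibration(pred, obs, n_groups=2)
```

Plotting the first two columns against each other, with the interval as error bars, gives the grouped calibration display; a well-calibrated model has points near the diagonal.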

As a statistical complement to the visual examination, we also calculated mean absolute error (MAE). MAE is the sample mean of the absolute difference between the smoothed observed risk ($\mathrm{Risk}_O$) and the predicted risk ($\mathrm{Risk}_P$):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \mathrm{Risk}_{O,i} - \mathrm{Risk}_{P,i} \right|$$
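The calculation can be sketched as below. To stay within the standard library, this sketch swaps in a rank-based running mean for the smoother, which is an assumption and only a simplified stand-in for a local regression smoother such as lowess. The function names and the `span` parameter (the fraction of patients averaged around each point) are hypothetical.

```python
def smooth_observed(predicted, observed, span=0.05):
    """Running-mean smoother over patients ordered by predicted risk.

    Each patient's smoothed observed risk is the mean outcome of the
    nearest `span` fraction of patients by rank. This is a simplified
    stand-in for lowess, not lowess itself.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    y = [observed[i] for i in order]
    n = len(y)
    k = max(1, int(round(span * n)))  # window size in patients

    # Prefix sums allow each window mean to be computed in O(1).
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)

    smoothed = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - k // 2)
        hi = min(n, lo + k)
        lo = max(0, hi - k)  # keep the window full near the upper end
        smoothed[i] = (prefix[hi] - prefix[lo]) / (hi - lo)
    return smoothed


def calibration_mae(predicted, observed, span=0.05):
    """Mean absolute difference between smoothed observed and predicted risk."""
    smoothed = smooth_observed(predicted, observed, span)
    return sum(abs(s - p) for s, p in zip(smoothed, predicted)) / len(predicted)


# Toy example: every patient has the event, so smoothed observed risk is 1.
preds = [0.1, 0.2, 0.3, 0.4]
outcomes = [1, 1, 1, 1]
print(calibration_mae(preds, outcomes, span=0.5))
```

A smaller `span` tracks local deviations more closely, while a larger one averages them out.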
We found an MAE of 0.19% with a lowess smoothing span of 0.05 (low smoothing), and an MAE of 0.14% with a smoothing span of 0.20 (high smoothing). These statistics indicate that the ensemble risk prediction was typically miscalibrated by about 0.17 percentage points.

We evaluated the benefit of the more complex GLRM-based imputation by comparing the imputed value to the known value, among variables with missingness. The root mean-squared error (RMSE) metric was calculated for each variable, and for both GLRM and median/mode imputation methods. We could then estimate the percentage improvement in RMSE for the GLRM imputation. Results in Table 2 show a notable improvement in RMSE for every variable, with the exception of the obesity binary variable.
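The per-variable comparison can be illustrated with a short sketch. It assumes that a set of known values has been held out and imputed by both methods. The function names and toy numbers are hypothetical, and the "candidate" values merely stand in for the output of a model-based imputer such as GLRM.

```python
import math
import statistics


def rmse(truth, estimate):
    """Root mean-squared error between known and imputed values."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def pct_rmse_improvement(truth, baseline_imputed, candidate_imputed):
    """Percentage reduction in RMSE of the candidate relative to the baseline."""
    base = rmse(truth, baseline_imputed)
    cand = rmse(truth, candidate_imputed)
    return 100.0 * (base - cand) / base


# Toy held-out values for one continuous variable (illustrative only).
known = [1.0, 2.0, 3.0, 4.0]
median_imputed = [statistics.median(known)] * len(known)  # baseline: median fill
model_imputed = [1.5, 2.0, 3.0, 3.5]  # stand-in for a model-based imputation

improvement = pct_rmse_improvement(known, median_imputed, model_imputed)
print(f"RMSE improvement: {improvement:.1f}%")
```

A positive value means the candidate imputation is closer to the known values than the median baseline, and a negative value means it is worse.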

algorithms: random forest and xgboost. We used the optimal hyperparameter settings from cross-validated analysis.

In future work we plan to expand the machine learning in several ways. The ensemble weighting could specifically optimize PR-AUC. Incorporating feature selection may benefit the simpler algorithms by removing unhelpful predictors. Feature engineering might be beneficial as well, such as creation of interaction terms or even incorporation of the principal components from the GLRM imputation. Due to computational limitations we were not able to conduct hyperparameter tuning on the BART learner, which likely would provide some performance benefit. We are optimistic that random search or model-based search (e.g. Bayesian optimization) rather than grid search could provide even stronger tuning of algorithm hyperparameters across a higher number of dimensions. Evaluation of the GLRM imputation could be further contextualized through comparisons to additional imputation methods, especially principal component analysis, k-nearest neighbors, multiple imputation, and variable-specific supervised models (e.g. OLS or random forest). Additional machine learning algorithms could be explored, such as LightGBM, extremely randomized trees, and multivariate adaptive regression splines. The variable importance ranking could be streamlined through a Random Forest-style permutation importance analysis of the SuperLearner ensemble itself, or through a targeted learning method such as vimp (Williamson et al. 2017) or varimpact (Hubbard et al. 2018).
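As one possible shape for the permutation importance idea, the sketch below shuffles one predictor at a time and records the drop in ROC-AUC for an arbitrary fitted predictor. The `predict` callable stands in for an ensemble's prediction function. All names and the toy data are hypothetical, and a real analysis would evaluate on held-out or cross-validated data with repeated shuffles.

```python
import random


def roc_auc(scores, labels):
    """ROC-AUC via the Mann-Whitney U statistic (tied scores share an average rank)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in ROC-AUC after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = roc_auc([predict(r) for r in rows], labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - roc_auc([predict(r) for r in shuffled], labels))
    return drops


# Toy data: the outcome depends only on feature 0; feature 1 is ignored.
rows = [[i, (i * 7) % 5] for i in range(20)]
labels = [1 if i >= 10 else 0 for i in range(20)]
drops = permutation_importance(lambda r: r[0], rows, labels, n_features=2)
print(drops)
```

Because the toy predictor ignores feature 1, shuffling that column leaves its predictions unchanged and its importance is exactly zero.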

The model might also benefit from a broader sample that includes higher-risk patients, who were not included in this study. Calibration might be improved through targeted learning-based adjustment (Brooks et al. 2012). Cross-validated estimation of discrimination performance could be improved through cross-validated targeted maximum likelihood estimation (Benkeser et al. 2019).

In this work we explored the benefit of complex machine learning algorithms for predicting major adverse cardiac events in patients with chest pain. We found that the ML algorithms were able to achieve improved discrimination compared to simpler baselines such as logistic regression, decision trees, or stratification on individual predictors. Combining multiple algorithms into an ensemble estimator yielded the best performance, and rather than selecting a single optimal hyperparameter setting, we created an ensemble of algorithms across different hyperparameters. We demonstrated the surprising effectiveness of generalized low-rank models for imputation of missingness in EHR-sourced patient data. Finally, we provided interpretation of how the ensemble's prediction is generated through two methods: ranking the predictors by their contribution to predictive performance, and visualizing the dose-response effect of continuous predictors with accumulated local effect plots.
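For readers unfamiliar with accumulated local effects, a first-order ALE curve for one continuous predictor can be sketched as follows. This is an illustrative simplification rather than the analysis code: the function name, the quantile binning, and the centering rule (count-weighted mean of bin midpoints) are assumptions.

```python
def ale_1d(predict, rows, j, n_bins=4):
    """First-order accumulated local effect of feature j.

    Returns (edges, ale), where ale[k] is the centered accumulated effect
    at bin edge edges[k].
    """
    values = sorted(r[j] for r in rows)
    n = len(values)
    # Quantile-based bin edges; duplicates collapse if the feature has ties.
    edges = sorted({values[(k * (n - 1)) // n_bins] for k in range(n_bins + 1)})

    effects, counts = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are (lo, hi]; the first bin also includes its lower edge.
        members = [r for r in rows
                   if lo < r[j] <= hi or (lo == edges[0] and r[j] == lo)]
        # Local effect: prediction change when moving feature j across the bin,
        # holding each member's other features at their observed values.
        diffs = [predict(r[:j] + [hi] + r[j + 1:]) - predict(r[:j] + [lo] + r[j + 1:])
                 for r in members]
        effects.append(sum(diffs) / len(diffs) if diffs else 0.0)
        counts.append(len(members))

    # Accumulate the local effects across bins.
    acc = [0.0]
    for e in effects:
        acc.append(acc[-1] + e)

    # Center so the count-weighted average effect over the data is zero.
    center = sum(c * (acc[k] + acc[k + 1]) / 2 for k, c in enumerate(counts)) / sum(counts)
    return edges, [a - center for a in acc]


# Toy check with a linear predictor: the ALE of feature 0 should have slope 2.
rows = [[float(i), float((i * 3) % 4)] for i in range(9)]
edges, ale = ale_1d(lambda r: 2 * r[0] + 3 * r[1], rows, 0, n_bins=4)
print(list(zip(edges, ale)))
```

Plotting `ale` against `edges` gives the dose-response style curve for that predictor.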

The cleaning and analysis code for this project has been translated to use a public dataset and is available online at