Machine learning models to identify patient and microbial genetic factors associated with carbapenem-resistant Klebsiella pneumoniae infection

Background
Among patients colonized with carbapenem-resistant Klebsiella pneumoniae (CRKP), only a subset develop clinical infection. While patient characteristics may influence risk for infection, it remains unclear if the genetic background of CRKP strains contributes to this risk. We applied machine learning to quantify the capacity of patient characteristics and microbial genotypes to discriminate infection and colonization, and identified patient and microbial features associated with infection across multiple healthcare facilities.

Methods
Machine learning models were built using whole-genome sequences and clinical metadata from 331 patients colonized or infected with CRKP across 21 long-term acute care hospitals. To quantify variation in performance, we built models using 100 different train/test splits of the entire dataset, and urinary and respiratory site-specific subsets, and evaluated predictive performance on each test split using the area under the receiver operating characteristic curve (AUROC). Patient and microbial features predictive of infection were identified as those consistently important for predicting infection based on average change in AUROC when included in the model.

Findings
We found that patient and genomic features were only weakly predictive of clinical CRKP infection vs. colonization (AUROC IQRs: patient=0·59-0·68, genomic=0·55-0·61, combined=0·62-0·68), and that one feature set did not consistently outperform the other (genomic vs. patient p=0·4). Comparable model performances were observed for anatomic site-specific models (combined AUROC IQRs: respiratory=0·61-0·71, urinary=0·54-0·64). Strong genomic predictors of infection included the presence of the ICEKp10 mobile genetic element carrying an iron acquisition system (yersiniabactin) and a toxin (colibactin), along with disruption of an O-antigen biosynthetic gene in a sub-lineage of the epidemic ST258 clone. Teasing apart sequential evolutionary steps in the context of clinical metadata indicated that altered O-antigen biosynthesis increased association with the respiratory tract, and subsequent acquisition of ICEKp10 was associated with increased virulence.

Interpretation
Our results support the need for rigorous machine learning frameworks to gain realistic estimates of the performance of clinical models of infection. Moreover, integrating microbial genomic and clinical data using such a framework can help tease apart the contribution of microbial genetic variation to clinical outcomes.

Funding
Centers for Disease Control and Prevention, National Institutes of Health, National Science Foundation

medRxiv preprint doi: https://doi.org/10.1101/2020.07.06.20147306; this version posted July 7, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

… investigating pathogenicity-associated loci in K. pneumoniae clinical isolates. When we searched for "infection" AND "machine learning" AND "genom*" AND "clinical", there was one relevant result: a study that used clinical and bacterial genomic features in a machine learning model to identify clonal differences related to Staphylococcus aureus infection outcome.

Added value of this study
To our knowledge, this is the first study to integrate clinical and genomic data to study anatomic site-specific colonization and infection across multiple healthcare facilities. Using this method, we identified clinical features associated with CRKP infection, as well as a sub-lineage of CRKP with potentially altered niche-specific adaptation and virulence. This method could be used for other organisms and other …

… become endemic in regional healthcare networks.[3-7] In this background of regional endemicity the risk of patient exposure to CRKP is high, as evidenced by alarmingly high rates of colonization, especially in long-term care settings.[7,8] However, even among critically ill patients residing in long-term care facilities, not all colonized patients develop clinical infections that require antibiotic treatment.[9] Currently, our understanding of the factors that influence whether a colonized patient develops an infection is incomplete.
In addition to clinical characteristics of patients, the genetic background of the colonizing strain may also influence the risk of infection, as there is extensive intra-species variation in antibiotic resistance and virulence determinants harbored by K. pneumoniae.[3] To date, most studies of virulence determinants have been carried out in model systems,[10] or examined in human populations without considering patient characteristics or clinical context.[11] One recent study investigated virulence determinants in K. pneumoniae clinical isolates while controlling for patient characteristics.[12] However, this was a single-site study with a focus on carbapenem-susceptible K. pneumoniae, thereby not addressing the impact of genomic variation in antibiotic-resistant lineages that circulate in global healthcare systems.

Here, we sought to understand the importance of both patient factors and CRKP genetic background in determining whether a patient is infected (vs. colonized) with CRKP, and identify a set of patient and microbial features that are consistent predictors of CRKP infection across long-term care facilities. To accomplish this, we compared patients with CRKP colonization and infection based upon both clinical characteristics and the genomes of their colonizing or infecting strains. To improve the generalizability of our findings, we employed a rigorous machine learning framework and included patients from 21 long-term acute care hospitals (LTACHs) across the US.

Clinical and genomic data
We used whole-genome sequences of clinical (non-surveillance) CRKP isolates and associated patient metadata from a prospective observational study performed in 21 LTACHs from across the US over the course of a year (BioProject accession no. PRJNA415194).[13] We included only the first clinical bloodstream, respiratory, or urinary isolate from each patient (n=355; Figure S1A), and subset to only ST258 isolates for the majority of analyses (n=331; Table S1; see supplementary material for reasoning).

Feature sets
We studied the association between five different feature sets and infection/colonization in CRKP ST258 (Figure S1C) …

Machine learning & model selection
We aimed to classify clinical infection (vs. colonization) using each of the different feature sets (see above); we built classifiers using the first clinical isolate from each patient for all isolates, only respiratory isolates, and only urinary isolates. We performed L2-regularized logistic regression using a modified version of the machine learning pipeline presented in Topçuoğlu et al.[27] using caret version 6.0-85[28] in R version 3.6.2[29] (Figure S1D1). We randomly split the data into 100 unique ~80/20 train/test splits, keeping all isolates from each LTACH grouped in either the training set or the held-out test set to control for facility-level differences among the isolates (e.g., background of circulating strains within each facility, patient population, and clinician test ordering frequency). For valid comparison, the train/test splits were identical across models generated with different feature sets. Hyperparameters were selected via cross-validation on the training set to maximize the average AUROC across cross-validation folds. See supplementary methods for more details.
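The facility-grouped splitting described in this section can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used caret in R. The function names `grouped_split` and `make_splits`, the greedy rule for filling the test set, and the one-seed-per-split scheme are assumptions made purely for illustration. Unlike the study, the sketch does not enforce that the generated splits are unique.

```python
import random
from collections import defaultdict


def grouped_split(facility_of, test_fraction=0.2, seed=0):
    """Split isolate indices so each facility is wholly in train or test.

    facility_of[i] is the facility label of isolate i (hypothetical input).
    Facilities are shuffled and moved to the test set one at a time until
    the test set holds roughly `test_fraction` of all isolates.
    """
    rng = random.Random(seed)
    by_facility = defaultdict(list)
    for idx, fac in enumerate(facility_of):
        by_facility[fac].append(idx)

    facilities = sorted(by_facility)  # fixed order before shuffling
    rng.shuffle(facilities)

    target = test_fraction * len(facility_of)
    test_idx, test_facilities = [], set()
    for fac in facilities:
        if len(test_idx) >= target:
            break
        test_idx.extend(by_facility[fac])
        test_facilities.add(fac)

    train_idx = [i for i, fac in enumerate(facility_of)
                 if fac not in test_facilities]
    return sorted(train_idx), sorted(test_idx)


def make_splits(facility_of, n_splits=100, test_fraction=0.2):
    """One seed per split, so the same splits can be reused for every
    feature set (assumed scheme, mirroring the 'identical splits' idea)."""
    return [grouped_split(facility_of, test_fraction, seed=s)
            for s in range(n_splits)]
```

Because the splits depend only on the facility labels and the seed, calling `make_splits` once and reusing its output for each feature set keeps the comparison between feature sets paired by split.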

Model performance
We measured model performance using the median test area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC), as well as the interquartile range, across all 100 train/test splits (Figure S1D2).
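As a rough illustration of this summary step, the sketch below computes a test AUROC from labels and predicted scores and then reports the median and interquartile range over a list of per-split values. The helper names `auroc` and `summarize` are hypothetical, the pairwise AUROC formula is a standard textbook definition rather than the caret implementation, and the "inclusive" quartile method is an arbitrary choice that may differ from the quartile convention used in the study.

```python
from statistics import median, quantiles


def auroc(labels, scores):
    """AUROC as the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0).
    Ties contribute 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize(per_split_values):
    """Median and interquartile range across train/test splits."""
    q1, _, q3 = quantiles(per_split_values, n=4, method="inclusive")
    return {"median": median(per_split_values), "q1": q1, "q3": q3}
```

In use, `auroc` would be called once per test split, and `summarize` once on the resulting list of 100 values for each feature set.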

Features consistently associated with colonization or infection
To determine the importance of each feature in predicting colonization vs. infection, we measured how much each feature influenced model performance by calculating a permutation importance (Figure S1D3). For each combination of feature and data split, we randomly permuted the feature and calculated the 'permuted test AUROC' using the model generated with the training data. The permutation importance is the difference between the test AUROC and the permuted test AUROC. Features with a correlation of 1 were permuted together. We performed this permutation test 100 times for each feature/data split pair, and obtained a mean permutation importance for each data split. A mean permutation importance above zero indicates that the feature improved model performance for that data split. We highlight features where the mean permutation importance was above zero in at least 75% of the data splits. In this way, the permutation importance method allows us to take into account the variation we observe across the 100 models, which is not possible with standard parametric statistical tests or odds ratios.
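The permutation procedure can be sketched in a few lines of Python. This is an illustrative reconstruction from the description in this section, not the authors' R code: `predict` stands in for any already-trained model that maps a feature row to a score, `feature_group` is a list of column indices shuffled together (as for perfectly correlated features), and all function names are hypothetical.

```python
import random


def auroc(labels, scores):
    """Pairwise AUROC; ties between a positive and a negative count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X_test, y_test, feature_group,
                           n_perm=100, seed=0):
    """Mean of (test AUROC - permuted test AUROC) over n_perm shuffles.

    The columns in feature_group are shuffled with the same row order,
    so features that always move together stay aligned.
    """
    rng = random.Random(seed)
    base = auroc(y_test, [predict(row) for row in X_test])
    order = list(range(len(X_test)))
    drops = []
    for _ in range(n_perm):
        rng.shuffle(order)
        permuted = []
        for i, row in enumerate(X_test):
            new_row = list(row)
            for col in feature_group:
                new_row[col] = X_test[order[i]][col]
            permuted.append(new_row)
        drops.append(base - auroc(y_test, [predict(r) for r in permuted]))
    return sum(drops) / len(drops)


def consistently_important(importance_per_split, threshold=0.75):
    """True if the mean importance is above zero in at least `threshold`
    of the data splits (0.75 follows the 75% rule stated in the text)."""
    above = sum(1 for v in importance_per_split if v > 0)
    return above / len(importance_per_split) >= threshold
```

A feature the model ignores yields an importance of exactly zero under this definition, which is why the rule asks for values strictly above zero.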

Data analysis & visualization
See supplementary material for details on data analysis and visualization in R version 3.6.2.[29-35] All code and data that are not protected health information are on GitHub (https://github.com/Snitkin-Lab-Umich/ml-crkp-infection-manuscript).

Role of the funding source
The funding source had no role in study design; data collection, analysis, and interpretation; or report writing. All authors had full access to all data in the study and final responsibility for the decision to submit for publication.

Of the 355 clinical CRKP isolates from 21 LTACHs across the US,[13] we classified 149 (42%) of the isolates as representing infection based on modified NHSN criteria (Figure S2, Tables S1-3). Stratified by anatomic site, we classified 29/29 (100%) blood isolates as infection, 69/196 (35%) respiratory isolates as infection, and 51/130 (39%) urinary isolates as infection (Table S3). More than 90% of patient isolates were from the epidemic CRKP lineage ST258 (Table S1). Patients harboring different sequence types of CRKP showed no significant differences in infection/colonization status or anatomic site of isolation, and no substantive differences in clinical characteristics (see supplementary material). Thus, we decided to limit our analysis to ST258 to improve our ability to discern whether genetic variation in this dominant strain is associated with infection.

The CRKP epidemic lineage ST258 shows evidence of sub-lineage variation in virulence and anatomic site of isolation
We next evaluated if there exist sub-lineages of ST258 with altered virulence properties by looking for clustering of isolates by infection on the whole-genome phylogeny (Figure 1; see supplementary methods).[36] Infection status was non-randomly distributed on the phylogeny (p=0·002), supporting our hypothesis that the genetic background of CRKP influences infection. We performed a similar clustering analysis to look at potential niche-specific adaptation to certain anatomic sites (Figure 1), and found that respiratory (p=0·001) and urinary (p=0·013) isolates cluster on the phylogeny, but blood isolates do not (p=0·21). This analysis indicates that, in addition to patient features, intra-strain variation in virulence and adaptation to the urinary and respiratory tract might influence whether patients develop an infection.

Both patient and CRKP genetic characteristics are weakly predictive of infection, with relative performance being highly facility-dependent
We next performed machine learning to quantify the ability of patient and microbial genetic characteristics to predict CRKP infection (Figure S1). To prevent over- or under-fitting and control for facility-level biases, we generated 100 train/test data splits, wherein a given LTACH was only included either in the train or test set. Each LTACH occurred a median of 24 times (range 13-32) in the test data split. In this way, we were able to identify patient and CRKP strain characteristics consistently associated with infection or colonization across data splits, and thus across patient populations in different healthcare facilities.
First, we sought to understand if patient and genomic features were individually predictive of CRKP infection. To this end, we independently evaluated patient characteristics as well as three different genomic feature sets for their ability to classify colonization and infection (see methods). Across the 100 different train/test splits, we observed that the average predictive performance was weak, with each of the genomic and patient feature sets predictive of infection to a similar degree (all 1st quartile AUROCs > 0·5; median range=0·55-0·68; Figure 2A; AUPRC: Figure S3A). Across the 100 different data splits, no one feature set was consistently the most predictive (e.g. Figures 2B, 2C; all comparisons p > 0·30, see supplementary methods for p-value calculation). Furthermore, for each feature set AUROCs were distributed such that the test AUROC ranged from below 0·5 to over 0·7, depending on how the data were split (i.e., which facilities appear in the train/test sets). This variation in model performance across different train/test sets suggests that the association of CRKP strain and patient characteristics with infection or colonization varies across facilities.
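Because every feature set was evaluated on the same 100 splits, the comparison between two feature sets is naturally paired by split. The sketch below shows one simple way such a paired comparison could be summarized. The p-value rule used here (twice the smaller fraction of splits in which one feature set beats the other, capped at 1) is an assumption chosen for illustration; the study's actual p-value calculation is described in its supplementary methods and may differ. The function name `compare_feature_sets` is hypothetical.

```python
def compare_feature_sets(auroc_a, auroc_b):
    """Paired comparison of per-split test AUROCs for feature sets A and B.

    auroc_a[i] and auroc_b[i] must come from models trained and tested on
    the same data split i.
    """
    if len(auroc_a) != len(auroc_b):
        raise ValueError("AUROC lists must be paired by data split")
    diffs = [a - b for a, b in zip(auroc_a, auroc_b)]
    n = len(diffs)
    frac_a_better = sum(1 for d in diffs if d > 0) / n
    frac_b_better = sum(1 for d in diffs if d < 0) / n
    # Assumed empirical two-sided p-value; can be exactly 0 when one
    # feature set wins on every split.
    p_empirical = min(1.0, 2 * min(frac_a_better, frac_b_better))
    return {
        "frac_a_better": frac_a_better,
        "frac_b_better": frac_b_better,
        "p_empirical": p_empirical,
    }
```

Under this rule, two feature sets that each win on about half of the splits give a p-value near 1, which matches the qualitative picture of no feature set being consistently the most predictive.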

Integration of patient and CRKP strain features does not improve discriminative performance of overall or anatomic site-specific models
To determine if the predictive power of patient and genomic features is additive, and if combining these disparate feature sets improved validation on held-out facilities, we built models including both patient and curated genomic features. The discriminative performance of the models based on the combined feature set was not significantly greater than that of the individual feature sets (Figure 2A, p ≥ 0·20). Thus, despite variation in the predictive capacity of genomic and patient features across facilities (Figure 2C), combining the two sets did not improve overall performance. Focusing on anatomic site-specific models revealed similar trends, where classification performances were similar for respiratory- and urinary-specific models, and the relative predictive capacity of patient and CRKP strain features varied across facility subsets (Figure S4; AUPRC: Figure S3B).

Some patient and genomic features consistently discriminate colonization and infection
After evaluating the predictive capacity of models, we next sought to identify patient and CRKP strain characteristics that are most associated with CRKP infection or colonization. To this end, we identified those patient and genomic features that consistently improved model performance across the 100 different data splits (see methods). Evaluating the importance of features in this way provides insight into those characteristics that generalize across different facility subsets. This approach was taken for both overall and anatomic site-specific models to identify features predictive of different anatomic sites of infection (Figure 3, Figures S5-7).

Colibactin is a toxin,[3] and yersiniabactin is an iron scavenging system that has been identified in previous animal and human studies as being associated with virulence.[9,10] The O-antigen of lipopolysaccharide (LPS) is a known antigenic marker, although association with a specific anatomic site has not been noted.[40]

A sub-lineage of ST258 clade II appears to have sequentially evolved enhanced adaptation for the respiratory tract and increased virulence
We noted that kfoC disruption is largely confined to a sub-lineage of ST258 (Figures 4, S10, S11). Consistent with this feature being associated with respiratory infection, the disrupted kfoC lineage is enriched in respiratory isolates (82/118, 69% of isolates in the disrupted kfoC lineage are respiratory isolates vs. 101/213, 47% in all other isolates; Fisher's exact p=0·00011), suggesting that this lineage is associated with increased capacity for respiratory colonization. Furthermore, a subset of isolates in the disrupted kfoC sub-lineage contains the ICEKp10 element. Examination of these genetic events in the context of the whole-genome phylogeny revealed that disruption of kfoC occurred first, followed by at least two different acquisitions of ICEKp10 (Figures 4, S10).

… a realistic assessment of the predictive capacities of patient and CRKP genetic features, we employed a machine learning framework using multiple facility-level train/test splits. Overall, we found that, while neither patient nor CRKP genetic features have high predictive accuracy on held-out test data, both feature sets were independently associated with infection, with one or the other being more predictive on different facility subsets. Moreover, the integration of clinical and genomic data led to the discovery of an emergent sub-lineage of the epidemic ST258 clone that may have increased adaptation for the respiratory tract, and is more strongly associated with infection.

One strength of our machine learning approach is that we were able to measure the variation in discriminative performance across 100 train/test iterations that differed in which facilities were included in train and test sets. We found that performance varied greatly depending on how facilities were allocated to train and test sets, highlighting how smaller studies could overestimate or underestimate the discriminative ability of both their model and individual features. One potential explanation for variation in model performance is that there is facility-level heterogeneity depending on their characteristics (e.g. size, geography, etc.), in which case building sub-models for relevant facility subsets may improve performance. Another possible explanation for variation in model performance may be that the critically ill nature of LTACH patients is such that most patients are actually highly susceptible to infection (i.e. many patients colonized with CRKP may ultimately develop an infection). However, it is noteworthy that despite these potential challenges in creating generalizable models, our analysis did yield predictors of infection and colonization consistent across test sets, and thus across LTACHs.

We built classifiers including all genomic features as well as a curated subset of features, and found that both are similarly weakly predictive of infection.
However, while the uncurated feature set presented challenges with downstream interpretation, our analyses on the curated genomic features[15] facilitated novel insights into potential evolutionary trajectories of anatomic site-specific adaptation and virulence. For example, we observed that disruption of the O-antigen biosynthetic gene, kfoC, is associated with isolation from the respiratory tract. While we cannot determine from our machine learning analysis if disruption of kfoC is directly causal, the biological plausibility of an altered O-antigen structure mediating evasion of innate immunity and/or other beneficial interactions with the host makes this a strong candidate for follow-up experiments. Supporting this hypothesis, a previous study found that absence of O-antigen is associated with decreased virulence, but not decreased intrapulmonary proliferation, in a murine model.[41] In addition, we noted that a number of antibiotic resistance determinants were associated with colonization. We hypothesize that this observation could be a consequence of longer duration of residence being associated with increased exposure to off-target antibiotics.[42] Finally, we also saw evidence that, after acquiring yersiniabactin and colibactin on the ICEKp10 element, the disrupted kfoC subclade became more strongly associated with infection, supporting the idea that circulating ST258 sub-lineages can evolve to become both hypervirulent and multi-drug resistant.[18,43-45]

Our study has several important limitations. Specifically, CRKP colonization vs. infection for non-bloodstream isolates may be difficult to discriminate based on surveillance criteria and the clinical data that were available. However, we based our definitions on established CDC criteria with modifications used previously.[7] Encouragingly, we were still able to identify consistent predictors of infection, even with potential misclassifications. A second limitation is that we were limited in the patient data included in our model. It is likely that important differences in underlying patient conditions were not captured by the coarse clinical variables we included, and we also did not account for differences in genetic variation in the host.[46] Other limitations include that our study was restricted to LTACH patients, and had non-random geographic sampling. While LTACHs have unique structural features, based on prior studies, we expect that the types of patient risk factors considered are likely to generalize to other patient populations. Moreover, our restriction to LTACHs in endemic geographic regions has the benefit of focusing on populations at disproportionate risk for CRKP infection.[8] Finally, while the employed machine learning approach allowed for meaningful assessment of discriminative power using a large number of features, by nature it does not yield estimates of attributable risk. However, features identified as consistently associated with colonization or infection on held-out test data can be evaluated by epidemiologists, clinicians, and biologists to identify potential targets for follow-up epidemiologic or laboratory studies.

Conclusion
We employed a machine learning approach to quantify our ability to discriminate between CRKP colonization and infection using patient and microbial genomic features. This approach highlighted the high degree of variation in predictive accuracy across different facility subsets. Furthermore, despite modest predictive power, we identified several genomic features consistently associated with infection, indicating that variation in circulating CRKP strains contributes to infection, even in the context of the critically ill patient populations residing in LTACHs. Future work should aim to corroborate our findings with larger cohorts, and follow up on strong associations to determine whether they are indeed risk factors for infection. This could ultimately help identify patients at high risk for infection and devise targeted strategies for infection prevention.

… material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Conflicts of interest
JHH was employed at the University of Pennsylvania during the conduct of this study. She is currently an employee of, and holds shares in, the GSK group of companies.

7. Han JH, Goldstein EJC, Wise J, Bilker WB, Tolomeo P, Lautenbach E. Epidemiology of Carbapenem-Resistant Klebsiella pneumoniae in a Network of Long-Term Acute Care Hospitals.

… regression. All isolates from a given LTACH were included in either the training split or the testing split for each data split. We built models using five different feature sets, keeping the same 100 data splits. AUROCs of different feature sets were not significantly different. In the right two panels, the curated genomic feature set AUROCs are compared to: (B) the uncurated genomic feature set AUROCs, and (C) the patient feature set AUROCs. Each point is the resulting pair of AUROCs for models built with the same data split, but the two respective feature sets. The dotted lines in all 3 panels indicate the AUROC for choosing an outcome randomly (0·5); anything below the line is worse than random, and anything above the line is better than random. The solid diagonal line in the right two panels is the line y=x; points below the line correspond to a higher curated genomic AUROC for that data split, and points above the line correspond to a higher uncurated genomic AUROC (B), or patient AUROC (C), respectively. The colors in panels (B) and (C) correspond to the colors in panel (A); the points in a given colored area indicate that that feature set had the higher AUROC for that data split.

Figure 3: Features consistently associated with colonization or infection sometimes differ between the overall, respiratory, and urinary models. Feature-specific improvement in model performance, measured as the mean difference between test and permuted AUROC (see methods), of features found to be consistently associated with colonization or infection in at least one of the following analyses: overall, respiratory-specific, urinary-specific. We consider features to be associated with infection/colonization if the AUROC difference was greater than zero in over 75% of the 100 data splits. The vertical solid black line indicates a difference of zero (i.e. the feature provides no improvement to model performance). Horizontal dotted lines separate features associated with urinary but not respiratory isolates (top), both urinary and respiratory (or all) isolates (middle), or respiratory but not urinary isolates (bottom). Bla=Beta lactamase, res=confers resistance to that antibiotic class.