A multisite genomic epidemiology study of Clostridioides difficile infections in the U.S. supports differential roles of healthcare versus community spread for two common strains

Clostridioides difficile is the leading cause of healthcare-associated infectious diarrhea. However, it is increasingly appreciated that healthcare-associated infections derive from both community and healthcare transmission, and that the primary sites of C. difficile transmission may be strain dependent. We conducted a multisite genomic epidemiology study to assess differential genomic evidence of healthcare vs. community spread for two of the most common C. difficile strains in the U.S.: sequence type (ST) 1 (associated with Ribotype 027) and ST2 (associated with Ribotype 014/020). Isolates recovered from stool specimens collected during standard clinical care at three geographically distinct U.S. medical centers between 2010 and 2018 underwent whole genome sequencing and phylogenetic analyses. ST1 and ST2 isolates both displayed some evidence of phylogenetic clustering by study site, but clustering was stronger and more apparent in ST1, consistent with our healthcare-based study more comprehensively sampling local transmission of ST1 compared to ST2 strains. Analyses of pairwise single nucleotide variant (SNV) distance distributions were also consistent with more evidence of healthcare transmission of ST1 compared to ST2, with 44% of ST1 isolates being within 2 SNVs of another isolate from the same geographic collection site compared to 5.5% of ST2 isolates (p-value = <0.001). Conversely, ST2 isolates were more likely to have close genetic neighbors across disparate geographic sites compared to ST1 isolates, further supporting non-healthcare routes of spread for ST2 and highlighting the potential for misattributing genomic similarity among ST2 isolates to recent healthcare transmission. Finally, we estimated a lower evolutionary rate for the ST2 lineage compared to the ST1 lineage using Bayesian timed phylogenomic analyses, and hypothesize that this may contribute to observed differences in geographic concordance among closely related isolates. Together, these findings suggest that ST1 and ST2, while both common causes of C. difficile infection in hospitals, show differential reliance on community and hospital spread. This conclusion supports the need for strain-specific criteria for interpreting genomic linkages and emphasizes the importance of considering differences in the epidemiology of circulating strains when devising interventions to reduce the burden of  infections.

geographic concordance among closely related isolates. Together, these findings suggest that 51 ST1 and ST2, while both common causes of C. difficile infection in hospitals, show differential 52 reliance on community and hospital spread. This conclusion supports the need for strain-53 specific criteria for interpreting genomic linkages and emphasizes the importance of 54 considering differences in the epidemiology of circulating strains when devising interventions to 55 reduce the burden of C. difficile infections. 56 57 58 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

INTRODUCTION 65
Clostridioides difficile is a gram-positive spore-forming anaerobic bacterium that is a dominant 66 cause of infectious diarrhea, colitis, and colitis-associated death in the United States [1,2]. 67 While C. difficile infection (CDI) is classically considered nosocomial [3], recent molecular 68 epidemiologic research suggests that less than 40% of CDI cases are linkable to other 69 symptomatic CDI cases within the same hospital [4][5][6]. This insight has disrupted the paradigm 70 of C. difficile as an exclusively nosocomial pathogen and expanded interest into the roles of 71 alternative routes of C. difficile transmission, including community-based acquisition with 72 difficile isolates were recovered from the speciemens, and DNA was extracted from a single 109 colony as previous described [14][15][16]. Isolates underwent molecular typing via ribotyping at UM 110 and TMC [17], and MLST at MSKCC [18]. DNA from a sample of isolates that were typed as 111 RT027 or RT014/020 at UM and TMC and ST1 or ST2 at MSKCC was extracted for whole genome 112 sequencing. The Institutional Review Boards at each of the study sites approved the study 113 protocols. sequences to identify single nucleotide variants (SNVs) and build phylogenetic trees were 123 executed as previously described [19]. Briefly, raw sequencing reads were trimmed using 124 Trimmomatic to remove low quality bases and adapter sequences. Trimmed reads were then 125 mapped to existing complete reference genomes within the same ST (R20291 for ST1 [GenBank 126 accession number FN545816], and W0022a for ST2 [GenBank accession number CP025046]) 127 with the Burrows-Wheeler short read aligner [20][21][22]. PCR duplicates were discarded and 128 variants were called using SAMtools mpileup and bcftools [23]. Gubbins was used to remove 129 variant sites located in putative recombinant regions [24]. In silico multilocus sequence typing 130 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

Evaluation of phylogenetic clustering 140
To compare the level of clustering by geographic collection site between newly sequenced ST1 141 and ST2 isolates, we overlaid geographic collection site onto the maximum-likelihood whole 142 genome phylogenies and applied a previously described approach for formal clustering 143 assessment [30]. First, we tabulated the number of isolates in a "pure" subtree of each 144 phylogeny-defined as a subtree made up of 2 or more isolates collected from a single 145 geographic site that was found in >90% of bootstrapped phylogenies. To determine whether 146 this number was different than would be observed by chance given the phylogenetic topology 147 and location frequency, we calculated an empirical p-value by randomizing geographic labels 148 and re-calculating this number 1000 times. 149 150 151 152 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint

Evaluation of evidence of recent transmission 153
Evidence of recent transmission was assessed using pairwise SNV distance matrices and two 154 analytic approaches. First, we compared the lower tail (5 th percentile) of the distribution of 155 pairwise SNV distances of pairs of isolates collected from the same collection site to that same 156 metric among pairs of isolates collected from different collection sites by calculating a 5 th 157 percentile SNV-distance ratio (5 th percentile SNV distance within sites/5 th percentile SNV 158 distance between sites). To assess whether this ratio indicated an enrichment of close linkages 159 within collection sites greater than could be expected by chance, we randomly permuted 160 collection sites and re-calculated the ratio 10,000 times; an observed ratio below the 2.5% 161 percentile of the distribution of expected ratios was applied to support significant enrichment 162 of close genetic linkages within study sites. Second, we classified genomic linkages using an 163 SNV-distance threshold of 2 SNVs and compared the proportion of genomically linked isolates 164 (defined as being linked to at least one other isolate) among ST1 isolates compared to those 165 among ST2 isolates using chi-squared tests. An SNV threshold of 2 SNVs is commonly used to 166 identify pairs of C. difficile isolates that were likely related via direct transmission/acquisition 167 from a common source; this threshold is based off of evolutionary rates estimated from within-168 host evolution [4]. We then assessed the sensitivity of these results to larger thresholds of 5-10 169 SNVs. We also compared the proportion of isolates genomically linked to at least one isolate 170 collected from a different geographic collection site between ST1 and ST2 using chi-squared 171 tests. All analyses were completed in R v4.0.2. 172 173 174 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint We assessed the suitability of the data for timed phylogenomic analyses by examining temporal 188 signal-or the relationship between genomic differences and sampling date-using two 189 methods. First, we examined a regression of sampling time vs. root-to-tip genetic distance using 190 Tempest and BactDating [33,34]. We then formally evaluated temporal signal using date 191 randomization tests, randomly permuting the sampling dates 10 times and comparing the 192 evolutionary rate estimates and their 95% credible intervals for the random datasets to the 193 estimates from the real data. We report both the more relaxed and more strict criteria for 194 temporal signal assessment using this approach: with the more relaxed criteria being met if the 195 estimated evolutionary rate was not included in the 95% credible intervals of 10 date 196 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint randomized datasets (CR1), and the more strict being met if the 95% credible interval of the 197 estimated evolutionary rate did not overlap any of the 95% credible intervals of the date 198 randomized datasets (CR2) [35]. We proceeded with evolutionary rate estimates so long as the 199 data met CR1. warranted. To assess whether the data violated a strict clock assumption, we evaluated 208 whether the coefficient of variation parameter in the models with an uncorrelated relaxed 209 lognormal clock had a 95% highest posterior density interval (HPD) that overlapped 0; if not, we 210 used this as evidence of the assumptions of a strict clock being violated and applied an 211 uncorrelated relaxed lognormal clock with a lognormal prior distribution with a mean of 5.0x10 -212 7 and standard deviation of 8x10 -7 based on previous evolutionary rate estimates (while 213 allowing still allowing for significant deviation) [29,36]. To assess to what extent the data 214 violated a constant demographic model, we ran models with exponential growth demographic 215 model prior, and evaluated whether the 95% credible interval of the exponential growth rate 216 parameter overlapped 0. If the exponential growth rate parameter was substantially different 217 from 0, we attempted running a more flexible but parameter rich Gaussian Markov Random 218 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint Field (GMRF) skyride model, which allows for periods of growth as well as periods of stasis [37]. 219 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint

ST1 exhibits stronger evidence of phylogenetic clustering by geography compared to ST2 242
To begin our comparison of ST1 and ST2 isolates, we first examined the association between 243 phylogenetic and geographic structure by overlaying the geographic site each isolate was 244 collected from onto strain-specific whole genome phylogenies. Visual examination of these 245 phylogenies revealed a striking difference in geographic clustering, with ST1 displaying larger 246 clusters and ST2 displaying more numerous, smaller clusters and more geographic mixing 247 ( Figure 1). The exception to this observation was the FQS ST1 clade, which appeared more 248 geographically mixed than the FQR ST1 clades. While statistical assessments demonstrated that 249 both ST1 and ST2 displayed more evidence of geographic clustering than would be expected to 250 occur by chance (empiric p-values both <0.001), clustering was more non-random for ST1 than 251 ST2 (Supplementary Figure 2). This enhanced geographic clustering among ST1 isolates could 252 reflect that our healthcare-based study more completely sampled local transmission networks 253 among ST1 isolates compared to ST2 isolates, or it could reflect ST1 spreading via more 254 localized community or healthcare reservoirs with minimal long-distance transmission. 255 256 ST1 isolates display more evidence of recent transmission than ST2, while ST2 isolates are 257

more likely to share intermediate genetic linkages across disparate geographic sites 258
To further investigate whether plausible healthcare-associated transmission among ST1 isolates 259 was driving the geographic clustering patterns we saw in the phylogenies, we next examined 260 the prevalence and nature of close genetic linkages within each lineage as captured by pairwise 261 SNV distances. Isolates linked by very small SNV distances are plausibly linked via recent 262 transmission, and we would expect our healthcare-based study to more comprehensively 263 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint sample healthcare-associated transmission than community-associated transmission. When 264 examining the SNV distance distributions between and within collection sites, among ST1 265 isolates, we observed more closely related pairs of isolates from the same geographic collection 266 site (reflected by a heavier lower tail of the distribution) compared to pairs of isolates collected 267 from different geographic collection sites (5 th percentile SNV distance within sites/5 th percentile 268 SNV distance between sites = 0.59, expected ratio 95% interval 0.93-1.00, Figure 2). However, 269 we did not observe this same pattern among ST2 isolates (5 th percentile SNV distance within 270 sites/5 th percentile SNV distance between sites = 1.00, expected ratio 95% interval 0.93-1.00, 271 Figure 2). Application of SNV distance thresholds demonstrated that 88 (44%) ST1 isolates were 272 within 2 SNVs of another isolate from the same geographic collection site compared to 10 273 (5.5%) ST2 isolates (p-value = <0.001). As the SNV threshold was increased to intermediate 274 values of 5 and 10 SNVs, this trend was maintained (all p < 0.001, Figure 3A). Conversely, at the 275 5 and 10 SNV thresholds, linked ST2 isolates were more likely to be linked to an isolate from a 276 different geographic collection site compared to linked ST1 isolates (all p < 0.001, Figure 3A). is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Timed phylogenomic analyses demonstrate evidence of evolutionary rate heterogeneity 302 within and between ST1 and ST2 lineages 303
Our observation that ST2 is more likely to be genomically linked at intermediate SNV thresholds 304 across disparate geographic sites compared to ST1 isolates led us to explore the potential 305 mechanisms underlying this difference. Two factors we hypothesized might contribute to these 306 findings are 1) increased transmission of ST2 via community-based reservoirs that facilitate 307 more rapid spread over large geographic distances and/or 2) a slower average evolutionary rate 308 among ST2 isolates resulting in less genetic changes over larger amounts of time and space. 309 While examining the former hypothesis was beyond the scope of this study, we explored the 310 plausibility of the latter hypothesis by estimating evolutionary rates for ST1 and ST2 using the 311 BEAST Bayesian phylogenetic software [31]. There were 418 ST1 and 418 ST2 isolates included 312 in this analysis; sequences included a mix of newly sequenced and publicly available global 313 genomes in order to maximize temporal and genetic diversity while maintaining a sample size 314 manageable by the BEAST software (Supplementary Table 1). For ST1 isolate selection, we also 315 opted to maintain all FQS ST1 isolates, given our observations that they may display distinct 316 epidemiological patterns from FQR ST1 isolates. 317 318 Temporal signal analyses, while initiated as a necessary precursor to timed phylogenomic 319 analyses in BEAST, revealed interesting differences between the clock-like nature of ST1 and 320 ST2 isolates. While root-to-tip regression analyses suggested similarly weak but sufficient 321 temporal signal to proceed with timed phylogenomic analyses in BEAST (indicated by positive 322 correlation coefficients, Supplementary Figure 3), the more rigorous hypothesis testing date 323 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint randomization tests demonstrated more evidence of temporal signal among ST1 isolates, which 324 passed both the more relaxed CR1 and more stringent CR2 criteria for temporal analyses, 325 compared to ST2 isolates, which passed CR1 but not CR2 (Supplementary Figure 4). The root-326 to-tip regression also highlighted different temporal patterns among FQS-ST1 isolates 327 compared to FQR-ST1 isolates, which was observed again in date randomization tests on FQS-328 ST1 and FQR-ST1 isolates separately; the FQR-ST1 isolates appeared to drive the temporal 329 signal in the data, and when considered alone, FQS-isolates were more like ST2 isolates, passing 330 the more relaxed CR1 temporal signal criteria but not the more stringent CR2. This observation 331 was consistent with our pairwise SNV distance findings of distinct patterns among FQS ST1 332 isolates, and motivated conducting further analyses both with all ST1 isolates together as well 333 as with FQR ST1 isolates (n = 359) and FQS ST1 isolates (n = 59) considered separately. Supplementary Results for details). Overall, when considering all ST1 isolates together 339 compared to all ST2 isolates, evolutionary rate estimates were slightly higher for ST1 compared 340 to ST2, although the 95% credible intervals overlapped. However, ST1's faster evolutionary rate 341 was driven by FQR ST1 isolates; when separating out FQS and FQR ST1 isolates, the FQR ST1 342 evolutionary rate estimates emerged as significantly higher than that of ST2 isolates (with non-343 overlapping 95% credible intervals) while FQS ST1 isolates had similar evolutionary rate 344 estimates to ST2 isolates (Figure 4). These evolutionary rates translate to approximately 1.36 345 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint

DISCUSSION 361
In this study, we investigated the genomic epidemiology of two dominant C. difficile lineages, 362 ST1 and ST2, across three geographically distinct U.S. medical centers. We observed more 363 genomic evidence of geographic clustering and recent transmission among ST1 isolates 364 compared to ST2 isolates, while also finding more linkages among ST2 isolates from disparate 365 geographic collection sites at intermediate genomic linkage thresholds. Lastly, we estimated a 366 slightly more rapid average evolutionary rate for FQR ST1 isolates compared to FQS ST1 isolates 367 and ST2 isolates using Bayesian timed phylogenomic methods. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint phylogenomic analyses were consistent with this framework in two ways: 1) high evolutionary 440 rate heterogeneity in both ST1 and ST2 isolates may reflect the effects of spore formation, with 441 isolates emerging for a long-dormant spore being found on the tips of phylogenetic branches 442 with a slow estimated evolutionary rate and 2) less evidence of temporal signal and slightly 443 lower estimated evolutionary rates for FQS-ST1 isolates and ST2 isolates compared to FQR-ST1 444 isolates may reflect more time spent in spore-form. Whatever the biological and 445 epidemiological underpinnings of the patterns we observed, this work highlights the challenges 446 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint inherent to applying molecular clock-based methods to studying the epidemiology and 447 evolution of a variably and relatively slowly evolving pathogen like C. difficile. 448 449 Our findings should be interpreted in the context of multiple limitations. First, the retrospective 450 nature of the study resulted in some differences in sample collection between the three study 451 sites: UM and TMC selected based off of PCR Ribotypes, which we then filtered down to only 452 ST1 and ST2 via in silico MLST, while MSKCC originally selected isolates based off of ST as MLST 453 is routine at that center. However, all comparisons were made between ST1 and ST2 isolates 454 and these differences were consistent within the ST1 and ST2 isolates at each site, so we would 455 not expect them to significantly alter the results reported here. Second, limited epidemiologic 456 metadata was available for analysis: only study site and collection date. Despite this, the 457 interesting patterns we observed between genomic linkages and epidemiologic linkages 458 emphasizes the value of integrating genomic data with even limited epidemiologic metadata. 459 Finally, the evolutionary rate estimates presented here are subject to uncertainty, particularly 460 given the observed instances of violated model assumptions and relatively limited temporal 461 signal in the data. However, the overall trends remained stable with varying models, alleviating 462 concerns that our findings are artifacts of model misspecification. This study also has several 463 notable strengths, including the collection of isolates from three distinct geographic sites in the 464 U.S., the application of whole genome sequencing for high-resolution typing and phylogenetic 465 analyses, and the incorporation of global isolates for increased context and power in our timed 466 phylogenomic analyses. 467 468 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint

Conclusions 469
Examination of the genomic epidemiology of C. difficile ST1 and ST2 across three geographically 470 distinct U.S. medical centers revealed divergent epidemiologic and evolutionary patterns 471 between these two common strains. Specifically, we observed more evidence of geographic 472 clustering, recent healthcare transmission, and a slightly more rapid average evolutionary rate 473 among FQR ST1 isolates compared to ST2 and FQS ST1 isolates. One implication of these 474 findings is that an understanding of local molecular epidemiology may facilitate the 475 development of effective interventions targeted at reducing the burden of CDI. These findings 476 also highlight how methodological considerations-including incorporating genomic context 477 when inferring transmission from genomic linkages and considering the potential effect of 478 spore formation on the connection between genomic differences and epidemiology-need to 479 be accounted for when applying genomic epidemiology methods for studying C. difficile 480 transmission. 481 482 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint

CONFLICTS OF INTEREST 490
The authors declare that there are no conflicts of interest. 491 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 30, 2020. ; https://doi.org/10.1101/2020.11.28.20240192 doi: medRxiv preprint