Antimicrobial resistance determinants are associated with Staphylococcus aureus bacteraemia and adaptation to the hospital environment: a bacterial genome-wide association study

Background Staphylococcus aureus is a major bacterial pathogen in humans, and a dominant cause of severe bloodstream infections. Globally, antimicrobial resistance (AMR) in S. aureus remains challenging. While human risk factors for infection have been defined, contradictory evidence exists for the role of bacterial genomic variation in S. aureus disease. Methods To investigate the contribution of bacterial lineage and genomic variation to the development of bloodstream infection, we undertook a genome-wide association study comparing bacteria from 1017 individuals with bacteraemia to 984 adults with asymptomatic S. aureus nasal carriage. Within 984 carriage isolates, we also compared healthcare-associated (HA) carriage with community-associated (CA) carriage. Results All major global lineages were represented in both bacteraemia and carriage, with no evidence for different attack rates. However, kmers tagging trimethoprim resistance-conferring mutation F99Y in dfrB were significantly associated with bacteraemia-vs-carriage (p=10-8.9-10-9.3). Pooling variation within genes, bacteraemia-vs-carriage was associated with the presence of mecA (HMP=10-5.3) as well as the presence of SCCmec (HMP=10-4.4). Among S. aureus carriers, no lineages were associated with HA-vs-CA carriage. However, we found a novel signal of HA-vs-CA carriage in the foldase protein prsA, where kmers representing conserved sequence allele were associated with CA carriage (p=10-7.1-10-19.4), while in gyrA, a ciprofloxacin resistance-conferring mutation, L84S, was associated with HA carriage (p=10-7.2). Conclusions In an extensive study of S. aureus bacteraemia and nasal carriage in the UK, we found strong evidence that all S. aureus lineages are equally capable of causing bloodstream infection, and of being carried in the healthcare environment. Genomic variation in the foldase protein prsA is a novel genomic marker of healthcare origin in S. aureus but was not associated with bacteraemia. AMR determinants were associated with both bacteraemia and hospital-associated carriage, suggesting that AMR increases the propensity not only to survive in hospital environments, but also to cause invasive disease.


51
Background 52 Staphylococcus aureus is a major bacterial pathogen in humans, and a dominant cause of 53 severe bloodstream infections. Globally, antimicrobial resistance (AMR) in S. aureus 54 remains challenging. While human risk factors for infection have been defined, 55 contradictory evidence exists for the role of bacterial genomic variation in S. aureus 56 disease. 57

Methods 58
To investigate the contribution of bacterial lineage and genomic variation to the 59 development of bloodstream infection, we undertook a genome-wide association study 60 comparing bacteria from 1017 individuals with bacteraemia to 984 adults with 61 asymptomatic S. aureus nasal carriage. Within 984 carriage isolates, we also compared 62 healthcare-associated (HA) carriage with community-associated (CA) carriage. 63

Conclusions 76
In an extensive study of S. aureus bacteraemia and nasal carriage in the UK, we found 77 strong evidence that all S. aureus lineages are equally capable of causing bloodstream 78 infection, and of being carried in the healthcare environment. 79 Genomic variation in the foldase protein prsA is a novel genomic marker of healthcare 80 origin in S. aureus but was not associated with bacteraemia. AMR determinants were 81 associated with both bacteraemia and hospital-associated carriage, suggesting that AMR 82 increases the propensity not only to survive in hospital environments, but also to cause 83 invasive disease. 84 and associated with higher mortality. 38  Collaboration (ISAC) which established a multinational prospective cohort study of SAB 176 in 2006. 7 Sequential individuals from these studies over 13 years of age with S. aureus 177 on blood culture were included if there was an isolate available for sequencing, with 178 associated clinical data, and the blood culture had not been deemed to be a contaminant 179 on local clinical review. We identified 1203 cases in patients that were not 180 contaminants, after excluding repeat episodes (775, 232 and 196 at each centre 181 respectively). Bacterial isolates were found from 724, 207 and 163 episodes at each 182 centre respectively, and a minimum clinical data set (see below) was available for 674, 183 187 and 160 cases. We successfully sequenced 1017 of these 1021 cases for inclusion. 184 These included 417 cases from a previously sequenced collection investigating 185 identifying antimicrobial resistance. 47 186 S. aureus isolates from nasal carriage in individuals without S. aureus infection were 187 identified from two studies of S. aureus carriage in Oxfordshire, UK (Figure S2). 188 The first was a study of S. aureus nasal carriage in adults in the community between is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 otherwise it was treated as absent. This produced a set of 23,860,793 variably present 269 kmers, with the presence or absence of each determined per isolate. We identified a 270 median of 2,760,000 kmers per isolate, including variably present kmers and kmers 271 common to all genomes (IQR 2,725,(0)(1)(2)806,000 for locus testing in bacterial GWAS. 63 The association between each SNP and kmer with 285 the phenotype was tested controlling for population structure and genetic background 286 using the linear mixed model (LMM) implemented in GEMMA. 62 We included healthcare 287 or community origin of case/control status as a fixed covariate in the model when 288 testing for associations with the carriage-vs-bacteraemia phenotype. The parameters of 289 the linear mixed model were estimated by maximum likelihood and a likelihood ratio 290 test was performed against the null hypothesis (that each locus has no effect) using the 291 software GEMMA, 62 using a minor allele frequency of 0 to include all SNPs. GEMMA was 292 modified to output the ML log-likelihood under the null and alternative hypothesis and 293 -log10 p values were calculated using R scripts in the bacterialGWAS package. 294 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 Testing for lineage effects. We tested for associations between lineage and phenotype 295 using principal components (PCs) in the R package bugwas (available at 296 https://github.com/sgearle/bugwas), which implements a published method for 297 lineage testing in bacterial GWAS. 63 PCs were computed based on biallelic SNPs using 298 the R function prcomp. To test the null hypothesis of no background effect of each 299 principal component, we used a Wald test 64 against a χ 2 distribution with one degree of 300 freedom to obtain a p value. 301 Kmer mapping and sequence alignment. We used Bowtie 65 to align all 31bp kmers 302 from short-read sequences to the reference genome (MRSA252 55 ). For all 31bp kmers 303 significantly associated with case-controls status, the likely origin of the kmer was 304 additionally determined by nucleotide sequence BLAST 53 of the kmers against a 305 database of all S. aureus sequences in Genbank. We used BLAST 53 to identify the best 306 match for coding sequences of interest in the de novo assembly, and used Jalview 66 to 307 and visualize the assembled sequences. 308

Multiple testing correction. Multiple testing was accounted for by applying a 309
Bonferroni correction; 67 the individual locus effect of a variant (PC, kmer or SNP) was 310 considered significant if its P value was smaller than α/np, where we took α = 0.05 to be 311 the genome-wide false-positive rate and np to be the number of PCs, kmer phylopatterns 312 or SNP phylopatterns. We defined each phylopattern to be a unique partition of 313 individuals by the alleles at that kmer or SNP. 314 The Bonferroni correction represents a conservative approach to controlling for type 1 315 error. The harmonic mean p-value (HMP) has recently been developed as a method to 316 combine alternative hypotheses against the null hypothesis, without sacrificing power, 317 even when the tests are not independent. 68 The HMP was calculated for coding regions 318 using the R package "harmonicmeanp" v3.0 (https://CRAN.R-319 . CC-BY 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021 project.org/package/harmonicmeanp). The HMP across a region was then adjusted for 320 the proportion of kmers mapping to that region: 321 HMPadj = HMP/w 322 (where w = proportion of kmers mapping to that region, compared to the total number 323 of kmers mapping to coding regions). The adjusted HMP was compared directly to the 324 significance threshold αL, being a nominal threshold α (0.05) adjusted for the number of 325 unique p-values being tested. 68 326

327
Sequences from 2001 S. aureus isolates (1,017 cases of bacteraemia and 984 328 asymptomatic nasal carriage controls) were analysed (Table 1). Cases were marginally 329 more likely to be healthcare-associated (38% vs33.5%, p=0.04). Consistent with 330 established risk factors for SAB, 2 cases were significantly older (median age 68 years vs 331 59 years, p<10 -5 ) and more likely to be male (68.4% vs 51.9%, p<10 -5 ). Cases had a 332 higher proportion of MRSA than controls (13.6% vs 5.5%, p<10 -5 ), including when 333 comparing HA cases (63/386 (15.3%)) with HA controls (35/330 (10.6%), p=0.04 (c 2 334 test)). Thus, even in individuals exposed to the healthcare environment, MRSA was 335 found more often in bacteraemia than carriage. Reported focus of infection in cases of 336 SAB showed soft tissue and vascular catheter infections to be the most commonly 337 identified foci (Table S1). Mortality by 30 days was 26.5% (Table S2) is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint throughout the UK. 34,70,71,72 HA cases and controls were distributed throughout the tree, 347 and did not strongly cluster within the population. 348 Formal testing for lineage effects with bugwas 63 supported the absence of strong lineage 349 effects. The third principal component (PC3), which identified the MRSA clade within 350 the CC-22 lineage, was most strongly associated with bacteraemia-vs-carriage (p=0.02), 351 but this was not statistically significant after adjusting for multiple testing (Figure 1, 352 Figure S3). The overall sample heritability was predicted to be low (2.1%, 95% CI 0.0- To investigate associations between genomic content or sequence variation and the 358 carriage-vs-bacteraemia phenotype we tested SNP associations and also used a kmer 359 approach, 60 identifying 31bp DNA sequences (kmers) in the assembled genome and 360 testing for association between their presence and bacteraemia. This approach allowed 361 us to detect variation in the accessory genome, and variants such as small insertions or 362 deletions that are not well captured by mapping SNPs. 363 Testing all identified SNPs for association with case/control status in 2001 isolates did 364 not identify any statistically significant associations between individual SNPs and 365 bacteraemia at the genome wide level when controlling for population structure (Figure 366 S4). However, the SNP coming closest to a statistically significant association (p=10 -5.6 ) 367 was an A to T mutation at position 1497290 in the MRSA252 reference genome. This 368 SNP encodes a phenylalanine to tyrosine substitution at codon position 99 in 369 dihydrofolate reductase (dfrB); this F99Y mutation confers trimethoprim resistance. 47 It 370 was relatively rare, being found in 41 cases and 5 controls, and correlated with 371 resistance to other antibiotics: 36/46 (78%) of isolates with this variant were also 372 methicillin resistant. This variant was found most commonly in isolates from CC-22 373 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint (63%) and . 374 When testing kmers found in all 2001 isolates, we found 1214 kmers significantly 375 associated with carriage. These kmers mapped to multiple sites across the genome, 376 most of which were repeat regions, including 16S rDNA and transposon insertion 377 sequences ( Figure S5A). However, the presence of these kmers was strongly affected by 378 the Illumina sequencing read length, being found in isolates sequenced using 150bp but 379 not 99bp reads ( Figure S5B). De novo assembly was repeated using a constrained kmer 380 length in assembly (up to 79bp) to control for the variation in read length, and when 381 these new assemblies were used, 76% (924/1214) of the previously identified 31bp 382 kmers were no longer significantly associated with the phenotype ( Figure S5c). We 383 concluded that varying length of sequencing was a major source of confounding in 384 kmer-based associated estimates. 385 To avoid false positive results, we therefore restricted the investigation of kmers 386 associated with carriage-vs-bacteraemia to isolates sequenced with 150bp reads (Table  387 2). This reduced set of cases showed similar epidemiological characteristics to the 388 larger group (Table 1). 389 Kmers tagging antimicrobial resistance (AMR) conferring mutations were significantly 390 associated with bloodstream infection. In total, we identified 22,284,204 kmers, 391 occurring in 930,702 unique patterns across 1610 isolates with 150bp reads. Twenty-392 three kmers, occurring in two phylopatterns, were significantly associated with SAB 393 ( Figure 2). These kmers, mapping to a 52bp region in dfrB, exhibited 11.2-11.6-fold 394 increased odds of being found in a disease-causing, rather than carried, S. aureus. This 395 association remained significant after controlling for population structure (p=10 -8.9 -10 - is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint resistance conferring SNP, these kmers were found in low frequencies (35/626 cases 400 and 5/984 controls). 401 No further individual kmers met the threshold for significance, but there were 402 distinctive peaks enriched for small p-values in the Manhattan plot ( Figure 2). We 403 calculated harmonic mean p-values (HMP) to perform aggregate kmer-based tests of 404 association across coding regions of the genome. The HMP can improve power by 405 combining information across variants and reducing the effective number of tests, 406 thereby attracting a less stringent significance threshold. At the whole-genome level, 407 the pooled evidence for association between coding sequence variation and 408 bacteraemia was considerable (HMP=10 -3.3 ). The evidence for kmers associated with 409 bacteraemia, when pooled, was significant in several loci (Figure 2), including the dfrB 410 locus, SAR1439 (HMP=10 -6.9 , HMPadj=10 -3.3 ). The presence of kmers mapping across the 411 SCCmec region was also significantly associated with bacteraemia (HMP=10 -4.4 , 412 HMPadj=10 -2.2 ). The strongest evidence within SCCmec was for mecA (SAR0039, 413 HMP=10 -5.3 , HMPadj=0.02), which encodes the low affinity PBP2a that confers methicillin 414

resistance. 415
When this region was examined in closer detail, high-risk kmers covered the entirety of 416 the SAR0039 locus, encoding PBP2a, a PBP with low affinity for beta-lactams (Figure 417 2B). There are no alternate low-risk kmers in this region, suggesting the presence of 418 this gene, rather than variation within it, is associated with bacteraemia. Thus, genomic 419 sequences associated with AMR to both trimethoprim and beta-lactams, but not other 420 antimicrobial classes, were significantly associated with bacteraemia. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint the genomic factors associated with healthcare environments by conducting a GWAS for 427 HA-vs-CA among carriers (330 HA, 654 CA). We focused only on carriage isolates 428 because we had more extensive data about hospital admissions in this group, and 429 epidemiological data to further demonstrate that isolates from CA-carriage truly 430 reflected community origin. The MRSA CC-22 lineage showed the strongest association 431 with HA-vs-CA carriage (p=0.06, Figure S7), but this lineage effect was even less 432 significant than for bacteraemia-vs-carriage (p=0.02, Figure S3), suggesting that no 433 lineages were strongly associated with healthcare acquisition of S. aureus carriage. 434 In total 124 SNPs were associated with HA-vs-CA carriage after controlling for 435 population structure, and adjusting for the 77,597 SNP phylopatterns in the population 436 ( Figure S8, Table S3). The most significant were non-coding (intergenic) variants in a 437 region encoding tRNAgly at 2034022-2034039. However, basecalls at these sites were 438 frequently imputed (61.5%). Excluding isolates with imputed calls at these sites 439 reduced the unadjusted OR for finding these SNPs in HA-vs-CA carriage from 2.1 to 0.87, 440 suggesting that the observed association was a product of SNP imputation. The most 441 significant coding variants included a G to A substitution at 2417648 in MRSA242 442 (p=10 -7.2 ), encoding a proline to leucine substitution at 707 in SAR2345, an 443 AcrB/AcrD/AcrF family protein, which is a multidrug efflux system subunit. They also 444 included a C to T substitution at 7255 in MRSA252 (p=10 -7.2 ), which encodes L84S in 445 gyrA and confers quinolone resistance. 47 A variant at these positions was called in all 446 984 isolates, so no calls were imputed. A significant association was also seen with a 447 band of 91 low frequency SNPs, with shared minor alleles co-inherited predominantly 448 in the MRSA sub-clade of CC-22 ( Figure S9). The dfrB mutation seen associated with 449 bacteraemia was not significantly associated with HA-vs-CA carriage (p=0.4). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint ( Figure 3B). A small peak of 11 significant kmers mapped to a hypothetical protein 454 SAR0061, covering to a 41bp region from 67,816 (p=10 -7.3 ). Three significant kmers 455 mapped to a putative transcriptional regulator SAR2394, covering a 33bp region at 456 2,465,937(p=10 -7.2 ). Of the remaining significant kmers, 26/188 (13.8%) did not map to 457 the reference genome, and 25/188 (13.3%) mapped to non-coding regions at 1. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10. 1101/2021 to their frequency in carriage. However, across lineages we found evidence that genetic 481 variants underlying AMR were associated with increased odds of bacteraemia versus 482 carriage. These included some determinants of methicillin resistance, contrasting with 483 previous research indicating that methicillin resistance incurs a fitness cost for 484 S. aureus, reducing its pathogenicity. 82,83 485 The most obvious possible explanation for association between methicillin resistance 486 and bacteraemia would be survival benefit in the presence of beta-lactam antibiotics. 487 However more than half of our MRSA bacteraemia cases were community-associated, 488 being first detected prior to or within 48 hours of hospital admission. While we do not 489 have data about pre-hospital antibiotic treatment, it is unlikely that the majority of 490 patients with CA bacteraemia were on beta-lactam antibiotics at the time that 491 bacteraemia developed. It is also possible that other between-group differences, such 492 co-morbid illnesses, may also account for the different prevalence of MRSA observed 493 between patients with bacteraemia and asymptomatic carriers, and we have not been is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10. 1101/2021 Streptococcus pneumoniae, where a PBP variant associated with penicillin tolerance 508 (though not resistance) was associated with meningitis among isolates found in 509 invasive pneumococcal disease. 75 510 Our finding of trimethoprim resistance associated with bacteraemia is not easily 511 explained through direct antibiotic selection since trimethoprim containing antibiotics 512 Overall, we found MRSA was more prevalent in bacteraemia than carriage in samples, 521 when sampling the population representatively. In the UK at this time MRSA was almost 522 exclusively healthcare associated. Consequently, one potential explanation for this 523 finding is unmeasured hospital exposure among CA bacteraemia cases for which 524 hospital exposure data was only available for the preceding 12 weeks. However even 525 restricting to HA associated cases and controls MRSA was associated with bacteraemia. 526 Furthermore, by controlling for population structure, we can be confident the 527 association demonstrated between mecA and bacteraemia is not simply a reflection of 528 hospital adapted lineages, and that such lineage effects could have been detected using 529 the GWAS methods employed here. In fact, while variation within prsA was strongly 530 associated with HA carriage, variation in this gene was not associated with carriage-vs-531 bacteraemia. Likewise, a quinolone resistance mutation in gyrA was associated with HA 532 carriage but not with bacteraemia. These observations suggest that distinct bacterial 533 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint factors favour hospital adaptation or transmission compared with those favouring 534 bloodstream infection. 535 The surface bound foldase protein PrsA has been implicated as a secondary resistance 536 factor for both beta-lactam and glycopeptide antibiotics. 78,79 In vitro assays have shown 537 that S. aureus can survive despite disruption to prsA, but strains with prsA disruption 538 are more susceptible to oxacillin. 78  Our study demonstrates some constraints and pitfalls for bacterial GWAS. Firstly, 547 sequencing read length was a strong source of confounding in our data which, without 548 adequate control, produced false positive results. This is an ongoing challenge for 549 studies pooling existing sequencing data where it is not possible to randomize cases and 550 controls across sequencing batches. We dealt with this using the conservative option of 551 excluding 391 cases from kmer analysis, representing a substantial sacrifice of power. A 552 further limitation was that population structure appeared to be incompletely controlled 553 for low frequency variants, 84 as exhibited by a set of 91 SNPs in strong linkage 554 disequilibrium (LD) associated with HA-vs-CA carriage. Linear Mixed Models are able to 555 control for lineages, and cryptic population structure through the use of a relatedness 556 matrix, but they make the assumption that closely related isolates are unlikely to have 557 large differences in phenotype. This assumption means that they can incompletely 558 account for lineage effects arising from closely related strains which vary significantly 559 in frequency between phenotypes. 84 In our study, the SNPs in LD were found in the 560 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint predominantly HA-MRSA isolates within CC-22. This may represent either a true 561 association with that lineage, or co-carriage with another genomic element in that 562 lineage. In this case, prsA variation was common in CC-22, but not detected in the SNP 563 based study (as the variation was a deletion rather than a nucleotide substitution), and 564 the signal of association accompanying these SNPs was better explained by prsA 565 variation. Overall, the kmer-based methods were more fruitful in our study, both in 566 their ability to detect non-SNP based variation (such as deletions in prsA, and the 567 presence of mecA), and retaining the ability to identify significant SNPs (including a dfrB 568 variant). 569 In previous studies we have identified within-host adaptation of S. aureus genes 570 associated with development of invasive disease from a colonising isolate -particularly 571 involving the agr locus, and the cell wall proteins under regulatory control of Agr and 572 Rsp. 43 Such variation was not associated with bacteraemia in this population-based 573 study, perhaps because it provides only a short-term advantage to the bacteria, and in 574 the longer term is detrimental, by adversely affecting transmission. In contrast, 575 variation which confers AMR is likely to confer a bacterial survival advantage in 576 carriage in the face of antibiotic selection pressure in the healthcare environment, as 577 well as in the bloodstream, allowing these variants to survive in the population. 578

579
In a study of over 2000 isolates from S. aureus bacteraemia and nasal carriage in the UK, 580 we found strong evidence that all S. aureus lineages are equally capable of causing 581 bloodstream infection, and of being carried in the healthcare environment. 582 We found that genomic variation in the foldase protein prsA was a novel genomic 583 marker of healthcare adaptation in S. aureus. This predictor of healthcare-associated 584 carriage was not associated with bacteraemia, while AMR determinants were associated 585 with both bacteraemia and hospital-associated carriage, raising the suggestion that in 586 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint            is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021.          is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint      is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint        is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021.       is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint        is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021     is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ;https://doi.org/10.1101https://doi.org/10. /2021  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint 1005 Figure S1: Flow diagram of cases screened for inclusion in the study 1006 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint 1007 1008 Figure S2: Flow diagram of controls screened for inclusion in the study 1009 1010 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint Significance (negative log10 p-values) of the 10 most significantly associated PCs. A 1019 Bonferroni corrected threshold for significance is 10 -4.6 . 1020 1021 1022 1023 Figure S4: Manhattan plot of SNP associations with carriage-vs-bacteraemia , 1024 controlling for population structure and HA or CA origin. The significance of each 1025 SNP (-log10 P value) is plotted according to location on the 2.9MB MRSA252 reference 1026 genome, with control for population structure and CA-vs-HA origin. SNPs are coloured 1027 according to their predicted effect on protein, and shaped according to the number of 1028 alleles found at that site in the study set. A Bonferroni-corrected significance threshold 1029 based on the number of phylopatterns (n=112, 292) is plotted in red (10 -6.4 ), assuming a 1030 family-wise error rate of 5%. 1031 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; Figure  associated with bacteraemia-vs-carriage were confounded with sequence length. We 1040 plotted significance of association with both bacteraemia-vs-carriage (X axis) and 1041 sequencing read length (Y axis), using a c 2 test, without correction for population 1042 structure. (C) Kmer detection and association testing with GEMMA was repeated, this 1043 time on new assemblies, built using Velvet with a maximum kmer length of 79bp. The 1044 signal of the most significant kmers based on the original assemblies (black) dropped to 1045 below statistical significance when the assemblies were built with a constrained kmer 1046 length (red). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint branches corresponding to the 10 lineages most significantly associated with CA-vs-HA 1069 carriage are coloured on a maximum likelihood phylogeny of the study isolates. Branch 1070 lengths have been square-root transformed to discriminate closely related lineages. 1071 Clonal complexes are denoted in the outermost ring. The second outermost ring 1072 indicates isolate source (blue community, pink hospital) (B) Significance (negative 1073 log10 p-values) of the 10 most significantly associated PCs. A Bonferroni corrected 1074 threshold for significance is 10 -4.3 . 1075 1076 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10.1101/2021.01.13.21249734 doi: medRxiv preprint Figure S10: Presence of significant kmers in the population of CA and HA carriage.

1100
A maximum likelihood tree of carriage isolates is annotated with the predicted origin of 1101 carriage (pink HA, blue CA), the presence of MRSA (orange) and the presence (black) or 1102 absence (green) of each pattern of kmers mapping to prsA, in decreasing order of 1103 statistical significance left to right. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. 1128 Table S2: Mortality at 7, 30 and 90 days, total (percentage). Differences between centres 1129 and 7, 30 and 90 days were not statistically significant (p=0.25, p=0.6 and p=0.9, c 2 1130 test with 2 degrees of freedom) 1131 1132 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 15, 2021. ; https://doi.org/10. 1101/2021