Polymorphisms affecting expression of the vaccine antigen factor H binding protein influence invasiveness of Neisseria meningitidis

Many bacterial diseases are caused by organisms that ordinarily are harmless components of the human microbiome. Effective interventions against these conditions requires an understanding of the processes whereby symbiosis or commensalism breaks down. Here, we performed bacterial genome-wide association studies (GWAS) of Neisseria meningitidis, a common commensal of the human respiratory tract despite being a leading cause of meningitis and sepsis. GWAS discovered single nucleotide polymorphisms (SNPs) and other bacterial genetic variants associated with invasive meningococcal disease (IMD) versus carriage in several loci across the genome, revealing the polygenic nature of this phenotype. Of note, we detected a significant peak around fHbp, which encodes factor H binding protein (fHbp); fHbp promotes bacterial immune evasion of human complement by recruiting complement factor H (CFH) to the meningococcal surface. We confirmed the association around fHbp with IMD in a validation GWAS, and found that SNPs identified in the validation affecting the 5' region of fHbp mRNA alter secondary RNA structures, increase fHbp expression, and enhance bacterial escape from complement-mediated killing. This finding mirrors the known link between complement deficiencies and CFH variation with human susceptibility to IMD, highlighting the central importance of human and bacterial genetic variation across the fHbp:CFH interface in IMD susceptibility, virulence, and the transition from carriage to disease.


156
To explore an alternative mechanism, we examined the effect of the SNPs around the RBS 157 and α-RBS sequence using SHAPE chemistry to probe the RNA secondary structure using a 158 183 nt RNA encompassing the RBS at position -8 to -12 (relative to the translation start site), 159 and fHbpS-7T/C and fHbpS13A/G. The secondary structure model based on SHAPE reactivity 160 data of fHbpS-7C/S13G at 37 o C ( Figure 4A) is consistent with the RBS being base-paired and 161 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint masked through the formation of a relatively long imperfect helix of 11 base pairs that 162 includes both anti-RBS sequences 1 (αRBS-1) and 2 (αRBS-1) [29]; the polymorphic sites in the 163 carriage associated fHbpS-7C/S13G structure form a G:C base pair at the top of the helix. 164 However, the local RNA structure of the IMD-associated fHbpS-7T/S13A shows significant 165 differences ( Figure 4B-C is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. in fHbp, and adaptation has also not between identified between paired isolates sampled 190 from blood and cerebrospinal fluid [33,34]. We found that the genetic architecture of 191 meningococcal virulence is polygenic, adding to the growing understanding on virulence 192 factors influencing the risk of IMD [17,18,35,36]. Although several loci across the genome 193 were identified, the major signals associated with IMD versus carriage emanated from the 194 fba-fHbp region. This is particularly significant as GWAS of human genetic susceptibility 195 identified SNPs in CFH associated with IMD, although the mechanism remains unknown [4, 196 37]. Our results therefore highlight that the interaction of bacterial and human gene pools 197 across a single molecular interface, involving fHbp and CFH, dictates host susceptibility and 198 the propensity of strains to cause invasive disease. 199 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint

Sampling frames 201
The discovery sample collection comprised 261 Neisseria meningitidis isolates from the Czech 202 Republic [8,9,10]. Carriage samples were collected from young adults with no association 203 with patients with IMD over four months, while disease isolates were from cases of IMD 204 submitted to the Czech National Reference Laboratory for Meningococcal Infections in 1993 205 [8,10]. 252 isolates from this collection were described previously to identify the MDA island 206 by PCR, but not by WGS. Illumina sequencing reads were downloaded from the European 207 Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena), and Velvet de novo assemblies from 208 PubMLST (https://pubmlst.org/neisseria/) [24]. PubMLST IDs, ENA accession numbers and 209 phenotypes can be found in Supplementary Table 6. with de novo assemblies ³2Mb in length, excluding the 261 genomes from the discovery 214 sample collection and excluding genomes that met any of the following criteria: i) annotated 215 as non-Neisseria meningitidis, ii) annotated with the disease phenotype "other", iii) non-ST-216 41/44 complex assignment (described below), iv) genomes with more than 700 contigs, v) 217 genomes with only one or two contigs, and vi) genomes with a total assembly length greater 218 than 2.5Mb. PubMLST IDs and phenotypes can be found in Supplementary Table 7. 219 220

SNP calling and kmer counting 221
For the discovery sample collection, sequence reads were mapped against the reference ST-222 11 complex, FAM18 genome (GenBank number: NC_008767.1) using Stampy [38]. Bases were 223 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. called using previously described quality filters [39,40,41]. We identified 150,502 biallelic 224 SNPs, 6,063 tri-allelic SNPs, and 239 tetra-allelic SNPs. 225

226
To capture non-SNP-based variation, and SNPs not in the reference genome, we pursued a 227 kmer-based approach where all unique 31 bp haplotypes were counted from Velvet 228 assemblies using dsk [42] in both sample collections. For both sample collections, a unique 229 set of variably present kmers across each data set was created, with the presence or absence 230 of each unique kmer determined per genome. Algorithms, coded in C++, can be downloaded 231 from 232 https://github.com/jessiewu/bacterialGWAS/blob/master/kmerGWAS/gwas_kmer_pattern 233 [16]. We identified 7,806,583 variably present kmers in the discovery sample collection and 234 11,114,868 in the replication study collection. 235 236

Phylogenetic inference 237
A maximum likelihood phylogeny was estimated for the discovery study collection for 238 visualisation of phylogenetic relationships in the sampling frame and for SNP imputation 239 purposes using RaxML [43] with a general time reversible (GTR) model and no rate 240 heterogeneity, using alignments from the mapped data based on biallelic sites, with non-241 biallelic sites being set to the reference allele. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. For the SNP discovery analysis, missing sites due to sequencing ambiguity or strict SNP-calling 251 thresholds were imputed using ancestral state reconstruction [44] implemented in 252 ClonalFrameML [45]. This approach was previously shown to achieve high accuracy [16]. 253 254

Correcting for multiple testing 255
Multiple testing was accounted for by applying Bonferroni correction to control the strong-256 sense family-wise error rate (ssFWER) [46]. The unit of correction for all studies of individual 257 loci in the discovery GWAS was taken to be the number of unique "phylopatterns" i.e. the 258 number of unique partitions of individuals according to allele membership. The locus effect 259 of an individual variant was considered to be significant if its P value was smaller than α/np, 260 where we took α = 0.05 to be the genome-wide false positive rate (i.e. family-wide error rate, 261 FWER) and np to be the number of unique phylopatterns. In the discovery SNP-based analysis, 262 np was taken to be the number of unique SNP phylopatterns (80,099) and in the kmer-based 263 analyses, np was taken to be the number of unique kmer presence/absence phylopatterns 264 (307,830). 265

266
In the replication GWAS, since we tested whether the two genome-wide significant, disease-267 associated kmers in the fba-fHbp operon replicated, we applied a Bonferroni correction to 268 obtain a significance threshold of 0.05/2=10 -1. 60 . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. implemented in the software GEMMA [13] to control for population structure. We modified 279 GEMMA version 0.93 to enable significance testing of non-biallelic variants [16]. GEMMA was 280 run using no minor allele frequency cut-off to include all variants. 281 282 For the discovery GWAS, we computed the relatedness matrix from biallelic SNPs only using 283 the option "-gk 1" in GEMMA to calculate the centred relatedness matrix. For the replication 284 study, we computed the relatedness matrix from the kmer presence/absence matrix using 285 Java code which also calculates the centred relatedness matrix. 286 287

Identifying lineage effects 288
We tested for associations between bacterial lineages and the phenotype in the discovery 289 sample collection using the R package bugwas (https://github.com/sgearle/bugwas) [16]. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint

Estimating sample heritability 296
Heritability of the sample, the proportion of the phenotypic variation that can be explained 297 by the bacterial genotype, was estimated using the LMM null model in GEMMA from post-298 imputation biallelic SNPs [13]. Estimating heritability in case control studies is dependent on 299 the prevalence of cases and the sampling scheme [47]. The proportion of sample heritability 300 explained by the kmers fbaK898 and fbaK899 in the discovery set was estimated by including the 301 phylopattern of the two kmers as a covariate in the LMM null model in GEMMA, and 302 calculating the difference in heritability compared to including no covariates. 303 304

Testing for independent SNP associations 305
To determine whether pairs of significant SNP associations in the discovery sample collection 306 represented independent signals, the two unique significant SNP patterns were tested using 307 LMM including both SNPs as fixed effects, thereby assuming additivity between the two loci. 308 309

Variant annotation 310
SNPs were annotated in R using scripts at http://github.com/jessiewu/bacterialGWAS. The 311 reference FASTA and GenBank files were used in order to determine SNP type (synonymous, 312 non-synonymous, nonsense, read-through and intergenic), codon, codon position, reference 313 and non-reference amino acids, gene name and gene product, on the assumption of a single 314 change to the reference sequence. 315

316
To annotate kmer sequences, we mapped kmers to the reference FAM18 genome using 317 Bowtie2 [48] and the options "-r -D 24 -R 3 -N 0 -L 18 -i S,1,0.30" to identify a single best 318 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint mapping position for each kmer. For kmers which did not map to the reference genome, 319 BLAST [49] was used to identify the kmer position within FAM18. BLAST results of any 320 sequence length were taken, and the number of mismatches along the whole length of the 321 kmer was recalculated assuming the whole kmer aligned. Kmers with five or fewer 322 mismatches to the reference were shown as aligned to the reference, all other kmers were 323 shown as unaligned to the reference. 324

325
To understand the variation captured by the significant kmers in the gene csb, BLAST [49] was 326 used to extract all copies of the MC58 (Genbank accession number NC_003112.2) allele of 327 csb, the allele that the significant csb kmers mapped to. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. Buffer B after washing with buffer C (50 mM Na-phosphate pH 6.0, 500 mM NaCl, 30 mM 382 Imidazole), and dialysed overnight at 4 o C into Tris pH 8.0. fHbp v1.1 I311A expression and 383 purification was performed as described previously [51]. 384 385

Electrophoresis mobility shift assays 386
Gel retardation assays were carried out as previously using purified FNR D154A , which forms 387 functional FNR dimers under aerobic conditions [52]. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. SHAPE experiments were performed using RNA transcribed in vitro from cDNA sequence [53]. 425 The DNA templates contained a double-stranded T7 RNA polymerase promoter sequence 426 (TTCTAATACGACTCACTATA) followed by the sequence of interest (Supplementary Table 5 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint gels was then performed to separate fragments. Band-intensities were quantified using SAFA, 438 version 1.1 Semi-Automated Footprinting Analysis [56]. 439 440 All structure calculations were performed using RNAstructure software [57]. ΔG°SHAPE free 441 energy change values were added to the thermodynamic free energy parameters for each 442 nucleotide [58,59] The harmonic mean p-value (HMP) method performs a combined test of the null hypothesis 451 that no p-value is significant [61]. The HMP method controls for the ssFWER like the 452 Bonferroni correction. We applied the HMP procedure to the fba-fHbp region in the discovery 453 and replication studies, including all unique kmer phylopatterns that mapped to either of the 454 two genes plus their upstream intergenic regions. We calculated the asymptotically exact p-455 values using the p.hmp function from the R package 'harmonicmeanp', giving equal weight to 456 all kmer phylopatterns, and the total number of tests performed genome-wide was set to be 457 the number of kmer phylopatterns (np) in order to control the genome-wide ssFWER despite 458 analysing just the fba-fHbp region. We adjusted the p-value by dividing it by the sum of the 459 weights of the kmer phylopatterns included in the fba-fHbp region so that it could be directly 460 compared to alpha, the intended ssFWER, which we set to be 0.05. 461 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.21249443 doi: medRxiv preprint