Back-to-Africa introductions of Mycobacterium tuberculosis as the main cause of tuberculosis in Dar es Salaam, Tanzania

In settings with high tuberculosis (TB) endemicity, distinct genotypes of the Mycobacterium tuberculosis complex (MTBC) often differ in prevalence. However, the factors leading to these differences remain poorly understood. Here we studied the MTBC population in Dar es Salaam, Tanzania over a six-year period, using 1,082 unique patient-derived MTBC whole-genome sequences (WGS) and associated clinical data. We show that the TB epidemic in Dar es Salaam is dominated by multiple MTBC genotypes introduced to Tanzania from different parts of the world during the last 300 years. The most common MTBC genotypes deriving from these introductions exhibited differences in transmission rates and in the duration of the infectious period, but little difference in overall fitness, as measured by the effective reproductive number. Moreover, measures of disease severity and bacterial load indicated no differences in virulence between these genotypes during active TB. Instead, the combination of an early introduction and a high transmission rate accounted for the high prevalence of L3.1.1, the dominant MTBC genotype in this setting. Yet, a longer co-existence with the host population did not always result in a higher transmission rate, suggesting that distinct life-history traits have evolved in the different MTBC genotypes. Taken together, our results point to bacterial factors as important determinants of the TB epidemic in Dar es Salaam.

medRxiv preprint doi: https://doi.org/10.1101/2022.09.29.22280296; this version posted September 30, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Within the MTBC, nine human-adapted phylogenetic lineages have been described to date: lineage 1 (L1) to L9. Even though the members of the MTBC are highly clonal, and individual strains share more than 99% DNA sequence similarity (3), clinical strains differ in their phenotypes (4). For example, MTBC strains have been reported to exhibit variable growth rates in macrophages, differences in the host immune responses elicited, differences in gene expression, as well as differences in transmissibility (4-7).

The MTBC as a whole is hypothesized to have originated in East Africa, which is supported by the MTBC genetic diversity being greatest in that part of the world (8). It has been further hypothesized that at some point during its evolution, the MTBC spread out of Africa and diversified in different regions around the world (9, 10). Throughout the last 600 years, lineages that evolved outside of Africa were brought back to Africa following waves of exploration, trade and conquest (11-14). Despite centuries of trade and migration, many MTBC genotypes remain highly restricted to specific geographical regions where, in some cases, they have also been associated with particular human ethnicities. For example, L1 occurs mainly along the rim of the Indian Ocean, L5 is restricted to West Africa and has been associated with the Ewe ethnicity in Ghana (15), and the Beijing sublineage of L2 has been linked to the Hui ethnicity (15, 16). By contrast, L4 occurs worldwide, although many L4 sublineages are restricted to certain geographical regions, such as L4.6.1, which is strongly linked to Uganda and some neighboring countries, or L4.5, which mainly occurs in Asian countries (11). Importantly, frequencies of lineages and sublineages can differ markedly between neighboring countries (17), and even within a single country (18). Such patterns of phylogeographical associations are compatible with the notion that MTBC genotypes might be locally adapted to specific human populations. This notion is supported by the observation that these patterns remain stable in cosmopolitan settings (19-21). However, alternative explanations for the phylogeographical associations of particular MTBC genotypes can be invoked, such as founder effects, or the notion that some genotypes might behave like ecological specialists (11). Based on the current knowledge, [...]

The TB epidemic in Dar es Salaam - patient [...] Figure S1). Patients without a bacterial WGS available (n=652, 38%) had a significantly lower chest X-ray score than patients with a bacterial WGS available (χ² test, p = 0.001), suggesting that viable bacteria are more likely to appear in sputum from patients with increased lung damage. There were no other differences in the sociodemographic and clinical characteristics between patients with and without bacterial WGS data (Supplementary Table 1).

The phylogenetic analysis of the 1,082 MTBC genomes revealed that four of the nine known human-adapted MTBC lineages circulate in Dar es Salaam (Figure 1) [...]

[...] between African and European strains is not possible to infer as the latter have disappeared with the decline of TB in Europe (13). The exception was L4.6, whose ancestors seemed to have originated in Central Africa. These findings are in agreement with previous studies quantifying MTBC dispersal from and towards Africa (12-14, 23). In summary, even though East Africa is the most probable origin of the MTBC as a whole (10), the strains sampled in our cohort were most likely introduced into Tanzania from different parts of the world.

We next determined the MTBC introductions into Tanzania that more successfully spread within Dar es Salaam, as well as the timing of these introductions. We then used the latter as an approximation for the time that these different MTBC populations evolved with this host population. We reasoned that the most successful introductions were those that left more descendants, and which therefore were more prevalent in our patient population. We identified introductions into Tanzania that led to at least 12 sampled cases within our patient cohort (Figure 1). Based on dated trees generated for each lineage separately, we dated each introduction according to estimated lineage-specific substitution rates from our data and from other publications (see Methods for further details, Supplementary Table 8).

In total, we identified ten independent introductions represented by at least 12 monophyletic strains leading to TB cases in our cohort. These strains have thus contributed 8.3% of all current cases ( [...]

In summary, strains belonging to the four MTBC lineages L1, L2, L3, and L4 were introduced into Tanzania on multiple occasions between an estimated 20 and 312 years ago from diverse regions of the world. Following their introduction, these strains have diversified in Tanzania, and some introductions became the source of many TB cases in our cohort while others were not as successful.

Early and recently introduced strains do not differ in virulence

We hypothesized that strains that have been circulating for a longer period could be better adapted to the host population residing in Dar es Salaam compared to strains introduced more recently. For each of the Dar es Salaam genomes, we identified the most basal node that had only Tanzania as the inferred ancestral geographical range and estimated its age. Due to the high uncertainties in estimating substitution rates (31), we used the relative ages of introduction instead of the absolute ages. A strain from Dar es Salaam was defined as "early-introduced" if the most basal node having Tanzania as the inferred ancestral range had a relative age of greater than 0.2 relative to the age of the most recent common ancestor (MRCA) of the respective lineage the strain belonged to. Thus, at least 20% of a genotype's evolutionary history had to have happened in Tanzania for the descendants of a particular introduction to be considered "early-introduced". Conversely, all the descendants of introductions dated to have occurred at less than 0.2 of the total age of the tree were considered "recently-introduced".
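The relative-age rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the class and function names are hypothetical, the ages in the usage note are made up, and the handling of a relative age of exactly 0.2 (which the text leaves open) is an assumption.

```python
from dataclasses import dataclass

EARLY_THRESHOLD = 0.2  # relative-age cutoff stated in the text


@dataclass
class Introduction:
    """One inferred introduction into Tanzania (hypothetical container)."""
    name: str
    introduction_age: float  # age of the most basal Tanzania-only node
    lineage_mrca_age: float  # age of the lineage MRCA, in the same units


def relative_age(intro: Introduction) -> float:
    """Age of the introduction as a fraction of the lineage MRCA age."""
    if intro.lineage_mrca_age <= 0:
        raise ValueError("lineage MRCA age must be positive")
    return intro.introduction_age / intro.lineage_mrca_age


def classify(intro: Introduction, threshold: float = EARLY_THRESHOLD) -> str:
    """Label an introduction as early- or recently-introduced.

    Assumption: a relative age exactly equal to the threshold is
    treated as recently-introduced.
    """
    if relative_age(intro) > threshold:
        return "early-introduced"
    return "recently-introduced"
```

With illustrative values, `classify(Introduction("A", 30.0, 100.0))` gives a relative age of 0.3 and is labeled early-introduced, whereas `classify(Introduction("B", 10.0, 100.0))` is labeled recently-introduced. Passing a different `threshold` mirrors the sensitivity check with alternative introduction age thresholds.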

We found that the TB epidemic in Dar es Salaam was driven to almost equal parts by early-introduced (52%) and recently-introduced strains (48%). However, there were marked differences between lineages: while for L1 and L3 most strains were classified as early-introduced (83.5% and 78.4%, respectively), most strains in L4 (92.4%) and all in L2 were classified as recently-introduced. We hypothesized that early-introduced strains could be locally adapted to the patient population in Dar es Salaam, which might be reflected in differences in virulence between early-introduced and recently-introduced strains. We defined virulence as the degree of harm caused to the patient, and used the following three measures of disease severity as proxies for virulence: TB score, chest X-ray score, and bacterial load. We found that whether a strain was early-introduced or recently-introduced did not influence the disease severity in the infected patients based on these three proxies (Supplementary Table 6). Applying different introduction age thresholds did not reveal any differences either.

In summary, we found that the TB epidemic in Dar es Salaam is driven both by early- [...]

[...] was mainly driven by the most common genotype descending from "Introduction 10" within L3.1.1 (Figure 3), which was introduced earlier than others (i.e. estimated relative age of 0.33 or 312 years ago, Figure 2C [...]

[...] lineages (31), the same SNP threshold might reflect different properties in different genetic backgrounds. To account for this, we also defined clusters based solely on time to the most recent ancestor, using an increasing age threshold from five to 20 years, for each of the four most successful introductions. All genomes that had a common ancestor dated at a certain number of years ago, based on the estimated substitution rate for each lineage, were considered to belong to a transmission cluster.

Using the identified clusters, we calculated the secondary case rate ratios comparing

To allow for potential confounding factors, we carried out a multivariable regression analysis with each of the three proxies for recent transmission as the outcome variable independently (15 years, 5 SNPs, terminal branch length). We found L2. [...]

[...] representatives descending from introductions 10, 5, 9, and 1, respectively (Figure 5 [...]

[...] with L3 being the most abundant. We found a high genetic diversity for L1, L3, and L4, and by comparing this diversity to a global collection of MTBC genomes, we found that [...]

[...] have been a virgin soil for TB before the introduction of L3.1.1, which is in concordance with old medical reports from colonial times stating that TB was rare before European contact (56). On the other hand, however, the currently available evidence points to East Africa as the most likely origin of the MTBC as a whole (3, [...]

We searched for determinants of the evolutionary success of the different MTBC genotypes sampled in our patient cohort. We defined as more successful those genotypes represented by a higher number of direct descendants resulting from 10 main independent introductions into Tanzania. A possible scenario would be that genotypes that have been introduced earlier would have attained a higher prevalence. [...]

[...] estimated period of infectiousness could reflect differences in latency periods of these different MTBC genotypes but could also be affected by differences in sampling proportions linked to potential differences in disease progression. One study in Gambia found that individuals infected with MTBC L6 (also known as Mycobacterium [...]

[...] the high intermingling between different districts within Dar es Salaam, host heterogeneity seems an unlikely explanation for our observations but remains to be [...] between 1986-1991 to 13% between 2006-2008 (48). This observation is consistent with the increased transmission rates of L3.1.1 reported here. Generally, L3 has been associated with low transmission (7, 34), but in East African countries, L3 subgroups other than L3.1.1 also attain relatively high prevalence, contradicting that notion (71, 72).

Our study was limited in that the patient recruitment was hospital-based, which could have influenced our sampling. Typically, patients seek care once they feel ill, and it is therefore possible that at that stage of active disease, differences in virulence traits are small. Performing passive hospital-based sampling could also miss subclinical cases, which might still contribute to transmission and thus lead to an underestimate of the prevalence of MTBC genotypes that cause less severe disease. Our observation that patients without MTBC WGS available had a significantly lower chest X-ray score than patients with a bacterial WGS available possibly reflects such a sampling bias.

However, the fact that we found no association between disease severity and MTBC genotype argues against a systematic recruitment bias related to genotype-specific differences in disease severity. [...]

[...] -RTL 0.99), whereby 10 genomes were kept for each country included (-mc 10) [...]

[...] minimum read depth at a position of 7x and without strand bias. We excluded positions in repetitive regions such as PE, PPE, and PGRS genes or phages, as described previously (11). The resulting VCF file was then used to create a whole-genome [...]

[...] rate, we selected for each lineage all the samples with known date of isolation from the reference set as well as the Dar es Salaam samples. To test for temporal signal, we performed a date randomization test by running LSD v0.3beta (89) 100 times with randomly shuffled dates of isolation, as done previously (14). All lineages except for L1 passed the date randomization test. We then estimated the substitution rate using LSD for L2, L3, and L4. The substitution rate obtained was used to date the complete dataset including the samples with unknown date of isolation for each lineage. Since L1 did not pass the date randomization test, we took the LSD-based estimate from [...]

[...] country instead of subcontinental regions in order to explicitly look at Tanzania.
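The logic of the date randomization test mentioned in this section can be illustrated with a small Python sketch. It does not call LSD: as a stand-in, the clock rate is estimated as the slope of a root-to-tip regression of genetic distance on sampling date, which is an assumption made purely for illustration. The function names and the pass criterion (the observed rate must exceed every rate obtained from shuffled dates) are likewise assumptions; the source does not state which criterion was applied.

```python
import random


def regression_slope(x, y):
    """Least-squares slope of y on x (stand-in for a clock-rate estimate)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def date_randomization_test(dates, distances, n_replicates=100, seed=0):
    """Compare the observed rate with rates from shuffled sampling dates.

    Returns (observed_rate, shuffled_rates, passed). Assumed criterion:
    the test passes if the observed rate is larger than every rate
    estimated after shuffling the dates across tips.
    """
    rng = random.Random(seed)
    observed = regression_slope(dates, distances)
    shuffled_rates = []
    for _ in range(n_replicates):
        shuffled = list(dates)
        rng.shuffle(shuffled)  # break the link between tip and date
        shuffled_rates.append(regression_slope(shuffled, distances))
    passed = observed > max(shuffled_rates)
    return observed, shuffled_rates, passed
```

For a synthetic dataset with a strict clock, for example `dates = list(range(2000, 2020))` and `distances = [10 + 0.5 * (d - 2000) for d in dates]`, the observed slope is 0.5 and the shuffled replicates yield lower slopes, so the test passes. A dataset without temporal signal would typically fail, as reported here for L1.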

According to the output of PastML, we identified the introductions of a lineage into Tanzania, extracted the ages of these introductions, and identified the Dar es Salaam genomes resulting from each introduction. Introductions into Tanzania were considered as more successful when they led to at least 12 TB cases in our cohort.

For each introduction, we extracted the time since introduction, both as absolute age as well as relative age compared to the age of the MRCA of that lineage. Genomes of strains resulting from an introduction with a relative age of more than 0. [...]

[...] method was used. The thresholds taken as cutoff for patient-to-patient transmission were five, eight, twelve, and fifteen SNPs. For the clustering based on the age threshold, all nodes were extracted from the dated phylogenetic tree where the node age was equal to or below the threshold and the parent node was older than the threshold. Then, all the tip descendants of a node were considered to be in a cluster. The thresholds applied were five, ten, fifteen, and twenty years.
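The age-threshold clustering rule above can be sketched as a tree traversal. The following Python example is a minimal illustration on a hand-built tree, not the code used in the study: the `Node` class, the function names, and the example ages are hypothetical. Two further assumptions are made: only internal nodes can define a cluster (so isolated tips remain unclustered), and a root younger than the threshold is treated as if its missing parent were older than the threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Node of a dated tree; age is in years before the latest sample."""
    name: str
    age: float
    children: List["Node"] = field(default_factory=list)

    def is_tip(self) -> bool:
        return not self.children


def tip_names(node: Node) -> List[str]:
    """Collect the names of all tip descendants of a node."""
    if node.is_tip():
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(tip_names(child))
    return names


def age_threshold_clusters(root: Node, threshold: float) -> List[List[str]]:
    """Return clusters defined by nodes at or below the age threshold
    whose parent node is older than the threshold."""
    clusters: List[List[str]] = []

    def visit(node: Node, parent_age: Optional[float]) -> None:
        if node.is_tip():
            return  # assumption: single tips do not form a cluster
        parent_older = parent_age is None or parent_age > threshold
        if node.age <= threshold and parent_older:
            clusters.append(sorted(tip_names(node)))
            return  # all descendants already belong to this cluster
        for child in node.children:
            visit(child, node.age)

    visit(root, None)
    return clusters


# Hypothetical dated tree: ((a,b)X:4,(c,(d,e)Z:8)Y:18)root:50
example_tree = Node("root", 50.0, [
    Node("X", 4.0, [Node("a", 0.0), Node("b", 0.0)]),
    Node("Y", 18.0, [
        Node("c", 0.0),
        Node("Z", 8.0, [Node("d", 0.0), Node("e", 0.0)]),
    ]),
])
```

On this toy tree, a five-year threshold groups only `a` and `b`, a ten-year threshold additionally groups `d` and `e`, and a twenty-year threshold merges `c`, `d`, and `e` into one cluster, which shows how clusters grow as the threshold increases from five to twenty years.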

Secondary case rate ratios were calculated as described in (19) [...]

[...] process, with 'birth' events corresponding to transmission events from one host to another (occurring at a rate λ), while 'death' events occur when a host becomes uninfectious due to recovery or death (occurring at a rate δ). The effective reproductive [...]

[...] city level of sampling (Figure S9; [...]