Inferring person-to-person networks of pathogen transmission: is routine surveillance data up to the task?

Inference of person-to-person transmission networks using routinely collected surveillance data is being used increasingly to estimate spatiotemporal patterns of pathogen transmission. Several data types can be used to inform transmission network inferences, yet the sensitivity of those inferences to different data types is not routinely evaluated. We evaluated the influence of different combinations of spatial, temporal, and travel-history data on transmission network inferences for Plasmodium falciparum, the pathogen responsible for most human malaria. After developing a new inference framework and applying it to simulated data, we found that these data types have limited utility for inferring transmission networks and, in some combinations, tend to overestimate transmission. Only when outbreaks were highly focal in time or when travel histories were highly accurate was the inference algorithm able to accurately estimate the reproduction number under control, Rc, a key metric of transmission. Applying this approach to surveillance data from Eswatini indicated that inferences of Rc and spatiotemporal patterns therein are sensitive to the choice of data types and assumptions about the accuracy of travel-history data. Taken together, these results suggest that transmission network inferences made with routinely collected surveillance data should be interpreted with caution. As we have done here, future studies inferring transmission networks should apply their algorithm to data simulated under alternative assumptions to assess the robustness of their inferences.


Introduction
Concomitant with improved epidemiological surveillance, there is growing interest in leveraging the collected data to infer transmission networks for a wide range of pathogens and in using those inferences to inform public health efforts. Past studies have incorporated temporal data [1] and spatial data [2-5] to estimate pairwise probabilities of transmission between individual cases and to use those estimates to infer time-varying and spatially varying reproduction numbers, respectively. More recently, methods have been developed to incorporate this type of detailed, individual-level epidemiological data [6-8] to infer transmission networks for infectious diseases of humans, including severe acute respiratory syndrome [9] and tuberculosis [10], and of animals, such as rabies [11] and foot-and-mouth disease [12].

In addition to the diseases to which these methods have been applied to date, there is a growing need to apply similar methods to malaria in near-elimination settings. As incidence of malaria declines within a country, transmission becomes more heterogeneous in space and time [13]. Focal areas of high transmission, known as "hotspots," pose a serious risk of fueling resurgence if left untargeted, potentially reversing decades of progress towards elimination [14]. To this end, granular estimates of when and where transmission occurs are needed, as spatially aggregated estimates may obscure important heterogeneities of practical relevance to control efforts [15]. In addition to characterizing details of local transmission, measurement of progress towards malaria elimination hinges on correct classification of cases as imported or locally acquired [16,17], which is a byproduct of estimating transmission networks.
cases were imported, the accuracy of identifying the true parent ranged from 84.2% (68.4-94.7%) under the default inference settings to 63.2% (47.4-78.9%) using temporal data and estimating the accuracy of the travel history (Fig 1C). In terms of identifying the outbreak to which a case belongs, the algorithm was accurate under all inference settings, since there was only one outbreak (Fig 1C).

of locally acquired cases for which the true parent is correctly identified, Outbreak is the proportion of locally acquired cases for which the inferred parent belongs to the correct outbreak, and Rc is the estimated reproduction number under control. Square points signify the median

medRxiv preprint doi: https://doi.org/10.1101/2020.08.24.20180844; this version posted August 26, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We next applied our method to surveillance data collected in Eswatini during 2013-2017. Under the default inference setting, we estimated the diffusion coefficient D, which quantifies the spatial spread of transmission, to be 4.42 km² day⁻¹ (2.92-6.18 km² day⁻¹) (Fig 2A). This corresponded to a median inferred transmission distance of 13.0 km (0.0160-65.9 km), a median inferred serial interval of 47 days (-33 to 150 days) (Fig 3A & 3B), and median estimates of ts, the probability that an imported case reported travel, of 0.61 (0.44-0.78) and tl, the probability that a locally acquired case reported travel, of 0.57 (0.53-0.61) (Fig 2B & 2C). That the 95% credible interval for ts contained 0.50 indicated that our inference algorithm found limited use of travel-history data in discriminating between imported and locally acquired cases, because a value of 0.50 implies that imported cases are equally likely to report or not report travel. The algorithm estimated the proportion of imported cases to be 0.052, corresponding to Rc = 0.95. Mapping risk of importation and local transmission across Eswatini under the default inference setting, we estimated consistently low risk of importation throughout the country and transmission hotspots in the northeastern part of Eswatini, close to the border with Mozambique (Fig 4A & 4B).

Histograms represent the marginal posterior distribution of each parameter, color-coded by the inference settings used.
D is the diffusion coefficient, ts is the probability that an imported case reports travel, and tl is the probability that a locally acquired case reports travel. Grey shapes represent the prior distributions placed on each parameter. Inference settings in which a given parameter was not estimated are indicated by NA.

setting. Dashed lines indicate the corresponding null distribution, generated from all random pairs of cases in the Eswatini surveillance data set. The grey shape is the serial interval distribution used in the likelihood.

settings. When we believed the travel history, we estimated a larger median transmission distance (Fig 3D).
We attribute this increase in the spatial scale of transmission to clusters of cases with positive travel histories located near metropolitan areas. By forcing those cases to be imported, the algorithm tended to infer transmission across longer distances to explain the origins of the remainder of cases that did not report travel and were thereby inferred to be locally acquired. With respect to time, all five inference settings produced consistent serial interval estimates, though the inclusion of spatial data allowed for a wider range of transmission linkages in time (Fig 3A, 3C, & 3E). Finally, in the absence of spatial data, the model estimated higher predictive power of travel histories in identifying imported cases (ts: 0.83, [0.60, 0.95]), though the travel history was consistently found to be uninformative for identifying locally acquired cases (tl: 0.57, [0.53, 0.60]) (Fig 2K & 2L).

Classification of cases as imported or locally acquired, key information for control programs, was sensitive to the choice of inference setting. The proportion of cases classified as imported was most sensitive to different assumptions about the accuracy of the travel histories (Fig 4, left column). Believing the travel history yielded high estimates of importation in western Eswatini (Fig 4C & 4I), whereas estimating or ignoring the travel history yielded low, relatively homogeneous estimates of importation risk (Fig 4A, 4E, & 4G). For instance, using temporal data and estimating the accuracy of the travel history produced probabilities of importation that ranged from 0.0043 to 0.0050, suggesting that nearly all cases resulted from local transmission (Fig 4G). Estimates of the spatial distribution of Rc depended most on the choice of which data types we included (Fig 4, right column). Notably, inclusion of spatial and temporal data produced a consistent spatial distribution of relative transmission risk, with transmission hotspots in northeastern Eswatini (Fig 4B, 4D, & 4F). However, believing the travel history reduced the magnitude of transmission that we inferred from a median Rc of 0.95 (Fig 4B) under default settings to 0.41 (Fig 4D). Omitting spatial data changed the spatial distribution of transmission. Estimating the accuracy of the travel history yielded high transmission estimates (median Rc: 1.00) in eastern Eswatini (Fig 4H), whereas believing the travel history inferred hotspots of transmission (median Rc: 0.42) in southern Eswatini (Fig 4J).

(Table 1). We found that the model was able to estimate ts and tl reasonably well, depending on the inference setting (Fig 5). One exception was that the algorithm slightly overestimated ts under the default inference setting. We attribute this to the low proportion (0.052) of imported infections in the simulated data set and the strong prior placed on this parameter. This tendency was not observed under the inference setting where spatial data was excluded, because the true value of ts closely matched the mean of the prior distribution in that case (see S2 File for further discussion).
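The correspondence between the proportion of imported cases and Rc in the estimates above (0.052 imported, Rc = 0.95) can be made explicit with a short calculation. The sketch below assumes a fully observed network in which every locally acquired case has exactly one parent, so that the mean number of offspring per case equals the fraction of cases that are locally acquired. The function names are illustrative and are not taken from the original analysis.

```python
def rc_from_counts(n_imported: int, n_total: int) -> float:
    """Mean number of offspring per case in a fully observed network.

    Assumption: each locally acquired case has exactly one parent, so the
    number of transmission events equals the number of local cases.
    """
    if n_total <= 0 or not 0 <= n_imported <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_imported <= n_total")
    n_local = n_total - n_imported
    return n_local / n_total


def rc_from_proportion_imported(p_imported: float) -> float:
    """Same relationship expressed with the proportion of imported cases."""
    if not 0.0 <= p_imported <= 1.0:
        raise ValueError("p_imported must lie in [0, 1]")
    return 1.0 - p_imported
```

Under this assumption, a proportion imported of 0.052 gives 1 - 0.052 = 0.948, which rounds to the reported Rc of 0.95. The same arithmetic shows why misclassifying imported cases as locally acquired inflates Rc.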

With the exception of believing the travel history, the model consistently overestimated the diffusion coefficient (Fig 5A, 5D, & 5G). We attribute the challenge of correctly estimating the diffusion coefficient to an inability to correctly estimate the underlying transmission network and the extent of local transmission in the network, and to a numerical insensitivity of the overall likelihood to changes in D. When we conditioned the likelihood of D on the true transmission network when Rc was high, the true values of D fell close to the range of maximum-likelihood estimates, suggesting that this parameter could be estimated correctly if the true network was identified (S1 Fig). The likelihood around the true value was very flat, however, making it easy for D to be estimated incorrectly. When Rc was low, we underestimated the diffusion coefficient, because the likelihood of imported cases increases as D decreases.

The overall accuracy of classifying cases as imported or locally acquired was close to one (Fig 6). Though seemingly promising, these high accuracies masked a tendency to overclassify cases as locally acquired, because many more cases were simulated to be locally acquired than

for which the true parent is correctly identified. Outbreak, represented by triangles, is the proportion of locally acquired cases for which the inferred parent belongs to the correct outbreak. Bars denote the 95% credible intervals, and the grey line is the true Rc value of the network.
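The accuracy metrics named in the caption can be sketched in code. The example below is a minimal illustration, not the authors' implementation: it assumes each network is encoded as a list of parent indices, with a hypothetical sentinel value of -1 marking imported cases, and it interprets "belongs to the correct outbreak" as the inferred parent sharing the same seeding imported case in the true network. It also reports the sensitivity for imported cases, which shows how a high overall classification accuracy can hide systematic misclassification of the rarer class.

```python
from typing import Dict, List, Sequence

IMPORTED = -1  # hypothetical sentinel: the case has no parent among observed cases


def outbreak_roots(parents: Sequence[int]) -> List[int]:
    """For each case, return the index of the imported case that seeded its outbreak."""
    roots = []
    for case in range(len(parents)):
        current, steps = case, 0
        while parents[current] != IMPORTED:
            current = parents[current]
            steps += 1
            if steps > len(parents):
                raise ValueError("parent array contains a cycle")
        roots.append(current)
    return roots


def network_accuracy(true_parents: Sequence[int],
                     inferred_parents: Sequence[int]) -> Dict[str, float]:
    """Compare an inferred network against the true one."""
    if len(true_parents) != len(inferred_parents):
        raise ValueError("networks must describe the same cases")
    n = len(true_parents)
    roots = outbreak_roots(true_parents)
    local = [j for j in range(n) if true_parents[j] != IMPORTED]
    imported = [j for j in range(n) if true_parents[j] == IMPORTED]
    if not local or not imported:
        raise ValueError("need at least one local and one imported case")

    # Parent: locally acquired cases whose true parent was identified.
    parent_hits = sum(inferred_parents[j] == true_parents[j] for j in local)
    # Outbreak: locally acquired cases whose inferred parent is in the true outbreak.
    outbreak_hits = sum(
        inferred_parents[j] != IMPORTED and roots[inferred_parents[j]] == roots[j]
        for j in local
    )
    # Overall imported-versus-local classification accuracy.
    class_hits = sum(
        (true_parents[j] == IMPORTED) == (inferred_parents[j] == IMPORTED)
        for j in range(n)
    )
    imported_hits = sum(inferred_parents[j] == IMPORTED for j in imported)
    inferred_local = sum(p != IMPORTED for p in inferred_parents)
    return {
        "parent": parent_hits / len(local),
        "outbreak": outbreak_hits / len(local),
        "classification": class_hits / n,
        "imported_sensitivity": imported_hits / len(imported),
        "rc_inferred": inferred_local / n,
    }
```

For a toy network with true parents `[-1, 0, 1, -1, 3, 3]` and inferred parents `[-1, 0, 0, 2, 3, 4]`, five of six cases are classified correctly, yet only one of the two imported cases is recovered and the inferred Rc (5/6) exceeds the true value (4/6).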

Simulation Sweep
Validation of our inference algorithm revealed that its performance varied across simulated data sets. When applied to a series of simple test cases in which the transmission networks were small and in an optimal spatiotemporal arrangement, the inference method was able to reconstruct the transmission network and correctly estimate Rc (Fig 1). When applied to larger transmission networks in which outbreaks overlapped in space and time, performance of the inference method was poor (Fig 6). This indicated that the performance of our inference algorithm depends on the epidemiological setting to which it is applied. To address this observation, we generated 2,000 simulated data sets in which we varied the proportion of imported cases, the spatiotemporal window over which imported cases were distributed, the diffusion coefficient, and the accuracies of the travel history (i.e., ts and tl) (S2 Table). We then applied our inference algorithm under three different inference settings and quantified the accuracy of reconstructing each transmission network. The three inference settings used were: (1) spatial and temporal data while estimating the accuracy of the travel history (default setting); (2) spatial and temporal data while believing the travel history; and (3) spatial and temporal data alone (S1 Table).

We observed that the accuracy of reconstructing transmission networks depended upon both the inference setting used and the epidemiological features of the simulated data. When we used spatial and temporal data and estimated the accuracy of the travel history or excluded it, the accuracy of reconstructing transmission networks depended on the relative proportion and temporal distribution of imported cases (S8 and S9 Figs). As the temporal window over which imported cases were distributed increased, the accuracy of identifying the true parent and the true outbreak of each locally acquired case increased. With an increasing temporal window, outbreaks within the transmission network became relatively more focal in time, which made the likelihoods of alternative transmission linkages more readily distinguishable. More accurate estimates of Rc under these inference settings similarly depended on the temporal window over which imported cases were distributed (Fig 7A, 7C). When the mean temporal interval between imported infections was greater than two times the mean length of the serial interval (i.e., approximately 100 days), our estimates of Rc improved, though we generally overestimated it. Furthermore, as the proportion of imported cases increased and Rc decreased, the accuracy of identifying the correct outbreak of each locally acquired case decreased (S8 and S9 Figs). This pattern reflected the relationship between Rc and the size of individual outbreaks. As Rc decreased, the size of individual outbreaks decreased, and, consequently, the probability that the inferred parent of a locally acquired case belonged to the same outbreak decreased.
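The relationship between Rc and outbreak size described above can be illustrated with a simple branching-process sketch. It assumes that each outbreak is seeded by one imported case and that offspring numbers are Poisson distributed with mean Rc; the Poisson choice and all function names are assumptions made for illustration and are not taken from the simulation model used in this study. For Rc < 1, the expected outbreak size, counting the imported case, is 1 / (1 - Rc).

```python
import math
import random


def expected_outbreak_size(rc: float) -> float:
    """Expected total cases per imported case: 1 + Rc + Rc^2 + ... = 1 / (1 - Rc)."""
    if not 0.0 <= rc < 1.0:
        raise ValueError("the closed form requires 0 <= rc < 1")
    return 1.0 / (1.0 - rc)


def _poisson(rng: random.Random, mean: float) -> int:
    """Draw a Poisson variate with Knuth's method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_outbreak_size(rc: float, rng: random.Random,
                           max_size: int = 100_000) -> int:
    """Simulate one outbreak generation by generation, starting from one imported case."""
    size = 1
    active = 1
    while active > 0 and size < max_size:
        offspring = sum(_poisson(rng, rc) for _ in range(active))
        size += offspring
        active = offspring
    return size


def mean_simulated_size(rc: float, n_outbreaks: int = 20_000, seed: int = 1) -> float:
    """Average simulated outbreak size, for comparison with the closed form."""
    rng = random.Random(seed)
    total = sum(simulate_outbreak_size(rc, rng) for _ in range(n_outbreaks))
    return total / n_outbreaks
```

Under these assumptions, Rc = 0.95 implies about 20 cases per imported case, whereas Rc = 0.41 implies about 1.7, so at low Rc most outbreaks contain few candidate parents from the same outbreak.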

setting, our estimate of Rc depended only on the proportion of cases that reported travel. When the proportion of cases that reported travel matched the proportion of cases that were imported, we correctly estimated Rc (Fig 7B).

Under most simulated scenarios and assumptions about the accuracy of travel-history data, we overestimated the number of locally acquired cases, leading to overestimates of Rc. Crucially, our simulation sweep demonstrated that routinely collected surveillance data was most informative of individual-level transmission networks and Rc when local outbreaks were highly focal in time. Otherwise, while we were able to reconstruct the true transmission network with modest accuracy, we tended to misclassify truly imported cases as locally acquired, thereby overestimating Rc. Taken together, these results suggest limited use of routinely collected surveillance data for informing fine-scale estimates of P. falciparum transmission. At broader

Although we were able to reach some general conclusions about our inference algorithm, our inferences were highly sensitive to which data types we included and which assumptions we made about the accuracy of travel-history data. Applying our algorithm to surveillance data from Eswatini, we observed that inferred patterns of transmission depended on which data types we included. With the inclusion of spatial data, we captured a spatial pattern of transmission consistent with another analysis from Eswatini [28] with data from a different time period. Assumptions about the travel history appeared to have a strong influence on the overall magnitude of transmission that we inferred, due to the direct relationship between Rc and the proportion of imported cases [16]. As a result, believing the travel history, and thereby treating it as perfectly accurate as in previous approaches [6,18-20], could bias Rc estimates if there are errors in travel-history data. A study comparing community travel surveys to mobile-phone data in Kenya found that travel histories considerably underestimated the volume of travel, suggesting high rates of false negatives in community travel surveys [29]. Believing the travel history may, then, underestimate the number of imported cases and overestimate Rc.

The method that we used only considered a single spatial model to infer transmission linkages and assumed complete observation of cases, both of which are factors that could have affected our inferences based on the Eswatini surveillance data. The diffusion model that we used to represent spatial dispersion of parasites assumed that movement is isotropic in space and did not consider landscape features, such as heterogeneity in human population densities and environmental factors that may affect mosquito ecology. A study analyzing self-reported movement patterns in Mali, Burkina Faso, Zambia, and Tanzania found that gravity and radiation models of spatial dispersion fit the data well, though the appropriateness of each model depended on the type of traveler, the travel distance, and the population size of the destination considered [30]. Regarding the representation of P. falciparum infections in our data set from Eswatini, there are asymptomatic and mild infections that are unlikely to have been recorded in the surveillance system yet may comprise a substantial proportion of malaria infections within Eswatini [13]. Accordingly, it is possible that our assumption of complete observation of cases could have biased Rc estimates, likely downward, because missing cases will tend to make offspring numbers appear smaller than they actually are [31,32]. Even so, we expect that our conclusions about the sensitivity of transmission network inferences to the choice of data types and assumptions about travel-history data are robust to these limitations of our study. This reinforces our conclusion of the need for caution in attempting to reconstruct person-to-person transmission networks from routine surveillance data.

Given that some of the limitations of our approach may be inherent to the information content of these data types in this system, one potential avenue for improving inferences of fine-scale patterns of P. falciparum transmission could involve the integration of additional data streams. For example, mobile-phone data [33] and high-resolution friction surfaces [34] could more realistically characterize mobility patterns, whereas detailed travel-history information recording the dates, duration, and location of each trip, as has been used in programmatic contexts [26], could more accurately identify importation events. Additionally, the inclusion of pathogen genetic data, which has the potential to provide a more direct signal of parasite movement, could complement traditional epidemiological data [35]. There is also scope for further methodological development, such as relaxing our assumption of complete observation of infections and

inferences are operationalized. Although this study was specific to P. falciparum, the results of our analyses indicate that future studies inferring transmission networks of P. falciparum, or any pathogen, should carefully consider the epidemiological setting and the choice of data types and assumptions that inform the model and should validate them using simulated data.

Bayesian framework for estimating transmission linkages
Our goal was to obtain probabilistic estimates of a transmission network N that defines

Fig 8. Schematic of a hypothetical transmission network. A hypothetical transmission network is presented along with the corresponding notation. In the schematic, white circles denote unobserved cases, and black circles denote observed cases. Arrows represent transmission between two cases.

To estimate N, we used spatial, temporal, and travel-history data about all cases, denoted as $\vec{X}_s$, $X_t$, and $X_h$, respectively. We did so within a Bayesian statistical framework, meaning that

denominator is the probability of the data, which is an intractable quantity to calculate directly given that it would require evaluation of an extremely high-dimensional integral over N and Θ. To address this, we used a Markov chain Monte Carlo algorithm to draw random samples of N and Θ from the posterior distribution specified in eq. (1).

The most critical piece of our inference framework is the likelihood, which we define as a function of each case j as

dimensional Wiener diffusion process determines the location of secondary cases relative to the location of their associated primary case. It follows that, for a given diffusion coefficient D with units km² day⁻¹ and generation interval $GI_{i,j}$, the two-dimensional location $\vec{X}_{s,j}$ of the secondary case j is described by a bivariate normal distribution with probability density

where N ( :,7 -= :,7 . This formulation assumes that each spatial dimension is independent, that the variance scales linearly with the generation interval, and that movement is isotropic across a continuous landscape.

One complication to eq. (5) is that the generation interval $GI_{i,j}$ is unobserved and, therefore, cannot take on a fixed value. Instead, we must use data about the serial interval $SI_{i,j}$ to inform our generative model for $\vec{X}_{s,j}$. To do so, we take advantage of the property of normal random variables that the sum of two or more independent normal random variables is itself a normal random

as the probability density of the spatial data that we assume.
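The diffusion-based spatial density described above can be sketched numerically. Because eq. (5) is not reproduced in this text, the example assumes one common convention, in which the variance in each spatial dimension equals D multiplied by the elapsed time; other conventions (for example 2Dt) differ only by a constant factor. The function and parameter names are illustrative. The second helper integrates a function over a grid, which allows two checks: the density integrates to one, and the integral of the product of two such densities over an unknown location equals 1 / (4πσ²), a standard two-dimensional Gaussian integral. The latter is offered only as an illustration of the kind of simplification that an isotropic Gaussian model permits, not as the authors' exact expression.

```python
import math
from typing import Callable


def diffusion_density(dx: float, dy: float,
                      diffusion_coef: float, elapsed_days: float) -> float:
    """Isotropic bivariate normal density of a displacement (dx, dy) in km.

    Assumed convention: per-dimension variance = diffusion_coef * elapsed_days.
    """
    if diffusion_coef <= 0 or elapsed_days <= 0:
        raise ValueError("diffusion_coef and elapsed_days must be positive")
    variance = diffusion_coef * elapsed_days
    squared_distance = dx * dx + dy * dy
    return math.exp(-squared_distance / (2.0 * variance)) / (2.0 * math.pi * variance)


def grid_integral(func: Callable[[float, float], float],
                  half_width: float, step: float) -> float:
    """Midpoint-rule integral of func(x, y) over a square centred on the origin."""
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for ix in range(n):
        x = -half_width + (ix + 0.5) * step
        for iy in range(n):
            y = -half_width + (iy + 0.5) * step
            total += func(x, y)
    return total * step * step
```

With D = 4 km² day⁻¹ and one elapsed day (variance 4 km²), the density integrates to approximately one over a 40 km square, and the integral of its square is approximately 1 / (16π).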

(which was not certified by peer review)
medRxiv preprint doi: https://doi.org/10.1101/2020.08.24.20180844; this version posted August 26, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

In the event that case i has missing spatial data, we cannot compute the spatial likelihood of eq. (7). To address this, we define a latent unobserved quantity $\tilde{X}_{s,i}$, which represents the unknown location of case i. We then integrate over the uncertainty in $\tilde{X}_{s,i}$,

$$ f(\vec{X}_{s,j} \mid D, N_{i,j}, \Theta) = \int f(\vec{X}_{s,j} \mid \tilde{X}_{s,i}, D, N_{i,j}, \Theta)\, f(\tilde{X}_{s,i} \mid \vec{X}_{s,j}, D)\, d\tilde{X}_{s,i}, \tag{8} $$

to compute the probability density of case j with known spatial location $\vec{X}_{s,j}$ arising from case i with unknown spatial location $\tilde{X}_{s,i}$. Equation (8) is the product of the probability density of the location of a known case j conditional on an unknown location $\tilde{X}_{s,i}$ and the probability density of the spatial separation $\vec{X}_{s,j} - \tilde{X}_{s,i}$ conditional on the diffusion coefficient $D$, integrated over all $\tilde{X}_{s,i}$. Because we assume that movement is isotropic, eq. (8) is a two-dimensional Gaussian integral, which simplifies to the closed-form expression given as eq. (9).

In the event that case j has missing spatial data and case i has known spatial data, the latent unobserved quantity becomes $\tilde{X}_{s,j}$. We then integrate over the uncertainty in $\tilde{X}_{s,j}$ and calculate the corresponding probability density using eqs. (8) and (9).

Probability of the travel-history data. Although we assume in this scenario that a person's infection was locally acquired, our model must still be capable of explaining the travel-history data $X_{h,j}$. We define a probability $\tau_l$ that case j reported travel (i.e., $X_{h,j} = 1$) even though they were not infected during that period of travel, such that

$$ \Pr(X_{h,j} \mid N_{i,j}, \Theta) = \begin{cases} \tau_l, & X_{h,j} = 1 \\ 1 - \tau_l, & X_{h,j} = 0. \end{cases} \tag{10} $$

In the event that case j has missing travel-history data, we cannot compute the travel-history likelihood of eq. (10). To address this, we define a latent unobserved quantity $\tilde{X}_{h,j}$, which represents the unknown travel history of case j. We then sum across the uncertainty in $\tilde{X}_{h,j}$,

$$ \Pr(X_{h,j} = \mathrm{NA} \mid N_{i,j}, \Theta) = \Pr(\tilde{X}_{h,j} = 1)\,\tau_l + \left(1 - \Pr(\tilde{X}_{h,j} = 1)\right)(1 - \tau_l), \tag{11} $$

to compute the probability that case j was locally acquired given an unknown travel history. In eq. (11), $\Pr(\tilde{X}_{h,j} = 1)$ was computed as the proportion of cases with a positive travel history among all cases with known travel-history data.

Taken together with the probabilities of the temporal and spatial data described above, the product of these three probabilities constitutes the entire contribution of a case j infected by a known local case i to the overall likelihood of $N$ and $\Theta$.

Scenario 2: Importation of local case j from source population s

In the event of $N_{u_s,j}$, we represent the contribution of such a case to the overall likelihood of $N$ and $\Theta$ as the product of the probabilities of its temporal, spatial, and travel-history data under similar assumptions as in Scenario 1. The key difference in this scenario is that there is no information about the unknown source case that gave rise to case j.

Probability of the temporal data. Because the person containing parasites that are the direct ancestors of those in case j is unobserved and does not have an $X_{t,i}$, we are unable to compute the probability of the temporal data as described in Scenario 1. It is important, though, to obtain a probability comparable to that from Scenario 1 as a reference point for determining whether it is more likely that a given case arose from some other known local case or from an unknown case $u_s$ from source population s. To do so, we consider $\tilde{X}_{t,u_s}$, a latent variable describing when $u_s$ would have been detected, had it been detected.

Because $u_s$ is not observed, we considered it to be asymptomatic and untreated. We then calculated the probability of the timing of a known case j arising from an unknown case $u_s$ as

$$ \Pr(X_{t,j} \mid N_{u_s,j}, \Theta) = \int \Pr(X_{t,j} \mid \tilde{X}_{t,u_s}, N_{u_s,j}, \Theta)\, \Pr(\mathrm{SI} = X_{t,j} - \tilde{X}_{t,u_s})\, d\tilde{X}_{t,u_s}, \tag{12} $$

by integrating over uncertainty in $\tilde{X}_{t,u_s}$. We represented this as the product of the probability of the timing of a known case j conditional on an unknown time of detection $\tilde{X}_{t,u_s}$ and the probability of the serial interval $X_{t,j} - \tilde{X}_{t,u_s}$, integrated over all $\tilde{X}_{t,u_s}$. In eq. (12), we did not distinguish between symptomatic and asymptomatic cases j because the calculation is identical; only the serial interval distributions differ.
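To make the travel-history terms of Scenario 1 concrete, eqs. (10) and (11) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the use of `None` to encode a missing travel history, and the example values are all assumptions made here.

```python
from typing import Optional, Sequence


def proportion_with_travel(histories: Sequence[Optional[int]]) -> float:
    """Proportion of positive travel histories among cases with known data.

    `None` encodes a missing travel history (an encoding assumed here).
    """
    known = [h for h in histories if h is not None]
    return sum(known) / len(known)


def travel_history_prob(x_h: Optional[int], tau_l: float, p_travel: float) -> float:
    """Probability of a case's travel-history datum given local acquisition.

    x_h      : 1 (reported travel), 0 (no travel), or None (missing)
    tau_l    : probability of reporting travel despite local infection
    p_travel : Pr(latent travel history = 1), estimated from known cases
    """
    if x_h is None:
        # Missing datum: sum over the latent travel history, as in eq. (11).
        return p_travel * tau_l + (1.0 - p_travel) * (1.0 - tau_l)
    # Observed datum: eq. (10).
    return tau_l if x_h == 1 else 1.0 - tau_l


# Hypothetical travel histories for six cases (made-up values).
example_histories = [1, 0, 0, None, 1, 0]
p = proportion_with_travel(example_histories)  # 2 of 5 known cases reported travel
prob_missing = travel_history_prob(None, tau_l=0.2, p_travel=p)
```

With these made-up inputs, two of the five known histories are positive, so the missing-data term weights $\tau_l$ by 0.4 and $1 - \tau_l$ by 0.6.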

Probability of the spatial data. Without an $\tilde{X}_{t,u_s}$ for the unobserved case $u_s$, we lacked information on the serial interval between it and case j. Consequently, we were unable to use the probability from eq. (7) in that particular form. Instead, we computed the spatial variance as a [...]

We applied this spatial variance to the unobserved latent variable $\tilde{X}_{s,u_s}$, which represents the unknown location of the unobserved case $u_s$. We integrated over uncertainty in $\tilde{X}_{s,u_s}$ to compute the probability density,

$$ f(X_{s,j} \mid D, N_{u_s,j}, \Theta) = \int f(X_{s,j} \mid \tilde{X}_{s,u_s}, D, N_{u_s,j}, \Theta)\, f(\tilde{X}_{s,u_s} \mid X_{s,j}, D)\, d\tilde{X}_{s,u_s}, \tag{14} $$

of the location of a known case j arising from an unknown source case $u_s$ with unknown location $\tilde{X}_{s,u_s}$. This is the product of the probability density of the location of a known case j conditional on an unknown location $\tilde{X}_{s,u_s}$ and the probability density of the spatial separation $X_{s,j} - \tilde{X}_{s,u_s}$ conditional on the diffusion coefficient $D$, integrated over all $\tilde{X}_{s,u_s}$. As in eq. (9), we treated eq. (14) as a Gaussian integral, evaluating to [...]
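Why an integral of this kind collapses to a closed form can be checked numerically. The sketch below assumes, purely for illustration, that both factors in the integrand are the same isotropic two-dimensional Gaussian kernel with per-axis variance `sigma2`. That parameterization and all function names are hypothetical, and it is not claimed to match the source's spatial variance. Under that assumption, integrating the squared kernel over the plane gives $1 / (4 \pi \sigma^2)$.

```python
import math


def gaussian_kernel_2d(r: float, sigma2: float) -> float:
    """Isotropic 2-D Gaussian density at distance r, per-axis variance sigma2."""
    return math.exp(-r * r / (2.0 * sigma2)) / (2.0 * math.pi * sigma2)


def integrate_squared_kernel(sigma2: float, r_max_sd: float = 10.0,
                             n_steps: int = 20000) -> float:
    """Midpoint-rule integral of kernel(r)^2 over the plane, in polar coordinates.

    The radial integral is truncated at r_max_sd standard deviations,
    beyond which the integrand is negligible.
    """
    r_max = r_max_sd * math.sqrt(sigma2)
    dr = r_max / n_steps
    total = 0.0
    for k in range(n_steps):
        r = (k + 0.5) * dr
        # Area element of a thin ring of radius r is 2*pi*r*dr.
        total += gaussian_kernel_2d(r, sigma2) ** 2 * 2.0 * math.pi * r * dr
    return total


sigma2 = 2.5  # hypothetical per-axis variance
numeric = integrate_squared_kernel(sigma2)
closed_form = 1.0 / (4.0 * math.pi * sigma2)
```

The numerical and closed-form values agree to within the tolerance of the midpoint rule, which illustrates why no numerical integration over the latent location is needed at run time.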

[...] where $\lambda > 0$ is a temperature increment parameter that governs the degree to which each chain is "heated." Because $\beta_1 = 1$, the target distribution of the first chain is directly proportional to the joint posterior distribution of $(N, \Theta)$, and that chain is referred to as the master or "cold" chain. This algorithm effectively flattens the likelihood in the heated chains by setting $\beta_c > 1$, allowing them to explore the parameter space more freely and to encounter alternative high-density regions more readily than the cold chain would alone. At a pre-defined frequency, two randomly selected chains i and j were allowed to swap parameter sets according to a swap probability [...]

[...] proposed value that fell outside the range [0,1] and assigned its update probability a value of 0.

Changes proposed to the network topology involved the addition or removal of an ancestor from a randomly selected node. We assigned a uniform probability of proposing case a as an ancestor to a randomly selected case i, such that proposals to the network topology were uninformed by spatial and temporal data. Furthermore, we defined the proposal probability of removing case a as an ancestor to a randomly selected case i as [...]

Prior assumptions. We placed strong priors on $\tau_s$ and $\tau_l$, because we assumed that travel histories were mostly, but not completely, accurate. We used a beta-distributed prior on $\tau_s$, with parameters $\alpha_s = 12$ and $\beta_s = 3$, which resulted in a mean of 0.8 and a variance of 0.01 for this prior distribution. We also used a beta-distributed prior on $\tau_l$, with parameters $\alpha_l = 3$ and $\beta_l = 12$, which resulted in a mean of 0.2 and a variance of 0.01. We assumed a uniform prior on $D$ over a half-open interval with a positive lower bound and no upper bound, and an even prior across all possible network configurations, meaning that those prior probabilities canceled out in eqs. (17) and (20).
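The stated means and variances follow from the standard moments of a beta distribution, mean $\alpha / (\alpha + \beta)$ and variance $\alpha \beta / [(\alpha + \beta)^2 (\alpha + \beta + 1)]$. A minimal check, with a hypothetical helper name:

```python
def beta_mean_var(alpha: float, beta: float) -> tuple:
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


# Prior on tau_s: Beta(12, 3); prior on tau_l: Beta(3, 12).
mean_s, var_s = beta_mean_var(12, 3)
mean_l, var_l = beta_mean_var(3, 12)
```

Both priors have the same variance because swapping the two shape parameters mirrors the distribution about 0.5.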
We assessed convergence by calculating correlation coefficients of case-level probabilities across five chains from independent realizations of the MC$^3$ algorithm, for a total of 10 pairwise comparisons across the five chains. The two case-level probabilities that we considered were the posterior probability that each case was infected by an unknown case $u_s$ from a source population and the posterior probability that each case j was infected by each other case i. Higher values of these correlation coefficients provided stronger support for convergence.
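The pairwise comparison described above can be sketched as follows. The source does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption, as are the function names and the made-up per-case probabilities.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise_chain_correlations(chains):
    """Correlation of case-level probabilities for every pair of chains."""
    return {
        (a, b): pearson(chains[a], chains[b])
        for a, b in combinations(range(len(chains)), 2)
    }


# Hypothetical posterior probabilities of importation for four cases,
# one list per independent chain (made-up values).
example_chains = [
    [0.10, 0.80, 0.35, 0.60],
    [0.12, 0.78, 0.30, 0.65],
    [0.09, 0.82, 0.40, 0.58],
    [0.15, 0.75, 0.33, 0.62],
    [0.11, 0.79, 0.36, 0.61],
]
correlations = pairwise_chain_correlations(example_chains)
```

Five chains yield ten unordered pairs, matching the count given in the text. The same function applies to the case-by-case infection probabilities once each chain's matrix is flattened to a vector.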