Leveraging Pathogen Sequence and Contact Tracing Data to Enhance Vaccine Trials in Emerging Epidemics

Supplemental Digital Content is available in the text.

Introduction: Advance planning of vaccine trials conducted during outbreaks increases our ability to rapidly define the efficacy and potential impact of a vaccine. Vaccine efficacy against infectiousness (VEI) is an important measure for understanding a vaccine’s full impact, yet it is currently not identifiable in many trial designs because it requires knowledge of infectors’ vaccination status. Recent advances in genomics have improved our ability to reconstruct transmission networks. We aim to assess whether augmenting trials with pathogen sequence and contact tracing data can permit them to estimate VEI.

Methods: We develop a transmission model with a vaccine trial in an outbreak setting, incorporate pathogen sequence data and contact tracing data, and assign probabilities to likely infectors. We then propose and evaluate the performance of an estimator of VEI.

Results: We find that under perfect knowledge of infector-infectee pairs, we are able to accurately estimate VEI. Use of sequence data results in imperfect reconstruction of transmission networks, biasing estimates of VEI toward the null, with approaches using deep sequence data performing better than approaches using consensus sequence data. Inclusion of contact tracing data reduces the bias.

Conclusion: Pathogen genomics enhances identifiability of VEI, but imperfect transmission network reconstruction biases estimates toward the null and limits our ability to detect VEI. Given the consistent direction of the bias, estimates obtained from trials using these methods will provide lower bounds on the true VEI. A combination of sequence and epidemiologic data results in the most accurate estimates, underscoring the importance of contact tracing.

Vaccine trials conducted during epidemics of emerging infectious diseases provide an important opportunity to test the safety and efficacy of vaccine candidates. Increasing our ability to quickly and accurately understand the impact of a vaccine candidate in the urgent setting of an outbreak is critical for enhancing public health response. The use of the ring vaccination strategy in the Ebola ça Suffit trial during the 2013-2016 West African Ebola outbreak highlighted the importance of developing innovative designs for trials conducted during an ongoing outbreak. 1 It also underscored the need to think through trial design and analysis strategies in advance to expedite the rollout of a vaccine trial once an outbreak starts and to identify the best methods for obtaining high quality efficacy estimates in outbreak settings. 2

Multiple components of vaccine efficacy can be estimated from a vaccine trial. 3 Individually randomized controlled trials (iRCTs) estimate vaccine efficacy against susceptibility to infection (VE S), the direct effect of the vaccine on vaccinated individuals. 3 If reducing susceptibility to infection is the only effect of the vaccine, then this measure, combined with information on contact network structure and pathogen transmission dynamics, can be used to estimate the total effect of a vaccination program, a combination of the direct and indirect (i.e., herd immunity) effects. Vaccine efficacy against infectiousness (VE I), the reduction in onward transmission from a vaccinated person who is infected compared with an unvaccinated infected person, is another important measure for understanding the impact of a vaccine. 3 Even if a vaccine does not protect everyone who is vaccinated from getting infected, its impact on infectiousness for those who are vaccinated but nevertheless become infected plays a critical role in both outbreak dynamics and the cost-effectiveness of a vaccine program. The significance of understanding interventions' effects on future transmission is exemplified by the efforts of HIV treatment-as-prevention programs to reduce patients' viral loads to undetectable levels to prevent onward transmission. 4,5

To estimate VE I, the vaccination status of infectors must be known. VE I is therefore potentially measurable in household studies 6,7 and partner transmission studies, such as HIV vaccine trials, 8 because in these settings, infector-infectee pairs can be identified (by assuming that household members or partners are the infectors) and thus the vaccination status of infectors is known. However, VE I is not currently identifiable in population-level vaccine trials, such as those often conducted during an infectious disease outbreak, because the transmission network, and consequently the vaccination status of infectors, is typically unknown.
Recent advances in pathogen genomics have improved our ability to accurately reconstruct transmission networks. [9][10][11][12][13][14][15] The West African Ebola epidemic and the ongoing COVID-19 pandemic have demonstrated our growing capacity to use sequence data in outbreak settings, [16][17][18][19][20][21] and recent work has highlighted the potential for deep sequence data to add resolution to transmission networks. [22][23][24] We aim to assess if augmenting classical randomized controlled trial designs with pathogen sequence data, as well as contact tracing data, would permit these trials to estimate VE I by reconstructing transmission networks and identifying the trial status of infectors.

METHODS

Network and Model Structure
We simulate a compartmental network model of an outbreak of a disease with Ebola-like parameters, together with a vaccine trial, the details of which have been previously described. 25 We first generate a network graph with individuals grouped into communities. To create edges between nodes on the graph, we use the network function sample_sbm from the R package igraph 26 to conduct a Bernoulli trial for each pair of nodes, with the probability set higher for two nodes in the same community compared with nodes in different communities (Table 1). The mean degree (i.e., average number of connections each node has) and β (i.e., the transmission hazard per contact) are calibrated so that the basic reproduction number is 1.5. 27 A specified number of introductions of infection into the network (Table 1) occurs at a time-varying rate, based on a deterministic epidemic curve of an Ebola outbreak. 28 The disease natural history in the communities follows a stochastic susceptible, exposed, infectious and recovered (SEIR) model, with Ebola-like parameters (Table 1). Each individual has a daily probability of infection from each of their infectious contacts (i.e., connections in the network) of 1 − e^(−β).
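The network construction and the daily infection probability described above can be sketched in Python. This is an illustration only, not the authors' R code: the function names are hypothetical, the pairwise Bernoulli loop stands in for igraph's sample_sbm, and the community size is scaled down from 5,000 to 100 so the example runs quickly.

```python
import math
import random
from itertools import combinations


def sample_block_network(n_communities, community_size, p_within, p_between, rng):
    """Return (community label per node, set of undirected edges)."""
    n_nodes = n_communities * community_size
    community = [node // community_size for node in range(n_nodes)]
    edges = set()
    # One Bernoulli trial per pair of nodes; the success probability is
    # higher when both nodes belong to the same community.
    for u, v in combinations(range(n_nodes), 2):
        p = p_within if community[u] == community[v] else p_between
        if rng.random() < p:
            edges.add((u, v))
    return community, edges


def daily_infection_probability(beta):
    """Daily probability of infection from one infectious contact: 1 - exp(-beta)."""
    return 1.0 - math.exp(-beta)


rng = random.Random(1)
# Scaled-down illustration: 2 communities of 100 nodes each, with the
# within- and between-community connection probabilities from Table 1.
community, edges = sample_block_network(2, 100, 0.02, 0.001, rng)
mean_degree = 2 * len(edges) / len(community)
```

In a calibration step like the one described, mean_degree and β would be tuned together until the basic reproduction number reaches the target value; that step is not shown here.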

Vaccine Trial
Individuals are enrolled into an iRCT, with 50% randomized to vaccine and 50% to control. The vaccine's efficacy against susceptibility to infection is "leaky," with 60% efficacy (VE S = 0.60), meaning upon each exposure, the vaccine reduces a vaccinated individual's chance of infection by 60%. Building upon the original model, we further incorporate vaccine efficacy against infectiousness, which we assume to be 30%, meaning infectiousness among infected vaccinated individuals is 30% lower than among infected unvaccinated individuals (VE I = 0.30). Table 2 shows the number of infections expected for each type of infector-infectee pair from the trial simulations.
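A minimal sketch of how the two efficacy components could enter the per-contact infection probability follows. It assumes, for illustration, that both VE S and VE I scale the transmission hazard β multiplicatively; the text does not specify whether the reduction applies to the hazard or to the probability (for a small β the two nearly coincide). The function name and the value of β are hypothetical.

```python
import math


def transmission_probability(beta, infectee_vaccinated, infector_vaccinated,
                             ve_s=0.60, ve_i=0.30):
    """Daily infection probability for one infector-infectee contact.

    Assumption: a leaky vaccine multiplies the hazard by (1 - VE_S) when the
    exposed person is vaccinated and by (1 - VE_I) when the infector is.
    """
    hazard = beta
    if infectee_vaccinated:
        hazard *= 1.0 - ve_s
    if infector_vaccinated:
        hazard *= 1.0 - ve_i
    return 1.0 - math.exp(-hazard)


beta = 0.05  # hypothetical hazard per contact per day, for illustration only
pair_probabilities = {
    (infector, infectee): transmission_probability(beta, infectee, infector)
    for infector in (False, True)
    for infectee in (False, True)
}
```

Under this assumption, the four entries of pair_probabilities correspond to the four types of infector-infectee pair, with the vaccinated-to-vaccinated pair having the lowest probability.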

Estimating VE I
We first define terms used in estimation of VE I. Let θ = 1 − VE S be the risk ratio for becoming infected if one receives vaccine vs. control, let Φ = 1 − VE I be the relative infectiousness of a vaccinated person who becomes infected to a control who becomes infected, and let ρ be the ratio of the number of cases infected by a vaccinated person to the number of cases infected by a control, which is expected to equal θΦ (Table 2). To estimate ρ, the trial status (i.e., in the vaccine or control group) of infectors must be known, which requires knowledge of who infected whom. If we are able to estimate ρ using the methods described below, we can estimate VE I as VE I = 1 − ρ/θ = 1 − ρ/(1 − VE S).

To estimate VE I, we first make the unrealistic assumption of complete knowledge of the transmission network, with perfect ascertainment of who infected whom and their infection and recovery times. We then relax the assumption of perfect knowledge of who infected whom. Using the R package seedy, 29 we incorporate pathogen evolution and sampling of both consensus and deep sequence data into the simulations, specifying parameters such as genome length, mutation rate and bottleneck size (Table 1). The results from the model described above provide the epidemiologic data (e.g., exposure time, infectiousness onset time, recovery time and infector-infectee pairs) that are used as inputs in seedy's stochastic models of mutation that generate these sequence data for each infected individual. For each person, we simulate sampling of both consensus and deep sequence data at a randomly chosen time during the course of their infection. As the choice of parameters, particularly mutation rate and bottleneck size, greatly impacts our ability to reconstruct transmission networks, 22 we vary parameters across simulations to assess their impact on our ability to estimate VE I and to determine the characteristics of pathogens for which these methods will be most applicable (Table 1: values used in sensitivity analyses for model results shown in eAppendix; http://links.lww.com/EDE/B816).
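Under perfect knowledge of who infected whom, the estimation step reduces to counting infectors by trial arm. The sketch below assumes the Table 2 definitions (θ = 1 − VE S, Φ = 1 − VE I, expected infector ratio θΦ) and an equally allocated trial, so that VE I = 1 − ρ/(1 − VE S). The function names and the counts are hypothetical.

```python
def infector_ratio(infector_arms):
    """rho: cases infected by a vaccinated person / cases infected by a control."""
    n_vaccine = sum(1 for arm in infector_arms if arm == "vaccine")
    n_control = sum(1 for arm in infector_arms if arm == "control")
    if n_control == 0:
        raise ValueError("no infections attributed to control infectors")
    return n_vaccine / n_control


def estimate_ve_i(rho, ve_s):
    """Solve rho = (1 - VE_S) * (1 - VE_I) for VE_I."""
    return 1.0 - rho / (1.0 - ve_s)


# Hypothetical counts: one entry per infectee, giving the arm of its infector.
infector_arms = ["vaccine"] * 28 + ["control"] * 100
rho = infector_ratio(infector_arms)
ve_i_hat = estimate_ve_i(rho, ve_s=0.60)  # approximately 0.30
```

With VE S = 0.60 and VE I = 0.30, the expected ratio is 0.4 × 0.7 = 0.28, which is why 28 infections from vaccinated infectors per 100 from controls recovers an estimate of about 0.30.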
For each infectee, we then assign a probability to each potential source of infection, based on comparisons of the sequence data for the candidate source(s) and each index case using four different approaches. In the first two approaches, we use consensus sequence data. First, we assign probabilities to potential infectors based on the inverse of the genetic distance between the infectee and potential infectors. Second, we use the geometric-Poisson approximation of SNP distance from seedy to assign probabilities to potential infectors; this approach assumes genetically similar sequences are more likely to be infector-infectee pairs, while also accounting for mutation rate and times of infection. 30 Third, we weight potential infectors by the number of rare variants (i.e., minority variants not seen in the consensus sequence that are rare in the population) they share with each infectee, which may be identified through deep sequence data and has previously been shown to provide additional resolution. 22 Fourth, we combine the second and third approaches, using the consensus sequence data in the event that no shared minority variants for an infectee are identified through deep sequence data. 22 For all four approaches, we then weight the probabilities identified through the sequence data by the probability of infection given the time of symptom onset of the infectee and potential infector(s) based on the serial interval distribution (i.e., the time between when an infector becomes symptomatic and their infectee becomes symptomatic).
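The first approach (inverse genetic distance, weighted by the serial interval) can be sketched as below. Several details are assumptions made for illustration and are not taken from the paper: distances are Hamming distances between equal-length consensus sequences, 1 is added to the distance to avoid division by zero for identical sequences, and the serial interval follows a gamma distribution with hypothetical shape and scale.

```python
import math


def hamming_distance(seq_a, seq_b):
    """Number of differing sites between two equal-length sequences."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


def gamma_pdf(x, shape, scale):
    """Gamma density; zero for non-positive intervals."""
    if x <= 0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def infector_probabilities(infectee, candidates, shape=2.0, scale=7.0):
    """Normalized probability that each candidate infected the infectee.

    shape and scale are hypothetical serial-interval parameters.
    """
    weights = {}
    for cand in candidates:
        distance = hamming_distance(infectee["sequence"], cand["sequence"])
        genetic_weight = 1.0 / (distance + 1)  # assumption: +1 avoids dividing by zero
        interval = infectee["onset_day"] - cand["onset_day"]
        weights[cand["id"]] = genetic_weight * gamma_pdf(interval, shape, scale)
    total = sum(weights.values())
    if total == 0:
        return {}
    return {cid: w / total for cid, w in weights.items()}


infectee = {"id": "x", "sequence": "ACGTACGT", "onset_day": 10}
candidates = [
    {"id": "a", "sequence": "ACGTACGT", "onset_day": 0},   # identical sequence
    {"id": "b", "sequence": "ACGTTTTT", "onset_day": 0},   # three differences
    {"id": "c", "sequence": "ACGTACGT", "onset_day": 12},  # onset after infectee
]
probs = infector_probabilities(infectee, candidates)
```

Candidate "c" receives zero probability because its symptom onset follows the infectee's, which is how the serial-interval weight rules out implausible directions of transmission in this sketch.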
Using each of these approaches, we then estimate the ratio of the number of cases infected by a vaccinated person to the number of cases infected by a control. We do this in three ways for each approach (see eAppendix 1; http://links.lww.com/EDE/B816 for more details). First, we weight each identified potential infector by the probability assigned to them and sum the probabilities by vaccination status. Second, we split the probabilities for each infectee into clusters based on the largest gap in probabilities between potential infectors. 31 If the gap is larger than the specified threshold, we use the normalized probabilities from the infector(s) in the top cluster; otherwise, we exclude that infectee from the analysis. Third, we use only the vaccination status of the most likely infector(s) for each infectee. Using the estimated ratio of the trial status of the potential infectors and the estimate of VE S from the trial, we then estimate VE I , using the equation above. To incorporate the data from the network obtained through contact tracing efforts during an epidemic, we also conduct all of the approaches described above in a data set restricted to only potential infectors who are contacts of the infectees (i.e., connections in the network model). In the baseline scenario, we assume sequence and contact data are available for all individuals; we then relax this assumption in sensitivity analyses.

TABLE 1. Parameters Used in the Simulations (where two values are listed, the second is used in sensitivity analyses)
Basic reproduction number: 1.5
Incubation period (ref. 28): 9.7 days
Infectious period (ref. 28): 5 days
VE S: 0.6; 0.8
VE I: 0.3; 0.7
Number of communities: 2
Size of community: 5,000
Probability of connection within community: 0.02
Probability of connection between communities: 0.001
Importations from main population over trial period (ref. 25): 20
Trial length (days): 300
Genome length (ref. 9): 18,958
Mutation rate (per genome per generation) (refs. 9, 40, 41): 0.012; 0.003
Bottleneck (size of pathogen inoculum at transmission): 16

TABLE 2. Expected Number of Infections for Each Type of Infector-Infectee Pair
Ratio of infectors: θΦ
A proportion p is randomized to vaccine and to control. In the absence of vaccination, a proportion a would become infected, and a proportion 2q of all exposures to infection of participants would come from other trial participants (with 1 − 2q external exposures). θ = 1 − VE S, or the risk ratio for becoming infected if one receives vaccine vs. control, and Φ = 1 − VE I, or the relative infectiousness of a vaccinated person who becomes infected to a control who becomes infected.
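The three ways of turning per-infectee infector probabilities into an infector ratio, as described in the Methods text, might look like the following sketch. The data, the gap threshold of 0.3, the function names, and the handling of an infectee with a single candidate are all assumptions for illustration.

```python
def top_cluster(probs, threshold):
    """Way 2: keep candidates above the largest gap in sorted probabilities.

    Returns None (infectee excluded) when the largest gap does not exceed
    the threshold.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return dict(ranked)  # assumption: a lone candidate forms the top cluster
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[cut] <= threshold:
        return None
    top = ranked[: cut + 1]
    total = sum(p for _, p in top)
    return {cand: p / total for cand, p in top}


def most_likely(probs):
    """Way 3: keep only the most likely infector(s); ties share equally."""
    best = max(probs.values())
    winners = [cand for cand, p in probs.items() if p == best]
    return {cand: 1.0 / len(winners) for cand in winners}


def infector_ratio(prob_by_infectee, arm, transform=None):
    """Sum probabilities by trial arm (Way 1 when transform is None)."""
    totals = {"vaccine": 0.0, "control": 0.0}
    for probs in prob_by_infectee.values():
        kept = probs if transform is None else transform(probs)
        if not kept:
            continue  # infectee excluded from the analysis
        for cand, p in kept.items():
            totals[arm[cand]] += p
    return totals["vaccine"] / totals["control"]


# Hypothetical candidates and probabilities for three infectees.
arm = {"a": "vaccine", "b": "control", "c": "control", "d": "vaccine"}
prob_by_infectee = {
    "x": {"a": 0.7, "b": 0.2, "c": 0.1},
    "y": {"b": 0.5, "c": 0.5},
    "z": {"c": 0.9, "d": 0.1},
}
ratio_weighted = infector_ratio(prob_by_infectee, arm)
ratio_clustered = infector_ratio(prob_by_infectee, arm, lambda p: top_cluster(p, 0.3))
ratio_most_likely = infector_ratio(prob_by_infectee, arm, most_likely)
```

In this toy data set, infectee "y" has no gap between its two candidates and is therefore dropped by the clustering rule, while the other two rules keep it. Restricting to contacts would amount to removing non-contact candidates from each inner dictionary before calling infector_ratio.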

Standard Error
To account for the fact that the data to inform VE I estimates are obtained from clusters of infected individuals and their potential infector(s), we propose the following procedure for estimating the standard errors of the estimates under the approaches that perform best and of the perfect knowledge estimate. For a given simulation and estimate of VE I , we obtain a bootstrap estimate of the standard error as follows. We first sample with replacement from the infected individuals. We then construct a bootstrapped data set using each infected individual from the sample and all of their potential infectors identified by the approach. We estimate VE I from the bootstrapped data set and then repeat these steps 100 times. The standard deviation of the 100 bootstrapped estimates is the standard error of the VE I estimate. This approach could be used with real data observed in a trial and resembles the bootstrapping clusters approach (i.e., clusters are treated as units for resampling) for clustered data. 32 Code is available: https://github.com/rek160/ADAGIO_WPA.
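The bootstrap procedure above can be sketched as follows. This is not the code from the linked repository. Each cluster is represented, as an assumption, by the probability mass that an infectee's potential infectors contribute to each trial arm, and the function names and toy data are hypothetical.

```python
import random
import statistics


def ve_i_from_clusters(clusters, ve_s):
    """Estimate VE_I from per-infectee probability mass by infector arm."""
    vaccine = sum(c["vaccine"] for c in clusters)
    control = sum(c["control"] for c in clusters)
    if control == 0:
        return None  # ratio undefined for this data set
    return 1.0 - (vaccine / control) / (1.0 - ve_s)


def bootstrap_standard_error(clusters, ve_s, n_boot=100, seed=0):
    """Resample infectees (with their potential infectors) with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(clusters) for _ in clusters]
        estimate = ve_i_from_clusters(resample, ve_s)
        if estimate is not None:  # undefined replicates are skipped
            estimates.append(estimate)
    # Sample standard deviation of the bootstrapped estimates.
    return statistics.stdev(estimates)


# Hypothetical data: one cluster per infectee.
clusters = ([{"vaccine": 1.0, "control": 0.0}] * 28
            + [{"vaccine": 0.0, "control": 1.0}] * 100)
point_estimate = ve_i_from_clusters(clusters, ve_s=0.60)
standard_error = bootstrap_standard_error(clusters, ve_s=0.60)
```

Resampling whole clusters rather than individual infector-infectee links keeps each infectee's candidate infectors together, which is the property the clustered-bootstrap analogy relies on.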

RESULTS
As expected, under perfect knowledge of the transmission network, VE I is estimated correctly (median of 500 simulations: estimate = 0.29, standard error = 0.19), whereas imperfect reconstruction of the transmission networks using sequence data results in estimates biased toward the null, away from the true VE I of 0.30 (Figure). This imperfect reconstruction is due to the identification of multiple potential infectors for each infectee. For example, another infectee infected by the infector of an index case may share the same number of rare variants as the index case and thus be identified in the top cluster of potential infectors. Of the methods using only sequence data, the shared variant approach using deep sequence data and the hybrid approach return results closest to the true value of VE I , whereas the approaches using consensus sequence data alone return estimates closer to the null. The approaches using clustering result in more accurate estimates of VE I (Figure) than the methods weighting all possible infectors or methods using only the most likely infector(s) (eFigure 1; http://links.lww.com/EDE/B816).
In reality, sequence data are unlikely to be used in isolation, and adding epidemiologic data from the contact network decreases the bias. Using the infector(s) identified from the hybrid and shared variant approaches among potential infectors restricted to contacts results in median estimates of 0.28, close to the true value of 0.30 (Figure; eFigure 1; http://links.lww.com/EDE/B816). As expected, when sequence and contact tracing data are incomplete, the estimates are biased further toward the null (eFigure 2; http://links.lww.com/EDE/B816); however, general trends of relative accuracy of the different approaches remain.
The ability to accurately reconstruct transmission networks was previously found to be influenced by parameters such as the bottleneck size and the mutation rate. 22 Varying these and other parameters in our simulations shows similar results to the baseline scenario (eFigures 3-6; http://links.lww.com/EDE/B816), with the hybrid and shared variant approaches performing worse with a lower mutation rate (eFigure 3; http://links.lww.com/EDE/B816) and a lower bottleneck size (eFigure 4; http://links.lww.com/EDE/B816), as expected because less shared variant information is available in these settings. 22 The magnitude of the bias for each approach may vary with different true values of VE S and VE I . For example, in scenarios with higher VE S (eFigure 6; http://links.lww.com/EDE/B816), we find that the approaches using only contact data perform better than under the baseline scenario; this may be due to fewer possible infector-infectee pairs given the vaccine's efficacy at preventing infection, decreasing the potential for misclassification.

DISCUSSION
In the case of an outbreak of an emerging infectious disease, the ability to rapidly define the efficacy and potential impact of a vaccine is crucial for improving public health and informing policy decisions. An important component of vaccine efficacy, which is often overlooked, is its ability not only to guard against acquisition of infection by vaccinated individuals but also to prevent onward transmission from those who are vaccinated but nevertheless become infected. VE I is important for fully understanding and modeling the impact of a vaccine, and both sequence and contact tracing data have the potential to allow us to estimate VE I in large individually randomized controlled trials conducted during an epidemic. Previously, this estimate was only attainable from household and partner studies. [6][7][8] Advance planning and understanding of the data requirements are critical for obtaining efficacy estimates during the uncertain and urgent setting of an outbreak.
We find that while sequence and contact tracing data have the potential to enable estimation of VE I , misclassification of the trial status of infectors due to imperfect reconstruction of the transmission network biases VE I estimates toward the null and overall limits our ability to detect an effect of the vaccine on infectiousness. Given the consistent direction of the bias, if an estimate is obtained in a trial using the methods described here, it is expected to be an underestimate of the true VE I . The approaches using the top cluster of most likely infector(s) identified from the deep sequence shared variants and hybrid data perform the best of all of the methods using sequence data alone and remain the most accurate when contact tracing data are incorporated. If deep sequencing data are not available, relying on contact tracing data becomes even more important. The substantial improvement in the estimates when restricting to contacts further underscores the importance of contact tracing for reconstructing transmission networks. These results suggest that even as the use of sequencing during outbreaks continues to expand and the technology continues to improve, there remains a critical role for traditional epidemiologic data; the two data sources complement each other and together can provide information above and beyond each on its own.
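A small calculation can illustrate why misclassification pulls the estimate toward the null. It is a simplified sketch, not the paper's simulation. It assumes that a fraction of infectees have their infector replaced by a randomly chosen infected participant, and that infected participants are vaccinated and control in a ratio of roughly θ : 1, where θ = 1 − VE S. The function name is hypothetical.

```python
def expected_ve_i_estimate(ve_s, ve_i, misassigned_fraction):
    """Expected VE_I estimate when some infectors are assigned at random."""
    theta = 1.0 - ve_s
    phi = 1.0 - ve_i
    # Share of vaccinated infectors among correctly identified infectors
    # (vaccinated : control = theta * phi : 1).
    true_share = theta * phi / (1.0 + theta * phi)
    # Assumption: a misassigned infector is a random infected participant,
    # so vaccinated : control is approximately theta : 1.
    random_share = theta / (1.0 + theta)
    share = ((1.0 - misassigned_fraction) * true_share
             + misassigned_fraction * random_share)
    rho = share / (1.0 - share)
    return 1.0 - rho / theta


no_error = expected_ve_i_estimate(0.60, 0.30, 0.0)    # recovers the true VE_I
half_error = expected_ve_i_estimate(0.60, 0.30, 0.5)  # between 0 and the true VE_I
all_error = expected_ve_i_estimate(0.60, 0.30, 1.0)   # collapses to the null
```

Under these assumptions the estimate moves monotonically from the true VE I toward zero as the misassigned fraction grows, consistent with the direction of bias reported here.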
We have used Ebola as an example pathogen in these simulations; however, these methods can be applied during vaccine trials conducted in outbreaks of other emerging infectious diseases driven by human-to-human transmission. Previous work has pointed to the potential of shared variants identified in deep sequence data to inform transmission. [22][23][24] The intuition of this approach is that the pathogen population within an infected host is not composed of identical genomes but contains some polymorphisms (depending on the population size and the mutation rate). If the transmission bottleneck is sufficiently large, more than one of these genotypes may be transmitted, and the finding that individuals share the resulting polymorphism is then a likely indication of transmission. The methods described here will therefore have variable efficacy for different pathogens. For example, influenza has a high mutation rate, 9 so there is likely sufficient phylogenetic signal and within-host variation to support reconstruction of the transmission network and estimation of VE I . Initial genomics analyses of SARS-CoV-2 found a low mutation rate 33 ; however, the recently identified B.1.1.7 lineage suggests increased rates. 34 Additionally, there is evidence of minority variants detectable by deep sequencing, 35 meaning deep sequencing approaches may have the potential to be used in ongoing vaccine trials to estimate VE I .
Many simplifying assumptions have been made, which could be relaxed in future work. We assume perfect knowledge of infection and recovery times, allowing us to accurately identify the direction of transmission in infector-infectee pairs; in reality, particularly for pathogens with short incubation periods, the direction of transmission may be less clear. At baseline, we assume complete and correct sampling of sequence data (which in turn means that everyone in the community is a participant in the trial, as we assume, or at a minimum is followed up in the trial), full knowledge of the contact network, and complete contact tracing. Relaxing these assumptions results in estimates further biased toward the null, although the general trends in accuracy of approaches remain the same when data are available for at least 50% of the individuals (eFigure 2; http://links.lww.com/EDE/B816). Below 50%, estimates for some approaches become less reliable as many simulations do not have enough data to produce estimates. Approaches such as those in the TransPhylo R package could be used to assess where cases are likely missing and the overall proportion of the outbreak that has been sampled. 36 A naïve Bayes approach using additional data on individuals in the trial, such as demographic or geographic covariates, has been shown to improve reconstruction of the transmission network when limited sequence and contact tracing data are available. 37 Our methods further absorb the limitations of the seedy package, which assumes neutral evolution and does not permit superinfection, although this latter limitation is likely more of a concern for endemic rather than epidemic disease models.
Estimates from household and partner studies, as well as data obtained in large population-level trials, such as the direct effect and information on viral shedding, can, with assumptions, be used in models to estimate a vaccine candidate's indirect effects. 2,38 The ability to directly estimate vaccine efficacy against infectiousness in large population-level controlled trials would provide important additional data points for better understanding efficacy of vaccine candidates in different settings. Here, despite our simplifying assumptions, this work highlights the potential for existing data sources to be used in the midst of an outbreak to estimate this key measure of vaccine efficacy at a population scale. It further identifies the data sources that will lead to the most accurate estimation and can thus be used for better targeting of the limited resources available for data collection in the midst of an epidemic.