Quantifying heterogeneity in SARS-CoV-2 transmission during the lockdown in India

The novel SARS-CoV-2 virus shows marked heterogeneity in its transmission. Here, we used data collected from contact tracing during the lockdown in Punjab, a major state in India, to quantify this heterogeneity, and to examine implications for transmission dynamics. We found evidence of heterogeneity acting at multiple levels: in the number of potentially infectious contacts per index case, and in the per-contact risk of infection. Incorporating these findings in simple mathematical models of disease transmission reveals that these heterogeneities act in combination to strongly influence transmission dynamics. Standard approaches, such as representing heterogeneity through secondary case distributions, could be biased by neglecting these underlying interactions between heterogeneities. We discuss implications for policy, and for more efficient contact tracing in resource-constrained settings such as India. Our results highlight how contact tracing, an important public health measure, can also provide important insights into epidemic spread and control.

: The data from Punjab. (A) Timeseries of reported cases in Punjab during the period of lockdown in the state (red bars) and those due to the Nanded event (black bars), and total cases from early March to the middle of June. (B) Visualisation of case clusters in the dataset, and their linkages from self-reported contacts. This network-type graph requires assumptions (see Materials and Methods). Most individuals infected only few others, while a few infected many: overall, 10% of cases accounted for 80% of infection events.

4
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Joint degree/PCI distribution with ρ = 0 Table 1: List of the different models used, for capturing heterogeneity in the population. 'Secondary case distributions' (models 1 -2) are as in Fig. 2A. They ignore any interactions between degree and PCI, and instead aim only to capture variation in the numbers of secondary cases per index case. By contrast, 'Joint distributions' aim to model the associations shown in Fig.  2D. They employ the bivariate normal distribution described in the Materials and Methods, with correlation ρ.
the secondary case distribution. Performing these analyses on the full data for seeds (including 1 returnees as well as the 'core' group) shows qualitatively similar results (see Fig. S4). We next 2 examined the implications of these associations, for transmission dynamics. address (ii), we additionally modelled the log-transformed degree and logit-transformed PCI 10 as following a bivariate normal distribution, with correlation ρ (see Table 1, and Materials and 11 Methods). Consistent with Fig.2D, we assumed a range of values for ρ, from -0.4 to 0.
12 For the transmission model we implemented a simple network simulation, in an assumed 13 6 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. of secondary cases amongst 'seeds' (i.e. first cases in each cluster shown in Fig. 1B). Also shown, for comparison, are the best-fitting Poisson distribution (with λ = 1.4)), and the best-fitting negative binomial distribution (with distribution parameters r = 0.067, k = 0.1). The difference between the latter two curves illustrates the strong extra-Poisson variation in the secondary case distribution. (B) Scatter plot of secondary cases vs degree, at the individual level. The secondary case and degree distributions are shown at the logarithmic scale, and adjusted by 1 to account for zeros, to address skewness of the distributions. Although both secondary case and degree distributions show a strong right-skew (panel A), this figure illustrates that the latter does not explain the former: despite a positive relationship between the two distributions, a substantial number of individuals with low degree generate some infections, while many with high degree generate zero onward infections. (C) Estimated marginal density of per-contactinfectiousness (PCI) that, alongside degree, is needed to explain the heterogeneity in secondary cases. Shaded intervals show 95% Bayesian credibility intervals. (D) Estimated PCI vs degree. The figure displays relationship between the logarithm of the odds (logit) of PCI and the logarithm of the degree. These transformations allow us to plausibly model the joint distribution of PCI and degree as a multivariate normal in section 4 (see Materials and Methods and Supporting Information). There is a discernible lower band due to a large number of cases with zero onward infections, which have very low estimated PCI. Among those with onward infections, there is a discernible negative association. 7 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . population of 3,000 individuals, consistent with the population size in this study. For simplicity 1 and generality, we simulated the epidemic in generations of infection: our simulated outbreak 2 behaviour would thus apply to any emerging infection sharing these heterogeneities (see Mate-3 rials and Methods). to accommodate a wide range for R 0 (Fig. 3A), even amongst models that capture a high 7 proportion non-infectors (Fig. 3B).

11
Even amongst the remaining models, however, there is a notable disparity in epidemiological 12 outcomes: amongst models capturing the joint distribution between degree and PCI (Models 3 13 -5), it is not possible to identify a value of ρ that matches most closely to the negative binomial  CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . https://doi.org/10.1101/2020.09.10.20190017 doi: medRxiv preprint  Table 1. Of all models, only the negative binomial secondary case distribution, and the joint degree/PCI models capture the high proportion of index cases who do not cause secondary cases (panel A). However, even amongst these models, there can be substantial variation in R 0 (panel B), owing to the role of correlation between degree and PCI. Middle panels (C,D) show epidemic outcomes over 500 time periods, assuming a 1% probability per time period, of exogenous introduction of an infectious case (here, an 'epidemic' is denoted as any simulation having a cumulative incidence > 500 cases (see Materials and Methods for rationale)) . Uncertainty intervals arise from repeating simulations 250 times, and reflect 95% simulation intervals. (E) Modelled timecourse of incidence, when aggregated over 250 simulations (with each simulation being interpreted here as an independent location). A Poisson secondary case distribution (in yellow) gives rise to a large surge in aggregate infection because of epidemics in multiple locations occurring in a synchronised way.

9
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. .

Efficiency of contact tracing
Although contact tracing plays an important role in the SARS- 1 CoV-2 response, in resource-constrained settings such as India, its demands on the healthcare 2 system can make it difficult to sustain. Motivated by our findings, we propose reframing contact 3 tracing with the goal of efficiently identifying individuals with high PCI. In our data overall, we 4 estimate that if an individual caused at least one onward infection, there is a 79% probability 5 that they caused at least two onward infections. We thus propose a sequential strategy where,  Simple dynamical models highlight the important role that is played by these heterogeneities.

21
At the gross level of the secondary case distribution, the high proportion of zero-infectors yields 22 outbreak dynamics wherein surges can be handled by providing mobile services rather than in-23 creasing hospital capacity in every geography (Fig. 3E). The negative binomial distribution, 24 10 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . https://doi.org/10.1101/2020.09.10.20190017 doi: medRxiv preprint q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q

Fraction of infections found
Pilot size (number of contacts) q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q identifying most cases.

10
An important question, that we are not able to address using the current data, is what drives 11 11 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . the heterogeneity in per-contact infectiousness. This heterogeneity may arise, for example, from 1 biological factors such as the role of pre-existing, cross-reactive immunity that may moderate 2 viral load in some individuals more effectively than others (14). Our analysis suggests that PCI 3 increases with age and is significantly associated with sex (Fig. S5). Further data on these and 4 other individual-level characteristics would be invaluable in further examining key risk factors 5 for infectiousness. Where risk factors involve individual characteristics that can be readily iden-6 tified in newly diagnosed patients, such as viral load, these factors could also play an important 7 role in guiding future contact tracing efforts. However, heterogeneity in PCI could also reflect 8 variations in the closeness of reported contacts, with some reporting only the closest contacts 9 and others reporting wider contacts, thus explaining the negative correlation. We emphasise that 10 our data is limited to defined 'high-risk' contacts (see Materials and Methods), thus excluding 11 incidental contacts that might be expected to bias our estimates the most. Nonetheless, even 12 if there is variation in closeness amongst high-risk contacts, our analysis offers an approach 13 for adjusting for these variations, when interpreting what routine contact tracing data means for 14 transmission: in this case our estimates for PCI should be regarded as a data-driven weighting of 15 contacts, rather than infectiousness. Our approach can easily be adapted for any dataset where 16 there is additional information on closeness of contact.

17
Amongst other limitations, the contact tracing data was collected, not under controlled study 18 conditions, but as part of a public health response, by the Government of Punjab. Our approach 19 to this data is pragmatic, recognising some inherent limitations: there may be false negatives 20 in the data if people were tested too early, or indeed if people had been infected long in the 21 past, which we cannot tell in the absence of serological tests. As with any contact tracing data, 22 our assumptions for who-infected-whom, in a given contact pair, may be imperfect. We are  Overall, the methods that we have outlined here should apply to any contact tracing database 7 and our publicly available code can be directly applied to any such data that have been collected.

8
Contact tracing forms an integral part of the response to SARS-CoV-2 around the world: while 9 being an important public health strategy in its own right, it can also provide invaluable informa-10 tion about how, and to whom, infection is being spread. Systematic analysis of this data could 11 provide important insights to inform future, smarter strategies for the control of SARS-CoV-2. 12 13 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.   CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . https://doi.org/10.1101/2020.09.10.20190017 doi: medRxiv preprint 15. Report 9: Impact of non-pharmaceutical interventions (npis.           In the data, we recoded the reasons for testing into six categories: Seed (Normal), Contact 20 (Normal), Farmer/Labour, Migrant/Returnee (Non-Nanded), Nanded, Other. During the lock-21 17 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020.  Pilgrims returning from Nanded (described in section 2), other migrants/returneees, and farm-2 ers/laborers (residing in Punjab but originally from outside it) were tested by a special protocol.

3
The "other" category consisted of certain high-risk populations like frontline healthworkers who 4 were tested for occupational reasons and their families. The categories "Seed (Normal)" and 5 "Contact (Normal)" correspond to those tested due to normal protocol -usually due to symp-6 toms, living in a containment zone, or from the contact-tracing protocol described above -as 7 the first case in a cluster or the contact of a confirmed case, respectively.
8 Table 2 displays the frequency of each reason and the percentage reporting no high risk 9 contacts, with distributions of gender, age and symptomatic status. As we might expect laborers 10 and migrants tend to be younger and more male. Of particular concern is the high percentage 11 of individuals reporting no high risk contacts in most categories (Zero Deg). This is due to a  This is a population for whom we believe we have a robust contact distribution applicable to 6 the population of Punjab and for whom we can reliably identify seeds and contacts.

7
Where such information could be ascertained, we undertook an extensive exercise of match-8 ing contacts to seeds in the entire dataset, to verify the dataset had been coded correctly.

9
Nonetheless, we were concerned about the case ascertainment problem. Assume that both 10 A and B have tested positive. B could have infected A (directly or indirectly) but we observed 11 A first and coded it as a seed. Two possibilities exist either we got it right and A is the seed, 12 or we got it wrong and B is the seed. While we can never be sure of this answer, we can test 13 the robustness of our claims to swapping seeds and contacts. While we cannot direct compare 14 onward infections and degree of contacts to seeds due to the bookkeeping problem, we can test 15 whether seeds and contacts display similar infectiousness. Indeed, we see that that the contacts 16 (2763 individuals) of normal seeds have an aggregate test positivity of 6.0% while the contacts 17 (1885 individuals) of normal "contacts" have a test positivity of 6.5%. These are statistically 18 indistinguishable (p = 0.45). Thus, as long as the coding of seed and contact by the govern-19 ment of Punjab was independent of degree (which is likely because seeds were typically tested 20 due to a biological criterion -showing symptoms -and not a social criterion), we surmise that 21 our estimates of the secondary case distribution are likely to be robust to swapping seeds and 22 contacts.

23
Beyond the core dataset, as table 2 shows, the majority of positive cases are from the Nanded 24 event, from which pilgrims were brought back on dedicated buses. This group is akin to the 25 19 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 15, 2020. . Diamond Princess experience, where multiple people were in contact with each other in close 1 quarters. As such, seeds are contacts were not well-defined in this population. individuals based on their number of contacts. As an example, consider two individuals, A and 10 B, with 2 and 100 contacts, respectively who have infected no one. We are confident that B has 11 a PCI close to zero but not so confident with A due to a small sample size. We address this issue 12 through Bayesian shrinkage. In this setting, individual estimates of PCI (p i ) from high contact 13 individuals (such as B) will be mostly unchanged while those from lower contact individuals 14 (such as A) will be shrunken towards the overall mean (17).

15
Amongst different ways of performing Bayesian shrinkage (e.g., the Beta-Binomial model), 16 we chose to model the logarithm of the odds (logit) of the PCI as following a normal distribution 17 with a common mean and variance, as this functional form is closely linked to our modelling 18 of transmission dynamics (other approaches yield similar estimates (18)). In particular, we 19 estimate: where α i is the log-odds of p i andᾱ is the common mean. Above, σ 2 α is inversely correlated to 21 the amount of shrinkage. As σ 2 α → 0, each α i is given the same value, so each p i is estimated as 22 20 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 15, 2020. . the mean infection rate. As σ 2 α → ∞, p i ≈ z i d i . In practice, the "hyperparameters" like σ 2 α and 1ᾱ are estimated using Markov Chain Monte Carlo (MCMC) methods with diffuse priors. The 2 non-zero values of σ 2 α andᾱ guarantee that our estimated p i is between 0 and 1.

3
In future work with additional data, it may be possible to characterize the complete joint 4 distribution of PCI and number of contacts, thus alleviating the need to shrink to a common 5 mean. For example, healthcare workers with training in mitigating infectious disease spread 6 may have many contacts, but lower PCI than would be expected from the rest of the population.

7
In such a setting, shrinking towards a single mean would underestimate the heterogeneity in 8 the PCI distribution. Most of the extreme cases are 0's in the data from Punjab, however, so in 9 practice these values will be shrunk towards a small, non-zero value under either model.

10
Mathematical modelling of transmission dynamics 11 We implemented a simple network simulation, in an assumed population of 3,000 individuals 12 (consistent with the population size in this study). For simplicity we modelled all networks 13 as random, that is, neglecting clustering and other forms of network structure of higher order 14 than the degree distribution. Also for simplicity, we simulated the epidemic in terms of genera- in Table 1, we drew 3,000 samples. We then constructed a random, directed network treating 23 21 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 15, 2020. . these samples as degrees, to construct a network of the secondary cases that any given individual 1 would cause, once themselves infected (our results in Fig. 3 are qualitatively unchanged when 2 assuming a directed network instead).

3
In figure S2, we show that the degree distribution in the data follows an approximately log 4 normal distribution, and in our discussion of Bayesian shrinkage (above) we showed that the 5 logarithm of the odd (logit) of PCI is constructed to follow a normal distribution. We note fur-6 ther that Fig. 2D and Fig. S4(D) show that the correlation between the log-transformed degree 7 (n) and logit-transformed PCI (p) is plausibly negative for those that infect others. We consider 8 a population of infected individuals. Since the log-transformed degree and logit-transformed 9 PCI each follow a normal distribution and may be correlated, the natural choice for the joint 10 degree/PCI distribution is to model the log-transformed degree and logit-transformed PCI as 11 following a bivariate normal distribution: where µ is a vector composed of the mean values for logit-transformed PCI and log-transformed 13 degree, and we have, for the covariance matrix Σ: where σ lp is the standard deviation for the logit-transformed PCI; σ ln is the standard deviation we use for simulation differs from our Bayesian shrinkage approach presented in Fig. 2D. This 20 distinction is necessary because, in our simulations, we wish to explore variation across multiple 21

22
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . similar contact and PCI distributions. In any data analysis, however, we will model conditional 1 on a particular observed contact distribution.

2
In a given simulation, we then sampled 3,000 values for degree and PCI. We constructed a 3 random, undirected network from the degree distribution. For each individual m, we assumed 4 that the sampled PCI p(m) applies uniformly to all of their contacts. Thus, although the link 5 between any two individuals A and B is undirected -representing a bidirectional transmission 6 risk -the transmission intensity is not necessarily the same in both directions, and depends on 7 the respective PCIs of A and B (owing to between-individual variations in infectivity).

8
Epidemic simulation For a given population constructed as above, suppose C t is the set of 9 individuals that are infective at the beginning of time-step t; S t is the set of individuals that have 10 not yet had infection and are therefore susceptible; and J t is the set of individuals that are newly 11 infected in timestep t. Further, suppose that p(m) is the sampled PCI for individual m. Then 12 we proceeded along the following iterative steps: 13 While t ≤ 500 and S t , C t both have at least one member: • Determine all contacts of m who belong to S t (for Models 1,2, regarding 'contacts' 17 as secondary cases).

18
• For each such contact, conduct a Bernoulli trial with probability p(m), to determine 19 whether infection occurs (for Models 1,2, taking p(m) = 1).

20
• Accumulate all new infections thus occurring in J t , and remove them from S t . 21 23 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . Figure S1: Frequency distributions for cumulative incidence, over 500 timesteps, for each of the models listed in Table 1 in the main text. Vertical dashed lines indicate a cumulative incidence of 500, a consistent dividing line between the two modes in these distributions. Thus Fig. 3C in the main text shows the probability mass to the left of this line, while Fig. 3D shows the mean cumulative incidence to the right of this line.  4. Increment t by 1, and iterate from (1). 4 We repeated this algorithm 250 times, for each of the models listed in Table 1. for example, in translating our simulated dynamics to timescales more specific to SARS-CoV-9 2. As mentioned above, we also take a simplified approach to the network structure, assuming 10 the simplest case of a random network, and thus ignoring the potential for clustering, or other 11 types of network topology that could be influential in transmission dynamics (21). Further 12 data on the underlying contact structure, including the retention of information on test-negative 13 contacts, would be helpful in addressing these simplifications.
14 Modelling efficient contact tracing algorithms 15 We used the following algorithm to perform the simulations for Our simulation is illustrative, with some caveats to note. First, the procedure is not opti-4 mized for the number of contacts to test. The problem we address is similar to a bandit prob-5 lem in the machine learning literature where, as more information about the PCI distribution 6 is available through testing, the number of contacts to test is optimized to maximize the (ex-7 pected) number of infections found while minimizing the number of tests. We anticipate that 8 this sequential procedure would be challenging to implement in a public health context, partic-9 ularly in a low resource setting. Instead, we opt for a simple rule that can be implemented with 10 no additional optimization (e.g. testing family members and only testing further if at least one 11 is positive) that can still substantially improve efficiency. We also need information about the 12 joint distribution of PCI and contacts for a fully optimized approach, which we cannot estimate 13 with precision in the data from Punjab. Additionally, we do not consider imperfect tests. False 14 positives would preserve the number of infections found but make the procedure less efficient 15 since some individuals will be tested based on false positives that would not otherwise be tested. 16 False negatives will reduce the number of infections found, though this impact would be miti-17 gated by the right-skew of the PCI distribution. Finally, this procedure relies on the availability 18 of tests with relatively rapid results and the willingness of individuals to be tested. If the delay 19 between testing and receiving results is too long, then contacts who were not tested in the pilot 20 stage could be infecting others in the population while waiting to be tested.

Uniform PCI simulation experiment
Frequency (in 1000 simulations) Secondary cases per index case Figure S3: Inadequacy of the degree distribution to explain the secondary case distribution. While Fig. 2B in the main text illustrates this point visually, this plot offers statistical support. Red lines show the data for the secondary case distribution, while box plots show the bestfitting projections when assuming the degree distribution illustrated in Fig. S2, and moreover that the risk-of-transmission is constant across contacts (that is, a constant PCI). Doing so yields a secondary case distribution that severely underestimates the proportion of cases that caused zero onward infections.

27
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  Figure S4: Robustness for full sample. The full heterogeneous sample is characterized in Materials and Methods. While the main text displays analysis on the core sample, the trends hold on the full dataset. Fig. S4(A) replicates Fig. 2A for the full sample, modelling negative binomial and Poisson fits to the secondary case distribution. As before, the negative binomial fits the secondary case distribution well, while the Poisson distribution underestimates the number of zero infectors. Fig. S4(B) shows a scatterplot of the logarithm of the degree distribution against the logarithm of the secondary case distribution adjusted by 1 to account for zeros and the skews in the distribution. The red points denote the core sample and the gray points denote the remainder of the sample. In both samples, although there is a discernible positive association between onward infections and degree, the heterogeneity in degree does not fully explain that of the secondary case distribution. Fig. S4(C) replicates Fig. 2C for the whole sample, showing the marginal distribution of PCIs. It is right-skewed as in the core sample. Fig. S4(D) shows the association between the log odds (logit) of the PCI and the logarithm of the degree. The red points denote the core sample and the gray points denote the remainder of the sample. Both samples are bimodal with a discernible negative association for those that infect others.

28
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.  Figure S5: Regression results for likelihood of a secondary infection, against the age, sex and symptom status of the index case. Results are from a multivariate logistic regression. Panel (a) shows the marginal effect of age on the probability of secondary infection per contact for the reference groups (asymptomatic, female). Dashed lines represent the bounds of the 95% confidence interval on the coefficients. Panel (b) shows the logistic regression coefficients for the dichotomous variables associated with being symptomatic and being male. For a person at the average age (about 42 years old) being male raises the probability of secondary infection per contact by about 0.5 percentage points (from 0.8% to 1.3%). For a person at the average age being symptomatic raises the probability of secondary infection per contact by about 1.9 percentage points (from 0.8% to about 2.7%). Recall that these estimates come from data where the majority of individuals have no secondary infections, which reduces these overall probabilities.

29
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 15, 2020. . https://doi.org/10.1101/2020.09.10.20190017 doi: medRxiv preprint