Abstract
Quantifying variation of individual infectiousness is critical to inform disease control. Previous studies reported substantial heterogeneity in transmission of many infectious diseases (including SARS-CoV-2). However, those results are difficult to interpret since the number of contacts is rarely considered in such approaches. Here, we analyze data from 17 SARS-CoV-2 household transmission studies conducted in periods dominated by ancestral strains, in which the number of contacts was known. By fitting individual-based household transmission models to these data, accounting for number of contacts and baseline transmission probabilities, the pooled estimate suggests that the 20% most infectious cases have 3.1-fold (95% confidence interval: 2.2-4.2 fold) higher infectiousness than average cases, which is consistent with the observed heterogeneity in viral shedding. Household data can inform the estimation of transmission heterogeneity, which is important for epidemic management.
One Sentence Summary In this study, variation of individual infectiousness is quantified. Potential sources of such variation, particularly heterogeneity of viral shedding is discussed.
INTRODUCTION
Characterizing transmission is critical to control the spread of an emerging infectious disease. The reproductive number is the widely adopted measure of infectiousness. However, it only measures the average number of secondary cases infected by an infected person, not the heterogeneity in the number of transmissions. Variation of individual infectiousness is particularly highlighted by superspreading events (SSEs), in which a minority of cases are responsible for a majority of transmission events. Such phenomena, illustrated by the “80/20 rule” (i.e., 20% of cases responsible for 80% transmission (1, 2)), have been observed in emerging infectious disease outbreaks (3), including severe acute respiratory syndrome (SARS) (4), middle east respiratory syndrome (MERS) (5, 6) and most recently the COVID-19 pandemic (1, 2, 7-9). In these outbreaks, the proportion of cases attributed to 80% transmission (p80), and the dispersion parameter (k) were estimated by fitting the negative binomial distribution to the number of secondary cases (3) as a measure of transmission heterogeneity.
However, the number of contacts per index cases is often not reported in SSE studies, and hence not incorporated in the analyses. In addition, SSE studies usually analyze clusters from different settings, in which the baseline transmission risk and density of exposure could be different (10). Finally, studies of transmission heterogeneity that focus on SSEs described in the literature may suffer from publication bias, with larger clusters having higher probability of being observed and reported (8). Therefore, the observed heterogeneity in the number of secondary cases could be a result of large number of contacts in SSE settings, or confounding from these factors (11, 12), instead of variation in individual infectiousness.
Households are one of the most important settings for SARS-CoV-2 transmission, with 4-fold to 10-fold higher transmission risk than other places (10). Hence, household transmission studies provide an ideal setting to quantify variations in individual infectiousness. In a household transmission study, an index case is identified, and their household contacts are followed-up for one to two weeks, during which there is high transmission potential (13). Therefore, the number of contacts is known while transmission risks and reporting biases can be controlled. We aim to characterize the variation of individual infectiousness by analyzing data from household transmission studies.
RESULTS
Estimating the variation of individual infectiousness from household studies
We conducted a systematic review to gather information on the number of secondary cases with the number of household contacts for each household, in the form of number of households with X cases among households of size Y. In total, we identified 17 studies, comprising 13098 index cases and 31359 household contacts (Fig. S1, Table S1) (14–30). Most studies covered the cases from January to November 2020, which was dominated by ancestral strains, except for Layan et al. (from December 2020 to April 2021) and Hsu et al. (from January 2020 to February 2021), which covered both ancestral strains and the alpha variant.
Process of systematic review
Summary of characteristic of identified studies.
We then developed a statistical model to quantify the degree and the impact of variation of infectiousness of cases on transmission dynamics. The individual-based household transmission model describes the probability of infection of household contacts as depending on the time since infection in other infected persons in the household, so that infections from outside the household (community infections), or infections via other household contacts rather than the index case (tertiary infections) are allowed (31–33). We extend this model by adding a random effect (δi) on the individual infectiousness of cases. Here, the relative infectiousness of case i compared with case j is exp(δi)/exp(δj). The parameter for variation in individual infectiousness (hereafter denoted as infectiousness variation, σvar) is the standard deviation (SD) of the random effect characterizing individual infectiousness, so that δi follows Normal distribution with mean equal to 0 and SD equal to σvar.
We separately fit the models to 14 studies with more than 150 contacts (14–27) (Fig. 1, Table S2). For 12 studies out of 14, models with infectiousness variation perform substantially better (range of ΔDIC: 5.8-268) (Figure 2, Table S2). From these 12 studies, the estimated infectiousness variation (σvar) ranged from 1.03 to 2.83. This suggests that, the 20% most infectious cases are 2.4-fold to 10-fold more infectious than the average case. Based on the two largest studies with 6782 and 3727 households (14, 15), the estimated infectiousness variation is 1.48 (95% credible interval (CrI): 1.29, 1.7) and 1.41 (95% CrI: 1.19, 1.72), suggesting that, among all cases, the 20% most infectious are 3.5-fold (95% CrI: 3.0-4.2 fold) and 3.3-fold (95% CrI: 2.7-4.3 fold) more infectious than the average case. The estimated daily probability of infection from outside the household and estimated person-to-person transmission probability within households are ranged from 0.0003 to 0.017, and from 0.06 to 0.51 respectively. The estimates of parameters for the relationship between number of contacts and transmission (larger value indicates stronger inverse association) are ranged from 0.43 to 0.92, except for the study by Layan et al (18) where it is equal to 0.2. For all studies, the predicted final size distribution is consistent with the observed data and the model fit is judged adequate (Table S3-S4).
Summary of model estimates
Model adequacy check for Lyngse et al. Each element of the table has the format “observed frequency – expected (posterior mean) frequency (95% Credible interval).
Model adequacy check for Lyngse et al. Each element of the table has the format “observed frequency – expected (posterior mean) frequency (95% Credible interval).
Figure shows the average number of contacts and SD of number of contact, SD of number of secondary cases per index cases (σsec), proportion of households in which no contacts are infected (p80), proportion of cases attributed to 80% transmission (p80) and secondary attack rate (SAR) for 17 identified studies.
Figure shows the estimates of infectiousness variation (σvar), the estimated probability of infection from community and estimated probability of infection from households, and the reduction in DIC compared with the model without infectiousness variation. Models are fitted separately to 14 household transmission studies.
We conduct random effects meta-analyses on estimates of individual infectiousness from 14 identified studies. The pooled estimate of infectiousness variation is 1.33 (95% confidence interval (CI): 0.95, 1.70), suggesting that the 20% most infectious cases are 3.1-fold (95% CI: 2.2-4.2 fold) more infectious than the average case (Figure 4). Based on this fitted distribution, we estimate that 5.9% (95% CI: 1.4%, 11.1%) and 14.9% (95% CI: 7.2%, 20.7%) of cases could be at least 8-fold and 4-fold more infectious than average cases, respectively.
In each panel, numbers represent the observed corresponding relationship for the identified studies. Panel A, B, C and D show the relationship between infectiousness variation (σvar) and secondary attack rate (SAR), proportion of households in which no contacts are infected (p0), proportion of cases attributed to 80% transmission (p80) and SD of number of secondary cases per index cases (σsec). In the bottom, numbers in bracket indicate the number of household contacts in corresponding studies.
Red line indicates the estimated distribution and the gray lines indicate the associated uncertainty. Black dashed line indicates average infectiousness (relative infectiousness equal to 1), while the purple and blue dashed lines indicate top 20% and 10% infectiousness respectively.
Proxy measures of infectiousness variation
While such a modeling approach generates valuable information on infectiousness variation from household studies, directly applying this approach may be challenging due to the complexity in modelling and estimation. There is therefore a need to determine the relationship between characteristic of household transmissions and infectiousness variation, to design and validate a proxy of infectiousness variation, such as the widely used p80. We consider the following candidate: 1) p80, defined as the proportion of index cases that had >80% of their household contacts becoming cases in the study, 2) the proportion of households in which no contacts are infected (p0), 3) the secondary attack rate (SAR, the proportion of infected contacts), and 4) the standard deviation (SD) of the distribution of number of secondary cases (σsec) (Fig. 2; Table S5).
Summary of characteristic of identified studies. SD: standard deviation.
Association between infectiousness variation estimated from household transmission models, and other statistics from 17 household studies, based on meta-regression.
Pooled estimates for duration of viral shedding for COVID-19 from 11 systematic reviews
Pooled estimates for duration of viral shedding for COVID-19 in subgroups from 7 systematic reviews
Pooled estimates for duration of infectious period for COVID-19 in subgroups from 2 systematic reviews
In meta-regression, we find the infectiousness variation is associated with p80, p0 and SAR. We estimate that doubling p80and SAR are associated with 0.70 (95% CI: 0.30, 1.09) and 0.55 (95% CI: 0.21, 0.89) unit decrease in infectiousness variation, with R-squared equal to 76% and 68% respectively. We also estimate that doubling p0is associated with 0.85 (95% CI: 0.33, 1.38) unit increase in infectious variation (R2 = 75%). In addition, we find that higher infectiousness variation is associated with only using PCR to ascertain secondary cases. Regarding other statistics, we find that σsec is positively associated with mean and SD of number of contacts. Regarding other factors, we find that higher infectiousness variation is associated with only using PCR to confirm secondary cases. Other than these associations, we find no association between these statistics and implementation of lockdown, ascertainment method of index and secondary cases, and the circulating virus of SARS-CoV-2 in the study period.
DISCUSSION
In this study, we characterize the impact of variation of individual infectiousness on heterogeneity of transmission of COVID-19 in households. We demonstrate that it can be estimated from household data using a modeling approach. The pooled estimate of infectiousness variation from 14 studies suggests that the 20% most infectious cases have 3.1-fold (95% CI: 2.2-4.2-fold) higher infectiousness compared with average index cases. This implies there is substantial variation in individual infectiousness of cases in households. This variation could be attributed to both biological factors and host behaviours.
Regarding host behaviours, multiple contact patterns, particularly by age, could contribute to the variations in infectiousness of cases. For example, mother-child contacts are usually more intense than father-child contacts, and adult cases are more capable of self-isolation within a household compared with children. Furthermore, contact pattern studies suggested that school-age children and young adults tended to mix with people of the same age (34, 35). Biological factors may also contribute to such variations. For example, the SAR for index cases with fever and cough are 1.4-fold and 1.3-fold higher than index cases without fever and cough respectively (36). Viral shedding is used as a proxy measure of infectiousness, and consistently it also has substantial variations (37). Regarding the magnitude of viral shedding, studies report high variations of temporal viral shedding patterns among individuals (38–40). In addition, the duration of viral shedding can be highly heterogeneous, with pooled estimates of mean durations ranging from 11.1 days to 30.3 days, and almost all reviews report high heterogeneity of estimates identified from literatures (37, 41–50). Such heterogeneities still exist in subgroup analyses by age and severity (37, 41-45, 50). Also, the infectious period, proxied by duration of replicant competent virus isolation, is also heterogeneous (50). One review also suggest that heterogeneity in viral shedding is an intrinsic virological factor facilitating higher dispersion parameter for SARS-CoV-2 if we compare it with the corresponding patterns in SARS-CoV-1 and pandemic influenza A(H1N1)pdm09 (51).
The observed variation in individual infectiousness is consistent with past analyses of the dispersion parameter in a negative binomial distribution fitted to number of secondary cases per index case of COVID-19 (2, 52). However, in these studies, the number of contacts are not considered. Therefore, the observed variations may not apply directly to households with limited number of contacts. Here, we explore the relationship between infectiousness variations and household characteristics, and the commonly adopted measure of heterogeneity, proportion of cases attributed to 80% transmission (p80). We find that infectiousness variation is strongly associated with p80, similar to previous analysis about superspreading, suggesting that it could be a measure of infectiousness variation in households.
Infectiousness variation is also correlated with the secondary attack rate (SAR), and the proportion of households in which no contacts are infected (p0). When secondary attack rate is higher, it is expected that more contacts are infected and therefore the observed number of secondary cases would be less heterogeneous. The strong association between p0 and infectiousness variation suggests that the observed variations may also be attributed to some cases that are less infectious than average, which is consistent with Hong Kong data (53). In contrast, the SD of the distribution of number of secondary cases (σsec) is only weakly correlated with infectiousness variation. This is because σsec is highly correlated with the mean and SD of number of contacts, suggesting that it may depend on distribution of number of contacts, and hence may not be comparable among studies. We find higher infectiousness variation is associated with only using PCR to confirm secondary cases. One potential reason is that using other methods may lower the sensitivity of detecting infection, resulting in lower estimates of SAR and hence higher estimates of infectiousness variation. Further investigations are needed to explore roles of other potential factors affecting infectiousness variation, such as contact frequency among different regions (35).
An important limitation of our study is that we do not have individual-level data. Therefore, we are unable to determine the impact of demographic factors like age and sex on infectiousness variation. Also, we cannot disentangle the host behaviors from biological factors. Previous analysis suggested no evidence of the impact of age on infectiousness of cases (2, 9). In addition, we could not include factors affecting susceptibility to infection. Our estimates of infectiousness variation should be interpreted in light of these limitations: they capture heterogeneity in infectiousness due to demographic, host and biological factors. If we were able to account for these factors thanks to the analysis of more detailed datasets, our estimates of infectiousness variations would likely be lower. However, in one study that included susceptibility component in the estimation of individual infectiousness, substantial heterogeneity remained with 20% of cases estimated to contribute to 80% of transmission (22). Second, the recruitment methods among studies are different. This may affect the comparability of the results, although all index cases are laboratory-confirmed in all studies (32). Finally, most of our identified studies were conducted in the period of circulation of ancestral strains, and therefore the identified infectiousness variation may not be directly applicable to other variants.
In conclusion, we developed a modeling approach to estimate variation in individual infectiousness from household data. Result indicates that there is substantial variation in individual infectiousness, which is important for epidemic management.
MATERIALS AND METHODS
Study design
The aim of this study was to develop a statistical model to quantify the variation of individual infectiousness in households, based on publicly available information. An index case was defined as the first detected case in a household, while secondary cases were defined as the identified infected household contacts of the index case. We conducted a systematic review to collect household studies with at least 30 households, reporting the number of secondary cases with number of household contacts for each household for COVID-19, in the form of number of households with X cases among households of size Y. For each study, we also extracted the study period, the coverage of tests of household contacts, the case ascertainment methods, the circulating virus of SARS-CoV-2, and the public health and social measures in the study period. This information was used as an input of modeling analyses in this study. Details of systematic review could be found in Materials and Methods Section 1.
Estimation of variation of individual infectiousness in households
To determine if there are variations of individual infectiousness of cases, we use an individual-based household transmission model (31–33). The model describes the probability of infection of household contacts as depending on the time since infection in other infected people in the household, while infections from outside the household (community infections), or infections via other household contacts rather than the index case (tertiary infections) are allowed. We extend the model by adding a random effect parameter (δi) on the individual infectiousness of each case. Here, δi follows a normal distribution with mean 0 and standard deviation σvar, which quantify the variation of individual infectiousness (hereafter denoted as infectiousness variation). The relative infectiousness of case i compared with case j is exp(δi)/exp(δj).
Since the data were extracted from publication, the infection time of cases was unavailable. Also, the individual infectiousness parameters δi for cases were augmented variables. Therefore, we conducted our inference under a Bayesian framework with using data-augmentation Markov chain Monte Carlo (MCMC) algorithm to joint estimate the model parameters, the missing infection time and augmented variables with using metropolis-hasting algorithm. We separately fitted this model to identified studies. We assessed the model adequacy by comparing the observed and expected number of infections in households by household size. We compared the model with or without the random effect for variation of individual infectiousness by using Deviance Information Criterion (DIC) (54). Smaller DIC indicated a better model fit. DIC difference >5 was considered as substantial improvement (55). Details of the model and inference could be found in Materials and Methods Section 2 and 3.
Statistical analysis
We conducted random effects meta-analyses on identified studies to obtain pooled estimates of individual infectiousness and p80, using the inverse variance method and restricted maximum likelihood estimator for heterogeneity (56–59). Cochran Q test and the I2 statistic were used to identify and quantify heterogeneity among included studies (60, 61). An I2 value more than 75% indicates high heterogeneity (61). We conducted a meta regression analysis to explore the association between these statistics, and further including the following factors: the mean and SD of number of household contacts, implementation of lockdown, ascertainment method of index and secondary cases, only ancestral strains are circulating in study period, and all household contacts were tested.
Data Availability
The data and computer code (in R languages) for conducting the data analysis can be downloaded from https://github.com/timktsang/covid19_transmission_heterogeneity
Funding
This project was supported by the Health and Medical Research Fund, Food and Health Bureau, Government of the Hong Kong Special Administrative Region (grant no. COVID190118; BJC) and the Collaborative Research Fund (Project No. C7123-20G; BJC) of the Research Grants Council of the Hong Kong SAR Government. BJC is supported by the AIR@innoHK program of the Innovation and Technology Commission of the Hong Kong SAR Government. SC acknowledges financial support from the Investissement d’Avenir program, the Laboratoire d’Excellence Integrative Biology of Emerging Infectious Diseases program (grant ANR-10-LABX-62-IBEID), the EMERGEN project (ANRS0151), the INCEPTION project (PIA/ANR-16-CONV-0005), the European Union’s Horizon 2020 research and innovation program under grant 101003589 (RECOVER) and 874735 (VEO), AXA and Groupama. TKT acknowledges the Seed Fund for Basic Research (202111159118) from the University of Hong Kong.
Author Contributions
Tim K. Tsang: Conceptualization, Methodology, Software, Data curation, Writing-Original draft preparation, Visualization, Validation
Xiaotong Huang: Data curation, Investigation, Visualization
Can Wang: Data curation, Investigation
Sijie Chen: Investigation, Visualization
Bingyi Yang: Conceptualization, Writing-Reviewing and Editing
Simon Cauchemez: Conceptualization, Methodology, Supervision, Writing-Reviewing and Editing
Benjamin J. Cowling: Conceptualization, Methodology, Supervision, Writing-Reviewing and Editing
Competing interests
BJC consults for AstraZeneca, Fosun Pharma, GlaxoSmithKline, Moderna, Pfizer, Roche and Sanofi Pasteur. The authors report no other potential conflicts of interest.
Supplementary Materials
1 Systematic review
We conducted a systematic review following the Preferred Reporting Items for Systematic Review and Meta-analysis (PRISMA) statement (62). A standardized search was done in PubMed, Embase and Web of Science, using the search term “(“COVID-19” OR “SARS-CoV-2”) AND (“household” OR “family”) AND (“transmission”)”. The search was done on 22 June, 2022, with no language restrictions. Additional relevant articles from the reference sections were also reviewed.
Two authors (XH and CW) independently screened the titles and extracted data from the included studies, with disagreement resolved by consensus together with a third author (TKT). Studies identified from different databases were de-duplicated.
2 Household transmission model
Our aim was to explore the role of variation of individual infectiousness on transmission, with accounting for the variation and number of contacts, and the transmission probability. Therefore, we extended the household transmission model to account for variation of individual infectiousness of index cases (31). We added a random effect on infectiousness of cases (33). In the model, the hazard of infection of individual j at time t from an infected household member i, with infection time ti in household k, was

where λh was the baseline hazard, δi was the relative individual infectiousness of cases, which followed a normal distribution with mean 0 and standard deviation σvar.
Xk was the number of household contacts. β was the parameter describing the relationship between number of household contacts and transmission rate. It ranged from 0 to 1, with 0 indicating that the transmission rate is independent of number of household contacts while 1 indicates that the transmission rate is inversely proportional to number of household contacts (i.e. dilution effect of the contact time per contact which is lower when the number of household contacts is higher). f(.) was the infectiousness profile since infection generated from the assumed incubation period (mean equal to 5 days) and infectious period (mean equal to 13) (40, 63, 64).
The daily hazard of infection from outside the household is assumed to be λc(t) = λc. Hence, the total hazard infection of individual j on day t was

Based on the transmission model, the probability that an individual i was infected with infection time ti0 was

where zi1 was the start of the follow-up day of individual i. Denote the total follow-up time for an individual i as zi2, then the probability that an individual i did not get infected within the follow-up period is

Hence the log-likelihood function L was

Index cases did not contribute to the likelihood and hence the summation was only on household contacts.
3 Inference, model comparison and validation
The infection time for cases were unavailable in the data extracted from publication. Also, the individual infectiousness parameter δi was an augmented variable. Therefore, we conducted our inference under a Bayesian framework with using data-augmentation Markov chain Monte Carlo (MCMC) algorithm to joint estimate the model parameters, the missing infection time and the relative individual infectiousness with using metropolis-hasting algorithm. For the parameter that only takes positive value, we used a vague Uniform(0,10) prior. For the parameter for relationship between number of household contacts and transmission, we used Uniform(0,1). The algorithm runs for 50000 iterations after a burn-in of 50000 iterations. Converge was visually assessed. One run takes about 3 hours on our typical desktop computer. We fitted the model to all studies, but only 14 of them could converge and provide robust estimates of individual infectiousness.
We assessed the model adequacy by comparing the observed and expected number of infections in households by household size. We simulated 10000 datasets with a structure identical to that of the observed data (in terms of number of household contacts), with parameters randomly drawn from the posterior distribution of model parameter.
We compared the model with or without the random effect for variation of individual infectiousness by using Deviance Information Criterion (DIC) (54). Smaller DIC indicated a better model fit. DIC difference >5 was considered as substantial improvement (55). Since the likelihood of observed data could not be directly evaluated for a given model (65), we used an importance sampling approach to estimate the likelihood for the observed data and evaluate the DIC (66, 67). For each household, we simulated 2000 datasets, with parameters randomly drawn from the posterior distribution. Then we compared the observed data and simulated data. The contribution to the likelihood of a household was equal to the proportion of simulated data with infection status that exactly matched the observed data, for all household members. To avoid the problem of 0-valued likelihood, we used the approach developed by Cauchemez et al (66), and assumed the sensitivity and specificity for diagnosing a case were both 99.99%.
4 Data and code availability
The data and computer code (in R languages) for conducting the data analysis can be downloaded from https://github.com/timktsang/covid19_transmission_heterogeneity
Acknowledgments
The authors thank Maylis Layan for helpful discussion and Jiayi Zhou for technical assistance.