## Abstract

**Background** Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses a high risk of transmission in close-contact indoor settings, which may include households. Prior studies have found a wide range of household secondary attack rates and may contain biases due to simplifying assumptions about transmission variability and test accuracy.

**Methods** We compiled serological SARS-CoV-2 antibody test data and prior PCR test reporting from members of more than 9000 Utah households. We paired these data with a probabilistic model of household importation and transmission. We calculated a maximum likelihood estimate of the importation probability, mean and variability of household transmission probability, and sensitivity and specificity of test data. Given our household transmission estimates, we estimated the threshold of non-household transmission required for epidemic growth in the population.

**Results** We estimated that individuals in our study households had a 0.38% (95% CI 0.30% – 0.48%) chance of acquiring SARS-CoV-2 infection outside their household. Our household secondary attack rate estimate was 35% (26% – 47%), substantially higher than the crude estimate of 15% unadjusted for imperfect serological test specificity and other factors. We found evidence for high variability in individual transmissibility, with higher probability of no transmissions or many transmissions compared to standard models. With household transmission at our estimates, the average number of non-household transmissions per case must be kept below 0.40 (0.32 – 0.51) to avoid continued growth of the Utah epidemic.

**Conclusions** Our findings suggest that crude estimates of household secondary attack rate based on serology data without accounting for false positive tests may underestimate the true average transmissibility, even when test specificity is high. Our finding of potential high variability (overdispersion) in transmissibility of infected individuals is consistent with characterizing SARS-CoV-2 transmission being largely driven by superspreading from a minority of infected individuals. Mitigation efforts targeting large households and other locations where many people congregate indoors might curb continued spread of the virus.

## 1 Introduction

Since its emergence in 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for COVID-19, has spread rapidly, causing severe morbidity, mortality, and disruption to daily life. As public health officials continue grappling with reducing community spread, it is of increased importance to understand transmission risk in different locations where people mix. Transmission within households may be especially important, given the mounting evidence that indoor environments with close, sustained contact are especially high risk for SARS-CoV-2 transmission [1-3]. Furthermore, with substantial observed decreases in mobility during the pandemic [4], individuals likely are spending a greater proportion of time at home, thus increasing the importance of understanding within-household transmission. Likewise, isolation and quarantine measures recommended to help control COVID-19 frequently occur within homes, increasing risk to susceptible household members [5].

Data collected from members of households with at least one person infected with SARS-CoV-2 have revealed a wide range of within-household transmission estimates. One systematic review and meta-analysis [6] found 24 studies with household data conducted from January-March 2020, mostly in China, with secondary attack rate estimates ranging from 5% to 90% in the individual studies; pooling these data led to an average secondary attack rate estimate of 27% (95% CI: 21% – 32%). Another published review and meta-analysis of more recent data found 22 studies on the secondary attack rate in households, including estimates ranging from 4% to 32% [7]. Pooling these studies, the review found an average secondary attack rate of 17.1% (95% CI: 13.7% – 21.2%). Another review and meta-analysis found 40 household studies with individual study estimates ranging from 4% to 45% [8]. Their pooled analysis found that the household-based secondary attack rate for all household contacts was 19.0% (95% CI: 14.9 – 23.1%). Data from households in the U.S. [9-12] produced secondary attack rate estimates from 11% to 53%.

Most household studies generated data by first identifying index household cases via active or passive surveillance followed by monitoring and PCR testing of their household contacts. These studies may exhibit bias if mild or asymptomatic cases were less likely to be identified as an index household case. By contrast, data for the presence of antibodies among household members provide information on the distribution of final sizes of household outbreaks no longer in progress and in which some or none of the cases were identified at the time. We are aware of only 3 studies that used serological antibody data to estimate household transmission, using data from Spain [13], Brazil [14], and Switzerland [15].

In addition to average transmission rates, heterogeneity and variability in SARS-CoV-2 transmission have also been quantified. The amount of individual-level variation in the number of secondary infections can affect final outbreak size [16]. Large variation (i.e., overdispersion) indicates the presence of superspreading by a minority of individuals who transmit to a disproportionately large number of others [17]. Better understanding of superspreading individuals and locations can greatly enhance efficient targeting of transmission control strategies [18]. Backward contact tracing can efficiently trace sources of acquisition to high-transmission individuals and circumstances when superspreading is present [19], and efforts that target similar circumstances for transmission prevention can have disproportionate benefits [20, 21].

Studies have quantified the variability in the number of SARS-CoV-2 transmissions from infected individuals using the dispersion parameter *k*, governing the variance of a negative binomially distributed offspring distribution [22-26]. Those studies estimated high overdispersion (low values of *k*) similar to what was observed during the first SARS-CoV outbreak in 2003 [17]. These estimates were derived from data on transmissions, including superspreading events, occurring in a variety of locations both inside and outside of households. Regarding household transmission specifically, Madewell et al. [8] showed preliminary evidence of overdispersion in household data, with more households than expected experiencing extremes of transmission (i.e., either no transmission or many transmissions) from an introduced case.

In this study, we combine SARS-CoV-2 data from serological antibody tests and self-reported prior PCR tests to estimate within-household transmission of COVID-19 in Utah. Previously published secondary attack rate estimates are largely based on crude formulae which ignore the probabilities of multiple members of a household acquiring infection from the community, multiple generations of transmission within the household (i.e. secondary, tertiary, etc. transmissions), and imperfect test sensitivity and specificity. We addressed these limitations by extending previous models of final household outbreak size distributions [27] to develop a novel probabilistic model of household importation and household transmission combined with test sensitivity and specificity. Our model also quantifies variability in household transmission and the potential extent of overdispersion, to shed light on superspreading phenomena and the implications of household transmission for population-level controllability of COVID-19.

## 2 Methods

### 2.1 Data collection from Utah households

Details of our data collection process are described elsewhere [28]. Briefly, the Utah Health & Economic Recovery Outreach project involved selecting households in several counties in Utah by population sampling designed to form a set of households by which average community seroprevalence could be assessed. Any member of selected households age 12 or older were offered the opportunity to participate in a survey and serological COVID-19 antibody testing. Household members declining to provide a serological sample could opt to fill out the survey, which included the opportunity to report prior PCR test results for COVID-19. Data included in this analysis were collected in the months of May, June, and July, 2020.

The data are represented as follows. For each household in the dataset, we captured the following 7 values from the data:

*n*: total number of people in household*a*: number who were antibody tested*s*: number who responded to the survey but were not antibody tested*a*_{pp}: number who reported a prior positive PCR result and received a positive antibody test*a*_{pN}: number who reported a prior positive PCR result and received a negative antibody test*a*_{Np}: number who reported no prior positive PCR result and received a positive antibody test*s*_{p}: number who were surveyed, reported a prior positive PCR test, and did not receive an antibody test

Each of the *C* unique combinations of the above 7 values found at least once in the dataset was indexed as a vector **y**_{i}:
We tallied the number of households for which each **y**_{i} occurred in the frequency elements *f*_{i}, and represented the entire dataset by the vector **y** = (**y**_{1}, …, **y**_{C}, *f*_{1}, … *f*_{C}).

The dataset **y** and all codes, written in R version 4.0.3, used for analyses described in the following sections are posted and publicly available at https://github.com/damontoth/householdTransmission.

### 2.2 Total household infection size model

Here we derive the probabilities *M*_{kn} for the probability that *k* out of *n* total household members ended up infected. If *k* members of a size-*n* household were infected, that means that *n* − *k* members escaped being infected by a non-household member (called “community” acquisitions) and escaped being infected by any of the *n* infected within the household. Thus, our model for *M*_{kn} combines both probabilities and does not depend on the order of occurrence of household transmissions and subsequent community acquisitions after the initial one, as in similar prior formulations [27]. Also following prior formulations, we assume that active infections were not present in the households at the time of antibody data collection (i.e., that household outbreaks had reached final size), consistent with the low prevalence of active infections at the time of data collection [28].

The *M*_{kn} values depend on 3 parameters. The parameter *p*_{c} is the average per-capita probability of community acquisition, *p*_{h} is the mean transmission probability from an infected person to a fellow household member, and *d*_{h} is the dispersion parameter characterizing variability in transmissibility across infected individuals, with no assumed correlation among members of the same household.

For a given household size *n* ≥ 2, the formula for *M*_{kn} is:
For households of size *n* = 1, note that the expression involving the household transmission parameters does not apply and we have *M*_{01} = 1 − *p*_{c} and *M*_{11} = *p*_{c}.

The probability that a household of size *n* had 0 infections: *M*_{0n} = (1 − *p*_{c})^{n}, is the probability that none of the household members acquired infection from the community and does not depend on the household transmission variables because no household transmissions were possible without a community introduction. For the final number of household infections to be nonzero, there must be at least one community acquisition, which may be followed by within-household transmissions. The expression is the binomial probability that *i* out of the *n* household members had a community acquisition, and the function *T*_{xyz} is the probability that *x* already infected household members lead to a total of *y* transmissions to *z* susceptible household members. In other words, *T*_{xyz} is the probability that the final outbreak size is *x* + *y*, given that *x* household members are already infected in a house with *z* susceptible members. For efficiency of computation, the *T*_{xyz} values are calculated in order of increasing values of *y*, i.e. *T*_{x0z} for each relevant *x* and *z* value are calculated first, then the *T*_{x1z} values, then *T*_{x2z}. This allows the use of *T*_{xyz} values for lower values of *y* to be used in the formula (see Supplementary Material for details):
Within the *T*_{xyz} formula, the function *H*_{xyz} is the probability that *x* infected household members transmit infection *directly* to *y* out of *z* fellow household members who are susceptible. The *H*_{xyz} values are calculated in order of increasing values of *x* for efficient computation (see Supplementary Material):
Finally, the function *F*_{yz} (*p, d*) is the probability mass function of the beta-binomial distribution for *y* successes out of *z* trials, parameterized by a mean success probability *p* and a dispersion parameter *d*. When *d* is finite and nonzero, *F*_{yz} is derived from the binomial distribution with success probability that is a beta-distributed random variable with parameters *α* = *dp, β* = *d*(1 − *p*), with decreasing variance as *d* increases. We also make use of the boundary cases *d* = 0 and *d* → ∞. In the limit *d* → ∞, holding *p* constant, *F*_{yz} becomes the binomial distribution with constant success probability *p* (Supplementary Material). In the maximal variance limit, *d* → 0, with *p* held constant, *F*_{yz} becomes an “all-or-nothing” distribution where *y* = *z* successes occur with probability *p* and to *y* = 0 success occur with probability 1 − *p* (Supplementary Material):
The function B is the beta function. We use *F*_{yz} within the formula for *H*_{1yz} to quantify the distribution of household transmissions directly from a single infected household member, where *y* is the number of transmissions, *z* is the number of susceptible household members, *p* = *p*_{h}, and *d* = *d*_{h}.

The above formulas are derived in the Supplementary Material. Elements of this model appear in other publications. Longini and Koopman [27] derived a formula for *M*_{kn} for the model with no variability among households or individuals, equivalent to our model with *d*_{h} → ∞. While they provided a more efficient formula that takes advantage of the properties of that special case, we confirmed that our calculation scheme above reproduces the results of their formula. Becker [29] published explicit formulas for the final size of household outbreaks after a single introduction to households up to size 5 using the beta-binomial chain model, equivalent to our *T*_{xyz} for *x* = 1 and *z* up to 4. We confirmed that our scheme for calculating *T*_{xyz} produces the same results as their example formulas for arbitrary values of *p*_{h} and *d*_{h}.

### 2.3 Likelihood model

We sought to use our data to simultaneously estimate the 3 parameters (*p*_{c}, *p*_{h}, *d*_{h}) using maximum likelihood estimation (MLE). However, applying the *M*_{kn} formula directly to our data would be problematic because the true number of infections *k* in each household are not known with certainty. The data include two sources of COVID-19 test information by which prior infection status of a portion of individual household members can be probabilistically inferred: antibody test results and surveys in which participants could report results of a prior PCR test.

Antibody test results are subject to imperfect sensitivity and specificity due to false negative tests and false positive tests, respectively. To account for these, we added two additional parameters to be estimated by the MLE: *ϕ*_{A}, the probability that an antibody-tested person with a prior infection tested positive for antibodies, and *π*_{A}, the probability that an antibody-tested person with no prior infection tested negative for antibodies.

Prior PCR test results for COVID-19 virus reported on the survey also do not perfectly identify those with prior infections. To quantify this imperfection, we introduced two more parameters to be estimated by the MLE: *ϕ*_{V}, the probability that a surveyed person with a prior infection reported receiving a positive viral PCR test, and *π*_{V}, the probability that a surveyed person with no prior infection did not report receiving a positive viral PCR test.

Some household members received a survey but no antibody test and other members received neither. The *M*_{kn} formula depends on the total household size *n*, which for many households includes individuals with missing data. For households with at least one but not all members infected (1 ≤ *k* ≤ *n* − 1) and in which less than *n* member were full participants, the likelihood formula required the probability that different portions of the *k* infected members were among those who were antibody tested or surveyed only. To arrive at our formula, we assumed that the antibody-tested and surveyed-only portion of a household were a random sample of household members with respect to their prior infection status. I.e., we assumed that those individuals in a participating household with and without prior infections were equally likely to participate in the study and equally likely to agree to antibody testing.

In all we have 7 variables to be estimated by MLE, encapsulated in the following vector **θ**:
The log likelihood of the dataset **y** described in Section 2.1 with variable set **θ** is then
To present the formula for ℒ(**y**_{i}|**θ**), the likelihood of a particular **y**_{i}, we first define the following quantities calculated from the core elements of **y**_{i} listed in Section 2.1:

*a*_{NNi}=*a*_{i}−*a*_{ppi}−*a*_{pNi}−*a*_{Npi}: number who reported no prior positive PCR result and received a negative antibody test*s*_{Ni}=*s*_{i}−*s*_{pi}: number who were surveyed, reported no prior positive PCR result, and did not receive an antibody test*q*_{i}=*n*_{i}−*a*_{i}−*s*_{i}: number untested for antibodies and not surveyed

Then we have:
In the formula, the function *A* quantifies the probability of observing the given set of test result combinations among antibody-tested people (*a*_{ppi}, *a*_{pNi}, *a*_{Npi}, *a*_{NNi}), given that (*u, v, w, x*) of them had a prior infection, respectively. E.g., *u* is the number of the *a*_{ppi} household member who had an infection (true positives), *v* is the number of the *a*_{pNi} household members who had an infection (true positive by PCR and false negative by antibody test), *w* is the number of the *a*_{Npi} household members who had an infection (true positive by antibody test and did not report a positive PCR test), and *x* is the number of the *a*_{NNi} household members who had an infection (false negative by antibody test and did not report a positive PCR test). The formula for *A* is
The function *f*_{m} (**r**; **p**) is the probability mass function for the multinomial distribution, where the number of trials is the sum of the elements of **r**, which are the number of infected or uninfected antibody-tested people who received each of the four possible test result combinations. The vector **p** contains the probability of each of the four test result combinations given that the person was infected (for **p** = **p**_{I}**)** or uninfected (for **p** = **p**_{U}**)**:
The first element of **p**_{I}, *ϕ*_{V} *ϕ*_{A}, is the probability that an antibody-tested person with a prior infection reported a positive PCR test (with probability *ϕ*_{V}) and also had a positive antibody test result (with probability *ϕ*_{A}). Note that *ϕ*_{A} represents the sensitivity of the antibody test, but *ϕ*_{V} includes both the sensitivity of the PCR test and the probability that an infected person actually sought and received a PCR test during the period of infection in which detectable virus was present and reported that positive test on our survey. Elements 2–4 of **p**_{I}are the probabilities that an antibody-tested, prior infected person reported a positive PCR test but tested negative for antibodies, did not report a positive PCR test and tested positive for antibodies, and did not report a positive PCR test and tested negative for antibodies, respectively. The elements of **p**_{U} are the corresponding probabilities for individuals with no prior infection.

The function *S* quantifies the probability of the survey-only data (*s*_{pi}, *s*_{Ni}) given that *y* of the *s*_{pi} individuals had a prior infection and *z* of the *s*_{Ni} individuals had a prior infection:
The function *f*_{b} (*q*; *r, p*) is the probability mass function for the binomial distribution, for *q* successes given that there were *r* independent trials with probability *p* for success of each trial.

The function *H*(*k, k*_{a}, *k*_{s}) in the likelihood equation is the probability that, when *k* of *n*_{i} individuals in the household were infected, *k*_{a} infected individuals were among the *a*_{i} individuals antibody tested and *k*_{s} infected individuals were among the *s*_{i} individuals surveyed but not antibody tested:
The function *f*_{h} (*b*; *c, d, e*) is the probability mass function of the hypergeometric distribution for the number *b* of infected people selecting to be antibody-tested or surveyed-only, given that there were *c* infected people and *d* uninfected people available for selection in the household, and *e* people were tested or surveyed-only. These terms account for individuals in households who received neither an antibody test nor a survey, who may have included infected individuals. Our use of the hypergeometric distribution led from our assumption that, if some members of the household had a prior infection and others didn’t, the antibody-tested / surveyed individuals were a random sample from the household with respect to their prior infection status.

### 2.4 Likelihood optimization and uncertainty

We minimized the log likelihood over the 7 unknown parameters (*p*_{c}, *p*_{h}, *d*_{h}, *ϕ*_{V}, *ϕ*_{A}, *π*_{V}, *π*_{A}) using the observations (*n, a, s, a*_{pp}, *a*_{PN}, *a*_{Np}, *s*_{p}) for each household, to produce the MLE: . We derived approximate confidence interval boundaries for an individual parameter *θ*_{i} using the likelihood ratio test, using the statistic 2 , where *θ* consists of *θ*_{i} freely varying and the other 6 elements of *θ* held at their optimal value. We defined a 95% confidence interval boundary where *θ*_{i} produces a value for this statistic equal to the 95^{th} percentile of the chi-squared distribution with 1 degree of freedom. We also plotted 2-dimensional confidence region boundaries for each of the 21 possible (*θ*_{i}, *θ*_{j}) parameter pairs by allowing each pair to vary freely together while holding the other 5 at their optimal values. We calculated the boundary in the (*θ*_{i}, *θ*_{j}) parameter plane where the likelihood ratio statistic equals the 95^{th} percentile of the chi-squared distribution with 2 degrees of freedom. To calculate P-values at which certain fixed parameter values could be rejected in favor of the MLE, we used the chi-squared distribution with degrees of freedom equal to the number of fixed parameters. Additionally, we developed a simulation model to produce synthetic data sets on which to test our likelihood model. We ran the simulation for the same number of households with the same sizes and participation rates for survey and antibody testing as in the actual data (fixed values of *n, a*, and *s* for each household). We randomized importations to households and simulated transmissions using the MLE values of the three epidemiological parameters, randomized survey and antibody test results using the MLE sensitivity and specificity values, and maximized the likelihood against the simulated data. We repeated this process for 500 simulated data sets and recorded the median estimated value of each variable, for comparison against the MLE value that generated the data. We also used the 500 sets of simulation-based estimates as a parametric bootstrap to generate 95% confidence estimates for each variable, for comparison against the intervals generated from the likelihood ratio test.

Finally, we tested an alternate model that allows the community acquisition probability to vary by household, such that some households may have a higher per-capita acquisition rate than others applied to each household member. To quantify this probability in the alternate model, we employed the beta-binomial distribution for the number of community acquisitions in a household of a given size (see Supplementary Methods).

### 2.5 Household transmission variability

We quantified the implications of our household transmission variability estimates by calculating the probability of transmission extremes, compared to those produced by the classic binomial transmission model (*d*_{h} = ∞). Specifically, we calculated the probability that an initially infected individual transmits to no one or everyone in households of sizes from 2 to 10. For households of size *n*, the probability of no transmissions from the index infection is *F*_{0,n−1}(*p*_{h}, *d*_{h}) and the probability the index person transmits directly to the entire household is *F*_{n−1,n−1}(*p*_{h}, *d*_{h}). We used our overall MLE values for and to calculate these values for each *n*, with confidence intervals using our parametric bootstrap results. For comparison to the binomial model we applied *d*_{h} = ∞, paired with the alternate MLE of *p*_{h} under that constraint.

We also calculated an example of a dynamic transmission model that produces a distribution of household transmission probabilities close to that produced by our MLE beta distribution, using the method of moments. Specifically, if an infected person’s duration of infectiousness is assumed to be fixed and transmissibility to a housemate is modeled as a gamma distribution with shape parameter *k*, then we solve for the value *k* that produces the same mean and variance for the transmission probability as that of the beta distribution with mean *p*_{h} and dispersion *d*_{h} (Supplementary Methods). We solved for *k* using our MLE and values, and we derived a confidence interval for *k* using the pairs of (*p*_{h}, *d*_{h}) estimates from our parametric bootstrap analysis.

### 2.6 Within-household reproduction number

We calculated the within-household reproduction number *R*_{h}, defined as the expected number of household transmissions directly from a community acquirer with all fellow household members susceptible:
where *μ* and σ^{2} are the mean and variance of the household size distribution, and *p*_{h} is the secondary attack rate as determined by our MLE. This equation for *R*_{h} is derived as in Ball et al. [30] and detailed in the Supplementary Material.

Additionally, we derived an alternate household reproduction number defined as the expected total number of transmissions in the household of an infected person who acquired infection in the community and has no initially non-susceptible housemates. This differs from *R*_{h} in that it counts all potential downstream transmissions in the household stemming from the index community acquirer. The formula for , derived in the Supplementary Material, is
To investigate the implications of household transmission for population-wide transmission control, we use a threshold condition delineating subcritical and supercritical transmission in the population. Supercritical transmission occurs when , where *R*_{c} is the average number of community (non-household) transmissions from an infected person. We derive this formula in the Supplementary Material, following Ball et al. [30]. We estimated *R*_{h}, , and the threshold value for *R*_{c} by applying our MLE estimates of *p*_{h} and *d*_{h} to the above formulas and their confidence intervals by applying the (*p*_{h}, *d*_{h}) pairs from each parametric bootstrap estimate.

## 3. Results

### 3.1 Data summary

We compiled data from 9,383 households. Of these, we retained 9,091 (96.9%) for use in the MLE. The 292 excluded households were removed because the household size was unknown (192) or the reported household size was less than the number of people tested or surveyed in the house (100). In the 9,091 retained households, there were 27,998 (3.08 per household) reported household members, 13,712 (1.51 per household) people who were both surveyed and antibody tested, and another 5,262 (0.58 per household) who were surveyed but not antibody tested.

Of the 13,712 antibody tests in the retained households, 170 (1.24%) were positive. Of those 170 people with a positive antibody test, 55 (32.4%) reported receiving a positive PCR test. Of the 18,974 people who were antibody tested or surveyed only, 117 (0.62%) reported receiving a positive PCR test. This broke down to 0.53% (72 / 13,712) for those who were antibody tested and 0.86% (45 / 5,622) for those who were surveyed but not antibody tested. The rate of testing positive for antibodies among those reporting a prior positive PCR test was 76.4% (55 / 72). The rate of survey participants agreeing to antibody testing was lower for those who reported a prior positive PCR test compared to those who did not: 61.5% (72 of 117) vs. 72.7% (5217 of 18,857), a small but statistically significant (P < 0.01) difference in proportion. Of the retained households, 189 (2.1%) had at least one household member who either tested positive for antibodies or reported a prior positive PCR test. There were 157 households with exactly 1 positive member (by either antibody test or reported PCR test or both), 25 households with 2 positives, 5 with 3 positives, 1 with 4 positives, and 1 with 6 positives. In all, there were *C* = 270 unique y_{i} vectors representing household data described in section 2.1.

The crude SAR measure derived from antibody testing only (fraction of antibody-tested housemates of antibody-positive household members who were also antibody positive) was 13.7% (26 / 190). The crude SAR estimate from reported PCR data only (fraction of surveyed housemates of people reporting a positive PCR test who also reported a positive PCR test) was 22.6% (30 / 133). When combining both types of data, the crude SAR estimate (fraction of surveyed / tested housemates of any antibody-positive or reported PCR-positive person who were positive by either or both measures) was 14.8% (43 / 290).

### 3.2 Maximum likelihood estimates

The MLE for *p*_{c}, the per-person community acquisition probability from outside the household, was 0.38% (0.30% 0.48%). For within household transmission probability, the MLE produced an average secondary attack rate estimate *p*_{h} = 35% (26% – 47%). The MLE for the dispersion parameter *d*_{h}, quantifying variability in transmissibility by person, was 0.54 (0.06 – 2.6). The boundary case *d*_{h} = ∞, representing the classic binomial household transmission model with no variability in individual infectiousness [27], could be rejected with P = 0.002.

Our MLE result for *ϕ*_{V}, the probability that a surveyed person with a prior infection reported testing PCR-positive, was 76% (66% – 86%). For *ϕ*_{A}, the probability that a prior-infected person’s antibody test was positive, the MLE was 85% (75% – 93%). For *π*_{V}, the probability that a surveyed person with no prior infection reported no PCR-positive test, the MLE was 99.93% (99.87% – 99.98%). For *π*_{A}, the probability that an antibody-tested person with no prior infection tested negative for antibodies was 99.3% (99.1% – 99.4%).

When optimizing the likelihood equation against 500 synthetic data sets simulated using the MLE variable assumptions, the median estimates of each parameter were very close to the MLE values (Table 1). The confidence intervals derived from these bootstrap estimates were similar to those derived from the likelihood ratio test, though the bootstrap intervals were somewhat wider for the three parameters governing importation and transmission. Likewise, the likelihood ratio-based intervals reported in Table 1 expanded modestly when we calculated 2-dimensional confidence regions based on each pair of estimated parameters, with most regions exhibiting close to symmetric shapes around the MLE (Supplemental Figures). Notably, the 95% confidence regions involving the transmission dispersion parameter *d*_{h} can extend to the high-variability boundary *d*_{h} = 0, a result that is also reflected by the fact that the MLE for the model with fixed *d*_{h} = 0 cannot be rejected with 95% confidence (P = 0.09) (Table 2).

Our alternate model that employed a beta-binomial distribution for the number of household acquisitions, using a new dispersion parameter *d*_{c} estimated as an additional variable in the MLE, found *d*_{c} = 2.4 (0.95 – 9.6), with somewhat altered estimates of the other parameters (Table S1) compared to those in Table 1. However, the log likelihood of the model in Table 1, which is equivalent to the alternate model with *d*_{c} = ∞, is sufficiently close to that of the alternate model that *d*_{c} = ∞ cannot be rejected by the likelihood ratio test and is favored by the Bayesian information criterion. However, if overdispersion in household community acquisitions does occur, the uncertainty ranges of the transmission variables *p*_{h} and *d*_{h} become large (see Supplementary Results).

### 3.3 Household transmission variability

We quantified the implications of our key finding of high transmission variability within households of persons infected with COVID-19 by calculating the probability of transmission extremes. Compared to our overall MLE, the classic binomial transmission model (*d*_{h} = ∞) produced a similar average secondary attack rate estimate of *p*_{h} = 31% (24% – 41%). However, the binomial model produces substantially lower probabilities that an infected individual transmits to no one or everyone in larger households (Table 3).

For example, our MLE model estimates that an infected member of a 8-member household would have a 44% (16% 68%) chance of transmitting to no one, but a 17% (1% – 44%) chance of transmitting infection directly to all 7 housemates. By contrast, the no-variability binomial model estimate would be substantially lower for each extreme: 7% (3% – 15%) chance of transmitting to no one and 0.03% (0.004% – 0.2%) chance of transmitting to everyone (Table 3).

We calculated an example of a dynamic transmission model that would produce the same mean and variance of a person’s transmission probability to a household member that is produced by our MLE beta distribution. If an infected person’s duration of infectiousness is assumed to be fixed and transmissibility to a housemate is modeled as a gamma distribution with shape *k*, then *k* = 0.21 (95% CI 0 – 1.3) when the mean and variance are matched, regardless of the infectious duration (Supplementary Methods). This estimate of *k* is comparable to the dispersion parameter *k* of the negative binomial distribution commonly used to characterize overall variability in the number of transmissions from individuals, which can be derived from the Poisson distribution with a mean that is gamma-distributed with shape parameter *k* [17].

### 3.4 Within-household reproduction numbers

Our estimate of the household reproduction number *R*_{h}, the expected number of household transmissions from a community acquirer with no other infected fellow household members, depends on our estimate of *p*_{h} and the mean *μ* and variance σ^{2} of the household size distribution. From our data we found *μ* = 3.08 and σ^{2} = 3.13, so our estimate is *R*_{h} = 1.09 (0.76 – 1.49). Our estimate of the alternate household reproduction number , the expected total number of transmissions in the household of a community acquirer, is .

The supercritical threshold for *R*_{c}, the average number of non-household transmissions by an infected individual, is approximated by (see Methods section 2.5 and Supplementary Material). Using our estimate for in Utah, this formula suggests that *R*_{c} must be kept below approximately 0.40 (0.32 – 0.51) to avoid increasing growth of COVID-19 infections in the population.

## 4. Discussion

The key findings of our analyses stem from our simultaneous estimation of the average and variability of SARS-CoV-2 household transmission, household importation, and test data accuracy. Our novel combination of those interacting features within our model revealed two important epidemiological insights. First, we found that accounting for test error, especially the specificity of the serological antibody test, produced a substantially higher estimate for the household secondary attack rate. Second, we found evidence of substantial variability of transmissibility within households, which has important implications for understanding broad transmission patterns and mitigation strategies.

An important implication of the first finding is that assuming perfect test accuracy may be a source of underestimation for the household secondary attack rate in other studies. Our maximum likelihood estimate was 35% (26% – 47%), which is higher than recent pooled estimates of 17–19% from the most recent meta-analyses of worldwide household studies [7, 8]. These and other published studies have generally estimated the secondary attack rate by a simple calculation of the fraction of tests that were positive among household contacts of known cases. When we applied that calculation to our combined data, we found a crude secondary attack rate estimate of 14.8%. We traced the major source of this substantial underestimate to the assumption of perfect test specificity inherent in the crude formula. When we ran our MLE under the assumption of perfect specificity (no false positives) for the antibody test, the result for secondary attack rate was 16.6%, much closer to the crude estimate. Notably, the estimate of community acquisition probability when assuming perfect specificity was substantially *higher* than our overall MLE: 1.2% vs. 0.4%. Thus, our model suggests that allowing for false positives can shift the attribution of infections toward household transmissions and away from acquisitions outside the household.

Of note, we did not make independent assumptions for the serological test specificity or other test accuracy measures, but instead optimized them simultaneously with the epidemiological parameters in our likelihood model. For the antibody test specificity, our MLE of 99.3% (99.1% – 99.4%) was within the uncertainty range of the test manufacturer’s estimate of 99.6% (99.0% – 99.9%) based on 4 positive tests from 997 samples collected prior to September 2019 [31]. Thus, our model independently produced a plausible estimate of the test specificity; if instead we had assumed the manufacturer’s point specificity estimate directly, our secondary attack rate estimate would be 30%, still substantially higher than the crude estimate.

Our second major finding of overdispersion of household transmission stemmed from our use of the beta-binomial distribution to quantify the number of household transmissions from infected individuals. We quantified individual-level variability in transmissibility using a dispersion parameter *d*_{h}, and the optimal value occurred at low dispersion (high variability; *d*_{h} = 0.54). The more commonly used binomial model, a special case of our model at minimal variability (*d*_{h} → ∞), was rejected, suggesting that transmission patterns are not well captured by that simplifying assumption.

Our dispersion parameter estimate is not directly comparable to another commonly used dispersion parameter, often named *k*, that characterizes variability in the total number of transmissions (whether household or not) from each infected person as a parameter of the negative binomial distribution [17]. We attempted to relate *d*_{h} to *k* in the context of simple model in which the only source of variability is a person’s transmissibility per unit time in contact with others, finding *k* = 0.21. This estimate is similar with point estimates for SARS-CoV-2 of *k* = 0.1 [22], *k* = 0.25 [23], and *k* = 0.33 [24]. This similarity perhaps suggests that variability in infectivity per time is a major driver of overall transmission variability for SARS-CoV-2. This could be consistent with findings that viral shedding is highly variable by individuals with SARS-CoV-2 infections, both during asymptomatic and symptomatic phases of disease, suggesting that heterogeneous transmissibility may be largely explained by overdispersion in levels of viral shedding by individuals [32]. However, other studies suggest that SARS-CoV-2 transmission overdispersion in the wider population beyond households may be less driven by biological heterogeneity and more by heterogeneous social contact behavior [33].

The level of within-household transmission variability captured by the parameter *d*_{h} affects the contribution of household transmission toward threshold levels of overall transmission. Threshold conditions are often expressed using a reproduction number (*R*), the average number of transmissions from each infected person. The average number of household transmissions directly from an initially infected household member (*R*_{h}) is independent of *d*_{h}, but *d*_{h} does affect the average number of household transmissions in the next generation, i.e. by someone who acquired infection from a housemate. When transmission variability is higher, the household transmission potential of a household acquirer is lower, reducing to zero in the “all-or-nothing” limit *d*_{h} = 0. To capture this effect, we introduced an alternate reproduction number , which is the average number of total household transmissions after the initial introduction, when final household outbreak size has been reached.

Neither *R*_{h} > 1 nor are sufficient threshold conditions for epidemic growth, which requires some level of between-household transmission to be sustained. Given our estimate of , we can estimate the critical value of *R*_{c}, the average number of non-household community transmission that would push transmission for the population above the supercritical threshold for a growing epidemic, with the threshold condition . Thus, we estimate that *R*_{c} must be kept below approximately 0.40 (0.32 – 0.51) to avoid a growing epidemic in Utah if household transmission continues to be well characterized by our model. As this result depended on the average household size in our data, it is notable that Utah has the highest state-average household size in the United States. The average household size in Utah is 3.1, about 20% higher than the national average household size. Thus, our *R*_{h} estimate may be high compared to other locations. A lower value of *R*_{h} would lead to a higher threshold value for *R*_{c}.

Our findings also provide further insight for interpretation of COVID-19 test results. In addition to the serological test specificity discussed above, our model also produced estimates of the sensitivity of this test. The MLE of *ϕ*_{A} = 85% (75% – 93%) is difficult to compare to the manufacturer’s data, which were based on testing symptomatic, PCR-positive people for antibodies. Overall, 109 of 122 (89%) of those tests were positive for antibodies, but the results were highly dependent on the number of days post-symptom onset at which the serological sample was taken [31]. Because the symptom histories of the antibody-tested people in our data is largely uncertain, we cannot evaluate whether our result is consistent with the manufacturer’s data, or whether our result can be reliably applied to other data.

We also produced estimates of the sensitivity and specificity of our survey data with respect to the probability that a reported prior-positive PCR test (or lack thereof) reflects the existence (or lack) of a prior infection. Our result of *ϕ*_{V} = 76% (66% – 86%) is our estimate of the percentage of individuals with SARS-CoV infections who tested PCR-positive test during their infection and were willing to report it on our survey. This may be a high identification fraction compared to other areas of the U.S.: one study estimated that less than 60% of symptomatic cases in the U.S. were identified during February-June 2020 [34]. Our result may reflect unusually successful case ascertainment efforts in Utah during the Spring and early Summer of 2020, perhaps partly owing to slower emergence compared to other regions. Notably, we found that a model assuming perfect specificity of the antibody test (*π*_{A} = 100%) produced a dramatically lower estimate of *ϕ*_{V} = 37% (Table 2), which suggests that crude estimates based on serology data may underestimate case ascertainment without considering false positives.

Finally, our estimate for very high specificity *π*_{V} = 99.93% (99.87% – 99.98%) of survey-reported PCR testing reflects the low probability of PCR tests producing false positives. It is possible that some false positives in our survey data occurred by erroneous reporting, i.e. survey respondents reporting a prior positive PCR test that did not occur, rather than via errors in testing procedure. Even though our MLE for this parameter was in excess of 99.9%, we found that an alternate model assuming *π*_{V} = 100% produced notably different estimates of some of the other parameters, which suggests that studies producing epidemiological estimates relying on a 100% PCR specificity assumption should test robustness of conclusions to small deviations from that assumption.

This study has several limitations. Our estimate of high household transmission variability may not be robust to alternate assumptions for the way community acquisition risk varies by household. For example, some households could have been comprised of families with both parents working essential jobs during Spring/Summer 2020, with children attending in-person day care or camps, thus placing the entire household at much higher risk of community acquisition compared to households working / caring for children at home. Also, households could have high collective community acquisition probability via attending multi-household gatherings of extended family or other social groups. In these ways, households conceivably could vary considerably in their infection numbers for reasons that don’t involve within-household transmission.

We tested the implications of this alternate possibility for household variability in our model by allowing variability in community acquisition by household using an additional dispersion parameter to the MLE model (Supplementary Results). Interestingly, the MLE for the transmission dispersion parameter *d*_{h} still occurred at high variability in household transmission under this alternate model. Furthermore, the improvement in likelihood was not substantial, such that the more complicated model would not be favored by the likelihood ratio test nor the Bayesian information criterion. However, larger uncertainty ranges under the alternate model suggest that we may not be able to definitively rule out the possibility that variability in community acquisition risk by household plays a substantial role in explaining overall variability in household infection numbers.

It is also possible that household transmission variability could be driven by properties of households such as contact behavior, underlying health composition of household members, physical properties of the domicile such as size and ventilation, or other properties that could increase transmission risk of all household members together. Possible variability in person-to-person transmission probability by household, rather than by individual, is not accounted for in our model. Using a beta distribution for this probability across different households to arrive at an alternate final size distribution would require integrating the beta distribution over the full final size distribution equationsnproduced by the binomial-chain model, which would be complicated for larger households. Alternatively, one could model a functional relationship between observed properties of a household in the dataset and its average transmission probability, while retaining dispersion occurring at the individual level. We have not attempted this with our data; we suspect that the sample size of outbreaks in households with a given feature would not be large enough to draw meaningful conclusions.

Another limitation lies in our potentially inaccurate assumptions used to quantify the probability of prior infections among those with missing data within participating households. Most non-participating individuals within participating households were children under 12, who were not offered antibody tests nor surveyed. Thus, our model carries the assumption that children under 12 have equal community acquisition rates, susceptibility to acquisition from another household member, and transmissibility to other household members compared to study participants. Our assumption is consistent with studies finding similar transmission rates to and from children compared to adults. In a study of COVID-19 clusters linked to day care centers within our study area in Utah [35], 42% of the cases occurred in children, who represented 60% of the people with epidemiological contacts to the facilities. The infected children (median age 7) transmitted infection to at least 26% of their non-facility contacts, close to our household estimate. Another study found that children under 10 in China were as likely to be infected as adults [36]. However, other studies suggest that children may be less likely to acquire infection than adults [37], and one study found very low household secondary attack from infected children in South Korea [38]. A study similar to ours found lower rates of importation and household acquisition among children aged 5-9 compared to older groups, although confidence intervals overlapped [15]. If substantial differences existed between children under 12 and our study participants, one or more of our estimates could be biased.

In addition, many eligible participants older than 12 chose not to participate, either declining the serological antibody test only (but still filling out a survey) or declining to participate at all. Comparing full participants to survey-only participants, we found that participants reporting a prior positive PCR were less likely to agree to antibody testing, though the difference was not large (61.5% vs. 72.7%). It is unknown whether a prior confirmed or suspected infection affected eligible household members’ decision to agree or decline to fill out a survey.

We also have not adjusted for potential biases related to non-participation rates of entire households that were selected and approached for inclusion in the study. Our data collection included a complicated sampling design across several different strata, and weights were introduced partly to account for different rates of nonresponse across the different strata. For simplicity we ignored these details and sampling weights for the analysis presented here. Thus, households with higher COVID-19 risk may be overrepresented or underrepresented in our data relative to their frequency in the broader population of households with Utah.

Although these potential limitations, which also exist for other analyses of household transmission from serological data [13-15], remain in our analysis, we believe our model has addressed other limitations of existing models that may be more substantial. Our improvements to household secondary attack rate estimates, including factoring out non-household community acquisitions and tertiary transmissions, inclusion of overdispersion estimates, and careful consideration of the impact of imperfect test sensitivity and specificity, have produced improved insights into this important measure.

In conclusion, we found evidence of a relatively high secondary attack rate and high overdispersion in transmission of SARS-COV-2 in Utah households during a time when overall community prevalence was low. Other published household secondary attack rates may be underestimated without accounting for imperfect test sensitivity and specificity. Controllability of the virus may depend on mitigating transmission from a minority of highly infectious individuals in large households and other household-like locations where several people congregate indoors for extended periods.

## Data Availability

Data necessary to reproduce the results in the manuscript are publicly available.

## Supplemental Methods

### Alternate model with variability in household importation

For the alternate model, the formula for *M*_{kn} for a given household size *n* ≥ 2 becomes
For households of size *n* = 1, *M*_{01}(*p*_{c}, *d*_{c}) = *F*_{01}(*p*_{c}, *d*_{c}) and *M*_{11}(*p*_{c}, *d*_{c},) = 1 − *F*_{01}(*p*_{c}, *d*_{c}). The function *F*_{yz} (*p, d*) is defined in the main text (probability mass function of the beta-binomial distribution with boundary case definitions at *d* = 0 and *d* → ∞), where in this case *y* is the number of community acquisitions and *z* is the total number of household members. The main-text model is a special case of this alternate model, with *d*_{c} → ∞.

The likelihood equation is the same as in the main text, but with the additional element *d*_{c} in the vector θ of variables to be optimized:
and *M*_{kn}(*p*_{c}, *p*_{h}, *d*_{h}) in the likelihood equation is replaced with *M*_{kn} (*p*_{c}, *d*_{c}, *p*_{h}, *d*_{h}) as defined above.

We found the MLE and single-parameter confidence intervals using the same procedure described in the main text, and further assessed the uncertainty the estimate of *d*_{h} by solving for the MLE of the other 7 variables when fixing it at its boundary values 0 and ∞. We also compared the likelihood at the MLE of the alternate model to that of the main text model using the likelihood ratio test, to determine whether the main text model result could be rejected in favor of the alternate model by this criterion. As an additional comparison, we used the Bayesian information criterion to score the alternate model against the main text model, using *n* = 9091 as the number of data points (number of households) and *K* = 7 parameters for the main-text model and *K* = 8 parameters for the alternate model.

### Beta-binomial distribution at limits of dispersion parameter: d → ∞ and d → 0

Our likelihood equations make use of the beta-binomial probability distribution, parameterized with an average probability *p* and a dispersion parameter *d*. The probability mass function *F* for positive, finite values of *d* is
We use *F*_{yz} (*p, d*) to quantify the distribution of household transmissions directly from a single infected household member, where *y* is the number of transmissions, *z* is the number of susceptible household members, *p* = *p*_{h}, and *d* = *d*_{h}. In our alternate model we also use *F*_{yz} (*p, d*) to quantify the distribution of community acquisitions among members of a household from non-household members, where *y* is the number of community acquisitions, *z* is the total number of household members, *p* = *p*_{c}, and *d* = *d*_{c}.

Here, we derive the formula for *F*_{yz} (*p, d*) at the boundaries of the range of possible values for *d*: *d* → ∞ and *d* → 0. To do this, we rewrite *F*_{yz} (*p, d*) in an alternate form. First, using the property B(*x, y*) = Γ(*x*)Γ(*y*)⁄Γ(*x* + *y*):
Then using the property, for positive integer *n*, Γ(*z* + *n*) = *z*(*z* + 1) ⋯ (*z* + *n* − 1)Γ(*z*):
In the following we rewrite the numerators and denominators in powers of *d*. Only the lowest and highest powers of *d* will matter for taking the limits that follow, so the other terms within “⋯” are not shown.
When taking the limit *d* → ∞, we note that each fraction has highest order *d*^{z} in both numerator and denominator, so the limit will be the ratio of the coefficients of that term:
When taking the limit *d* → 0, we note that the fraction for cases *y* = 0 and *y* = *z* have lowest order term *d* in both the numerator and denominator, so the limit will be the ratio of the coefficients on those terms. The fraction for the 0 < *y* < *z* case has only powers of *d*^{2} and higher in the numerator, and a nonzero *d* term in the denominator, so the limit is 0:

### Direct transmission probabilities H_{xyz}

Here we derive the formulae for *H*_{xyz}: the probability of *y* transmissions to *z* susceptible household members directly from *x* infected members

As described in the main text, we first define the probabilities for transmissions directly from *x* = 1 infected household member:
Next consider *z* = 2 infected household members. The probability that *y* = 0 transmissions occur is the probability that both infected members transmit to 0 others: *H*_{20z} = *H*_{10z}*H*_{10z}. To calculate the probability that *y* > 0 transmissions occur from the two infected members, it is convenient to consider the two infected individuals having transmission opportunities in sequence, say A followed by B. If A transmits to any household members, this reduces the number of susceptible members remaining for B to infect. For example, the probability that *y* = 1 is the probability that A transmits to 0 of *z* and B transmits to 1 of *z*, plus the probability that A transmits to 1 of *z* and B transmits to 0 of *z* − 1 remaining susceptible members: *H*_{21z} = *H*_{10z}*H*_{11z} + *H*_{11z}*H*_{1,0,z−1}. The calculation follows a similar pattern for *y* = 2: *H*_{22z} = *H*_{10z}*H*_{12z} + *H*_{11z}*H*_{1,1,z−1} + *H*_{12z}*H*_{1,0,z−2}. It follows that:
Now for *x* = 3 infected household members, we can use the fact that we have already calculated *H*_{2yz}, which covers the transmission probabilities from two of the three infected members, and then we include the probability that third member transmits to any remaining susceptible members that the first two did not infect:
Following this pattern, we continue calculating *H*_{xyz} for each *x* in increasing sequence:

### Total transmission probabilities T_{xyz}

Here we derive the formulae for *T*_{xyz}: the probability of *y* total transmissions to *z* initially susceptible members from *x* initially infected members. In other words, *T*_{xyz} is the probability that the final household outbreak size is *x* + *y*, given that *x* household members were initially infected and *z* household members were initially susceptible.

First, we note that *T*_{x0z} = *H*_{x0z} for all possible (*x, z*) pairs, because if the initial *x* infected members do not transmit to anyone (*y* = 0), the household outbreak is over and the final size has been reached. Next we consider the probability of *y* = 1 total transmissions, which occurs when the initial *x* infected members transmit directly to 1 other (with probability *H*_{x1z}), who then does not subsequently transmit to any of the remaining *z* − 1 susceptible members (with probability *H*_{1,0,z−1} = *T*_{1,0,z−1}). Hence,
For *y* = 2 total transmissions, we must include the probability that the initial *x* infected members transmit directly to 2 others who then transmit to none and the probability that the initial *x* infected members transmit directly to 1 other who then produces an outbreak among the remaining susceptible members with 1 total transmission:
Following similar logic for *y* = 3, and making use of the *T*_{x0z}, *T*_{x1z}, and *T*_{x2z} values already calculated, we arrive at:
The general formula calculated for increasing values of *y* is:

### Within-household reproduction numbers and threshold condition

We define the within-household reproduction number *R*_{h} as the expected number of household transmissions directly from an infected person who acquired infection in the community and has no non-susceptible housemates. Let *h*_{i} be the fraction of households with size *i*, up to a maximum size *N*. Then the mean *μ* and variance σ^{2} of the household size distribution are
Let *c*_{i} be the probability that a randomly chosen person has *i* housemates. The probability that a randomly chosen person lives in a house of *total* size *i* (including themselves) is *ih*_{i}⁄*μ*, so
Then *R*_{h} is *p*_{h} times the mean number of housemates of a randomly chosen person:
We also define an alternate within-household reproduction number as the expected total number of transmissions in the household of an infected person who acquired infection in the community and has no initially non-susceptible housemates. Given that a person acquiring infection in the community has *i* susceptible housemates, the probability that *j* of their housemates become infected before the household outbreak terminates is *T*_{1ji}(*p*_{h}, *d*_{h}), as defined in 2.2 of the main text. The expected total number of transmissions in their household will be then be. So, the household reproduction number formula is:
For the high-variability boundary case at *d*_{h} = 0, we have that
because *T*_{1ji}(*p*_{h}, 0) = *p*_{h} when *j* = *i* and 0 for other nonzero values of *j* (reflecting all-or-nothing transmission). It follows that when *d*_{h} = 0. This makes intuitive sense because in the all-or-nothing scenario, when the index person transmits, all in the household are infected directly, and there is no one left to infect in subsequent generations, so the final household outbreak size is entirely reflected in *R*_{h}.

We next investigate the implications of our estimate for population-wide transmission control. The threshold condition delineating subcritical and supercritical transmission in the population occurs when the maximal eigenvalue of the matrix
exceeds one [30]. Here, *R*_{c} is defined as the average number of community transmissions per infected individual (i.e., average number of transmissions to people not in the infected individual’s household). The occurrence of *R*_{c} in both rows of the matrix reflects an assumption that its value applies to the transmissibility of people who acquire their own infection in the community and in their household. The zero element in the lower-right corner of the matrix reflects the fact that the people on average who acquire infection in their household do not transmit further in their household, by definition, because was derived from the final household outbreak size equations encompassed in *T*_{xyz}.

The maximal eigenvalue exceeding one produces the following threshold condition: This is equivalent to

### Relationship between beta distributed probability and dynamic transmission parameters

If an infected person’s duration of infectiousness is *τ* and the transmission rate to a contact is *β*, the probability that transmission to the contact occurs is *p* = 1 − *e*^{−βτ}. We assume *τ* is fixed and *β* is a gamma distributed random variable with shape *k* and rate *r*. Then, the first and second moments of the random variable *p* are:
The variance is then
We then equate the mean and variance to those of the beta distribution with mean *p*_{h} and dispersion *d*_{h}, which we used in our MLE model in the main text.
Combining those two equations yields
We solved the first equation for *k*, which is independent of the assumption for *τ*, using our MLE estimates of *p*_{h} and *d*_{h}. We applied each of the (*p*_{h}, *d*_{h}) pairs from our parametric bootstrap analysis to this equation to derive the confidence interval for *k*.

## Supplemental Results

The alternate model produced an estimate for the new dispersion parameter *d*_{c} = 2.4 (0.95 – 9.6) and altered estimates for the other 7 parameters compared to their values for the main text model (Table S1). The log likelihood at this MLE was about 1 greater than the log likelihood produced by the main text result (Table S2), suggesting that the main text model (equivalent to the alternate model with *d*_{c} = ∞) cannot be rejected with high confidence in favor of the alternate model by the likelihood ratio test (P = 0.16). The Bayesian information criterion (BIC) for the alternate model is 2358.8 compared to 2351.6 for the main text (*d*_{c} = ∞) model, which favors the main text model with a BIC difference of 7.1.

The conclusion of high household transmission variability from the main-text model is consistent under this alternate model, with the MLE occurring at low value of the dispersion parameter *d*_{h} = 0.30, and the low-variability binomial model *d*_{h} = ∞ can be rejected with P = 0.02. However, uncertainty ranges become wider at higher levels of overdispersion in household risk of community acquisition. This is illustrated by the fact that a model assuming no household transmission (*p*_{h} = 0), i.e. all household cases explained by acquisitions outside the households with high overdispersion (*d*_{c} = 0.5), cannot be rejected with very high confidence (P = 0.068). Thus, the alternate explanation for the distribution of household cases may not be definitively ruled out by our data.

Figures S1-S7. Two-dimension confidence regions for each pair of estimated parameters.

Solid curves are the 2-dimensional confidence regions derived from the likelihood ratio test, comparing the likelihood ratio statistic to the 95^{th} percentile of the chi-squared distribution with 2 degrees of freedom. Large solid circle is the MLE estimate and dashed lines are the confidence intervals for each individual parameter derived from the likelihood ratio test (Table 1 main text). Small dots are the MLE estimates from each of 500 simulated data sets generated using parameter values set at the MLE from the actual data (parametric bootstrap).