## Abstract

**Introduction and Goals** SARS-CoV-2 is transmitted both in the community and within households. Social distancing and lockdowns reduce community transmission but do not directly address household transmission. We provide quantitative measures of household transmission based on empirical data, and estimate the contribution of households to overall spread. We highlight policy implications from our analysis of household transmission, and more generally, of changes in contact patterns under social distancing.

**Methods** We investigate the household secondary attack rate (SAR) for SARS-CoV-2, as well as *R*_{h}, which is the average number of within-household infections caused by a single index case. We identify previous works that estimated the SAR. We correct these estimates based on the false-negative rate of PCR testing and the failure to test asymptomatics. Results are pooled by a hierarchical Bayesian random-effects model to provide a meta-analysis estimate of the SAR. We estimate *R*_{h} using results from population testing in Vo’, Italy and contact tracing data that we curate from Singapore. The code and data behind our analysis are publicly available^{1}.

**Results** We identified nine studies of the household secondary attack rate. Our modeling suggests the SAR is heterogeneous across studies. The pooled central estimate of the SAR is 30% but with a posterior 95% credible interval of (0%, 67%) reflecting this heterogeneity. This corresponds to a posterior mean for the SAR of 30% (18%, 43%) and a standard deviation of 15% (9%, 27%). If results are not corrected for false negatives and asymptomatics, the pooled central estimate for the SAR is 20% (0%, 43%). From the same nine studies, we estimate *R*_{h} to be 0.47 (0.13, 0.77). Using contact tracing data from Singapore, we infer an *R*_{h} value of 0.32 (0.22, 0.42). Population testing data from Vo’ yields an *R*_{h} estimate of 0.37 (0.34, 0.40) after correcting for false negatives and asymptomatics.

**Interpretation** Our estimates of *R*_{h} suggest that household transmission was a small fraction (5%-35%) of *R* before social distancing but a large fraction after (30%-55%). This suggests that household transmission may be an effective target for interventions. A remaining uncertainty is whether household infections actually contribute to further community transmission or are contained within households. This can be estimated given high-quality contact tracing data.

More broadly, our study points to emerging contact patterns (i.e., increased time at home relative to the community) playing a role in transmission of SARS-CoV-2. We briefly highlight another instance of this phenomenon (differences in contact between essential workers and the rest of the population), provide coarse estimates of its effect on transmission, and discuss how future data could enable a more reliable estimate.

## 1 Introduction

Lockdowns and stay-at-home orders have been deployed across the world to curb the growth of SARS-CoV-2, the virus that causes COVID-19. These policies reduce transmission by reducing overall mobility.

Since the introduction of lockdowns, community mobility has decreased by 28% −70% (Google, 2020), and the effective reproduction number *R* has decreased by an average of 48% (IQR 38-65%) across 16 US states (shown in Figure 4). At the same time, these policies have increased time spent at home and therefore likely increased household transmission of SARS-CoV-2: Google (2020) reports 15% −29% increases in time spent at home in these states. In the UK, Shanghai, and Wuhan, the majority of social contacts occurred in households after lockdown. The estimated proportion of contacts is 58% for the UK (up from 34% pre-lockdown from Jarvis et al. (2020)) and 79% and 94% for Shanghai and Wuhan respectively (Zhang et al., 2020). Household transmission played an important role in previous epidemics such as the 2009 H1N1 pandemic (Endo et al., 2019; Tsang et al., 2016; Lau et al., 2012). It may now also be an important factor in SARS-CoV-2 transmission, especially since community transmission has decreased and household transmission has increased under stay-at-home policies.

Our aim is to estimate household transmission and determine its overall contribution to growth of SARS- CoV-2. We do so both via a meta-analysis of previous studies, a direct analysis of data collected in Vo’, Italy (Lavezzo et al., 2020), and analysis of contact tracing data that we curated from Singapore (Singapore COVID-19 Dashboard, 2020). We correct these data for false negatives and in some cases for selective testing of only symptomatic cases. Code for our analysis, together with the Singapore dataset, are available at https://github.com/andrewilyas/covid-household-transmission.

We estimate that household transmission contributes between 0.22 and 0.42 infections to the overall reproduction number. In comparison, overall reproduction numbers during stay-at-home policies ranged from 0.65 to 1.25. We conclude that household transmission is now a significant contributor to overall SARS-CoV-2 reproduction (see Figure 5 for region-specific estimates).

We next examine two folk theories that have sometimes been used to argue against attending to household transmission. The two theories are that preventing transmission within a household is either (a) futile, since transmission is inevitable or (b) low-impact, since secondary infections are “contained” in the household (Horowitz, 2020):

*Household inevitability*: No interventions to reduce household transmission are likely to work.*Household containment*: People infected at home will stay at home and not cause community spread.

Our meta-analysis of the household secondary attack rate yielded a central estimate of 30% and estimates for individual studies ranging from 11% − 55%. This implies that household infection is not a certainty and can be mediated by differences in setting, which casts doubt on inevitability. In Section 4.2 we discuss policies for reducing household infection. More generally, our estimates imply that infection is not certain even among individuals who are in regular contact.

We formalize the household containment theory in terms of ratios of secondary transmission, and show that these can be estimated from well-annotated contact tracing data. Unfortunately, existing data are too noisy to reliably estimate containment. In Section 4.2 we outline how these rates can be approximated from aggregate data, and suggest future work for estimating them directly.

Our results underscore that mobility is an imperfect measure of infection risk, since decreases in mobility likely increase household infections. We suggest instead basing policy on contact patterns, which mediate transmission both for COVID-19 (Liu et al., 2020; Jarvis et al., 2020) and previous epidemics (Edmunds et al., 1997; Wallinga et al., 1999; 2006; Mossong et al., 2008; Hens et al., 2009; Eames et al., 2012). To this end, we explore contact heterogeneity caused by stay-at-home orders, due to the divide between essential workers and the rest of the population. An initial simulation study (Section 4.3) forecasts that decreasing transmission risk between essential workers is 8−35 times as effective as reducing other types of transmission.

## 2 Methods

The dynamics of disease spread are described via the effective reproduction number *R*, which measures the average number of new infections caused by each infected person *i*. To quantify household transmission, we decompose *R* into two components *R* = *R*_{c} + *R*_{h}. The *community* reproduction number *R*_{c} is the average number of new infections caused by an infected individual *i* outside *i*’s household. The *intra-household* reproduction number *R*_{h} is the average number of new infections caused by an infected individual *i* inside *i*’s household. The ratio measures the fraction of transmission occurring within households.

In addition to *R*_{h}, we consider two measures of intra-household spread that quantify individual risk of infection. The *household secondary attack rate* (SAR) is the probability an infected person *i* infects a specific household member *j*. Additionally, we define the *household conditional risk of infection* (CRI) to be the probability that *j* is infected, conditioned on household member *i* being infected. In contrast to the SAR, the CRI allows for *j* having been infected by someone other than *i*, and so CRI can be estimated without attributing infections to particular primary cases.

### 2.1 Challenges in estimating *R*_{h}, SAR and CRI from empirical data

Estimating the growth and prevalence of household cases requires detailed household-level data. The relevant information to estimate *R*_{h}, SAR, and CRI includes:

Testing results for all members of a household, including both positive and negative cases.

Infection attribution within a household, which involves identifying the primary cases in a household and constructing the household transmission chains.

Clinical outcome at the end of the study. If a case is still infectious at the end of a study the data might undercount the number of infections attributed to the case.

Assuming accurate household level data we can estimate *R*_{h}, SAR, and CRI as follows:

*R*_{h}can be estimated as the ratio of secondary household cases to total household cases. It requires accurate attribution of infections within households, but does not require negative case counts.SAR can be estimated as the ratio of secondary household cases to total susceptible household members. It requires accurate attribution as well as negative test counts in households.

CRI can be estimated as the ratio of total number of infected pairs in households to total number of household pairs. It requires positive and negative case count at household level but not direct attributions.

In practice, estimating these quantities is made difficult by asymptomatic infections and false negatives. Existing evidence suggests that asymptomatic cases constitute 18-43% of all infections and may have different households transmission dynamics than symptomatic cases (Lavezzo et al., 2020; Gudbjartsson et al.; Ferretti et al., 2020; Streeck et al., 2020). However most studies predominantly test symptomatic individuals, which leads to biases in the estimates. We can correct the estimates to account for asymptomatic cases. We calibrate our corrections on studies that have tested a representative sample of the population, as in studies of Vo’, Italy and Gangelt, Germany (Lavezzo et al., 2020; Streeck et al., 2020).

Regarding false negatives, recent studies have shown high false-negative rates (FNR) for RT-PCR tests, and this rate varies over the time course of the disease. Kucirka et al. (2020) find an FNR of 20% for patients tested on Day 8 of infection and 67% on Day 4. The average FNR in the 10 days following onset of symptoms is estimated as 17% by Wikramaratna et al. (2020) and 30% by Kucirka et al. (2020). The FNR also varies between different kinds of swab (Wikramaratna et al., 2020) and between different labs (Kucirka et al., 2020). The problem of false negatives can be mitigated by correcting estimates using the FNR (Wikramaratna et al., 2020). However, in recent studies of household transmission (see Section 2.2.1), estimates of SAR have not been corrected for the FNR.

### 2.2 Procedure for estimating *R*_{h}, SAR, and CRI

In order to estimate *R*_{h}, SAR, and CRI and account for asymptomatics and false negatives we propose two approaches. First we adjust estimates from previous studies to account for false negatives and asymptomatics (see Section 2.2.1). Second, we estimate CRI on the blanket testing dataset from Vo’, Italy (Lavezzo et al., 2020) (see Section 2.2.3). Additionally, we have assembled our own dataset based on contact tracing data from Singapore, which we use to estimate *R*_{h} (see Section 2.2.2).

#### 2.2.1. Adjusting SAR literature estimates

For the SAR, we searched PubMed and Google Scholar to find any previous work that estimates the household SAR from empirical data. We found a total of nine papers. Whenever appropriate, we recalculated the estimated SAR for each study to correct for the false-negative rate (FNR) of RT-PCR testing and the proportion of cases that are asymptomatic (denoted “AR”). We used a hierarchical Bayesian random effects model both for correcting estimates from individual studies and for pooling results to compute a meta-analysis estimate of the SAR. In the model, the SAR for study *i* (denoted `SAR`_{i}) is drawn from a Beta distribution and each study has a false-negative rate `FNR`_{i} drawn from a prior based on estimates in the literature (Section 2.1). The proportion of asymptomatics `AR` is shared across studies and is also drawn from a prior based on existing literature. The likelihood of a household member testing positive is *p*_{i} = `SAR`_{i} ∗ (1 − `FNR`_{i}) ∗ (1 − `AR`) for studies where only symptomatic contacts were tested. To estimate the household reproduction number *R*_{h} for each study *i*, we adjust the total number of secondary cases using FNR_{i} and AR and divide by the number of primary cases. Results are shown in Figure 1 and a full description of the model is found in Appendix B.

#### 2.2.2. Singapore contact tracing data for estimating *R*_{h}

Singapore’s Ministry of Health has collected information about COVID-19 cases tracked via contact tracing (Singapore MOH, 2020). UpCode Academy published this data in an interactive dashboard (Singapore COVID-19 Dashboard, 2020). We extracted associated metadata for each positive case along with a directed graph providing information about the infection source. This resulted in a case and transmission network for 6588 patients. Confirmation dates for cases ranged from January, 23rd to April 19th.

We used the data to construct a transmission graph, where nodes correspond to cases and edges to infections. As terminology, we distinguish *source* cases (which are the cause of infections) from *target* cases (which are infected by sources). We define a *cut-off* date, such that all infections with confirmation date prior to the cut-off are labeled as sources. All the nodes that have an incident edge from a source are labeled as targets. A node in the infection graph can be both source and target. When relationships are known, edge annotations reflect the relationship between source and target cases (e.g. family member, colleague, contact). *R*_{h} can be thus estimated as the ratio of total number of *household* infections to total number of source cases. Since there is no annotation for “household”, we use the annotation “family member” as a proxy.

A schematic infection graph is shown in Figure 2. In the case of real data some of the edges are not direct source-target attributions, but are mediated by ‘Cluster’ nodes. Despite not always having direct infection attributions we can use temporal information to determine the number of sources and targets.

#### 2.2.3. Data from Vo’ for estimating *R*_{h} and CRI

Most of the population in the town of Vo’, Italy was tested both at the start of a lockdown and 14 days later (Lavezzo et al., 2020). The two phases of testing covered 86% (2812 subjects) and 72% (2343 subjects) of the population in Vo’, respectively.

We analyzed data collected in order to estimate *R*_{h} and CRI. Following the same terminology as in the previous section, the date of the second round of testing acts as a cut-off date. Since the data does not trace infections to sources, we assume that all secondary household infections were caused by the primary case.

We estimate the FNR by assuming that symptomatic cases in households with other confirmed infections are positive cases.

## 3 Results

### 3.1 SAR and *R*_{h} estimates from previous studies

Estimates of the SAR for each study are shown in Figure 1 and Table 4 and precise counts and estimates in Tables 3 and 4. Our Bayesian model allows for the possibility of heterogeneity in the SAR across studies and we find substantial heterogeneity. Figure 1 shows that some 95% credible intervals for the SAR do not overlap. The pooled central estimate of the SAR is 30% with 95% credible interval (0%, 67%), which captures the model’s posterior prediction for the SAR for a new study. If we do not correct estimates for false-negatives and asymptomatics, the central estimate is 20% (0%, 43%). The heterogeneity of SAR across studies is partly due to variation in the false-negative rate, which we model but do not observe directly. It may also be due to variation in the cohort of primary cases and their behavior in the household. As a point of comparison, the household SAR has been estimated to be 15% for H1N1 influenza (Carcione et al., 2011) and 8% for SARS (Lau et al., 2004).

One concern about previous studies is that asymptomatics are under-represented among the primary cases. This could also contribute to the observed heterogeneity. Table 4 shows estimates of the CRI based on random population testing. These estimates (0.31 for Gangelt, Germany and 0.50 for Vo’, Italy) are consistent with our central estimate of SAR, which predicts CRI ≈ 0.41. This provides some evidence that our SAR estimates are robust to under-representation of asymptomatics.

### 3.2 Estimates from Singapore contact tracing data

In Singapore, the quality of annotations appears to have degraded over time. Therefore, we limited our analysis to consider source cases that were confirmed prior to March 27th 2020, along with their corresponding target cases (even if the target cases are confirmed later). We believe this cut-off date was early enough to ensure data quality. This subset of the infection graph contains a total of 1076 nodes and 599 edges. Out of the total number of cases, 710 are connected to other known cases forming 111 clusters. Among these, 417 out of 710 are source cases with confirmation date prior to the cut-off. The remaining 366 out of 1076 are not traced to any other case.

The effective reproductive number *R* is the average number of new infections generated by each case. We can estimate *R* from our transmission graph by computing the ratio of total number infections to total number of source cases. The data contains 599 total infections and 417 source cases, which results in an estimated *R* of 1.44 (0.77-2.31). Figure 3 contains estimates for cut-off dates other than March 27th, central estimates for which range between 0.90 and 4.93. Regarding household transmission, there are 108 household infections. Assuming FNR of 20% the corrected estimate for *R*_{h} is 0.32 (0.22-0.42). The corrected central estimate ranges from 0.19 to 0.34 depending on the cut-off date (see Figure 3; also Figure 8 and Table 5 in Appendix A).

### 3.3 Estimates from blanket testing data

In Vo’, Italy, the first round of testing recorded 73 positive cases among 53 households with a total of 137 members. The second round of testing identified 7 more positive cases, 3 of which were in households with prior reported cases. The CRI can be estimated as the ratio between the total number of ordered pairs among infected individuals in the same household (60) and the total number of ordered household pairs where the first individual in the pair is infected (136). This yields a point estimate for.

Assuming that secondary infections are caused by the primary case in a household, there 73 sources, 23 (73+3-53) targets (household infections), and 84 (137-53) susceptible individuals (household contacts). This yields a point estimate for *R*_{h} of . Due to the nature of blanket testing we cannot determine the number of index cases, which makes estimates of SAR unreliable.

#### Correcting for false negatives and asymptomatics

Out of subjects residing in households with at least one conformed case, 7 have experienced symptoms but have tested negative. We can adjust for false negatives by considering the distribution of case counts across households. Assuming an asymptomatic rate of 0.41 Lavezzo et al. (2020) (among confirmed cases), and a false negative rate (FNR) of 22%, the expected number of false negatives in infected households is 6.77, which corresponds to the observed value.

For FNR value of 22%, and average household size of 2.1.^{2}, the corrected estimates are CRI= 0.54 and *R*_{h} = 0.37 ((0.34 − 0.40)). We estimate confidence intervals for *R*_{h} by making normal approximations to the conditional distribution of the corrected *R*_{h} values for each individual. In the case of CRI we cannot accurately quantify uncertainty due to heterogeneity in household sizes and difficulties in estimating the number of index cases in the community. (see Table 2 for other FNR values).

#### Confounds and heterogeneity

We might be concerned that testing decreased the household transmission due to earlier household isolation, as in (Bi et al., 2020). However, if we restrict to early cases (where the first household member has symptoms prior to the start of testing), the adjusted CRI is essentially unchanged at 48%.

Next we might ask about heterogeneity in CRI: perhaps spouses have a very high CRI rate. But among presumed partners (adult co-habitants in the same ten-year age range), CRI was still only 52%. Possibly more important is age: CRI is only 24% among individuals under 50 when adjusting for 15% FNR. There is evidence that younger people have a higher FNR, but even an FNR of 40% would only lead to an adjusted CRI of 33%.

Each of these subgroup analyses is on only a small number of total cases and so should be interpreted with caution. In addition, the Vo’ data may overestimate both *R*_{h} and the CRI due to an older population,

## 4 Discussion

### 4.1 The contribution of household transmission to reproduction number

In the previous section, we estimated the contribution of household transmission to the spread of COVID-19 through SAR, CRI, and the household reproduction number *R*_{h} from a variety of sources. Estimates for *R*_{h} based on contact tracing data and blanket testing data ranged from 0.2-0.37. Here we examine the extent to which this household transmission explains overall transmission levels. More formally, we study the ratio of the effective household reproductive number *R*_{h} to the effective reproductive number *R*.

We estimated *R* across geographic regions where enough data was available^{3} for both pre- and post-lockdown time periods (Figure 4). Specifically, we used an overdispersed (*α* = 0.1) Poisson model (Yadlowsky et al., 2020) fit to death data (Dong et al., 2020) to estimate the growth rate, which was translated into *R* using the generation time distribution *ω*(*t*) estimated by Ferretti et al. (2020).

Based on our results in Table 4, we assume that *R*_{h} = 0.3 for all regions pre-lockdown. After lockdown, location tracking data from Google shows that people spent more time in their households (an 11% increase in the United States (Google, 2020)). Hence we estimate post-lockdown *R*_{h} as 0.3 *M*, where *M* is the increase in average time spent inside residential areas (Google, 2020).

Figure 5 (left) shows a scatter plot of estimated community vs household reproduction numbers for each region we study. Figure 5 (right) shows a histogram of the estimated contribution of household transmission to total reproduction number both pre- and post-lockdown. The share of *R* attributed to household transmission increased from 0-25% pre-lockdown to 25-50% post-lockdown. This shows there could be meaningful benefits from interventions that reduce *R*_{h}.

### 4.2 Implications for household inevitability and containment

We previously identified two common arguments against intervening against household transmission: house-hold inevitability and household containment. We revisit these theories in light of our estimates.

#### Inevitability

Household inevitability posits that once a household member is infected, the other members will also become infected, thus making interventions futile. Our central estimate of the household SAR is 30% with less than 5% posterior probability assigned to the SAR being above 67%. This shows that household transmission is not inevitable. There is also evidence that the SAR can be reduced by behavioral interventions. Wang et al. (2020a) found that the SAR was lower in households where people wore masks at home, cleaned regularly with disinfectant, and avoided close contact with the primary cases. Li et al. (2020) found the SAR was 0% for households where the primary case was isolated on symptom onset. These results comport with what is known about the incubation and generation time (Jing et al., 2020; Ferretti et al., 2020) and the effectiveness of PPE (Feng et al., 2020). However, these results are both observational studies and we await more rigorous experiments.

#### Containment

Household containment posits that people infected in the household (as opposed to the community) are unlikely to cause further community infection. If true, this would imply that interventions to reduce household transmission are less valuable for reducing overall spread.

We can quantify this effect, if it exists, by considering secondary transmission rates from household and community infections. To simplify matters we refer to those infected via household transmission as “household cases” and those infected via community transmission as “community cases.” We then let *R*_{c|h} be the average number of community infections caused by a household case, and *R*_{c|c} the average number of community infections caused by a community case. Household containment asserts that *R*_{c|h}*/R*_{c|c} ≪ 1.

To our knowledge, there is no public dataset that allows us to test household containment empirically. Accurate contact tracing data would enable us to measure *R*_{c|h} and *R*_{c|c} directly by counting cases, but existing data does not reliably distinguish between primary and secondary infections within a household. Instead, we provide an *a priori* analysis. There are two relevant differences between people infected at home and those infected in the community:

Household cases may be demographically different (e.g., they could predominantly be children) or have different levels of exposure to the community (e.g., they could predominantly be non-essential workers).

Household cases may self-isolate earlier than community cases, based on observing that the primary case in the household is symptomatic.

These demographic and contact differences could decrease *R*_{c|h}*/R*_{c|c}: for example, if all household cases are non-essential workers and all community cases are essential workers, the differences in contact would decrease *R*_{c|h}*/R*_{c|c} by a factor of 3.5 (based on self-reported contact data from (Rothwell, 2020)). We also expect early isolation protocols to have an effect. The effect size should depend on the proportion of households that decide to use early isolation, and the probability of secondary transmission under the protocol. The former is dependent on policy and population behaviour, whereas the latter can be estimated from disease characteristics (generation time distribution, incubation time distribution, and asymptomatic rate; c.f. (Ferretti et al., 2020)).

More empirical data on these quantities, or more fine-grained contact tracing and testing data, is needed for better estimates and will facilitate a proper cost-benefit analysis around reducing household transmission.

### 4.3 Other types of contact heterogeneity

Our results highlight the fact that net community mobility gives an incomplete picture of overall transmission, and thus a contact-based view of transmission dynamics (Hens et al., 2009; Eames et al., 2012; Mossong et al., 2008) may be helpful. Contact-based models may be especially relevant in the presence of interventions that drastically shift contact patterns, such as the lockdown orders prompted by COVID-19 (Zhang et al., 2020). For example, lockdowns have amplified differences in contact between essential workers and the rest of the population: a recent Gallup panel (Rothwell, 2020) found that essential workers report 3.5 times as many contacts per day as other individuals. In this section, we briefly show how even outside of households, studying contact may help shed light on the effectiveness of policy interventions.

While the most accurate way to study contact in an epidemic is through a network or graph model (Meyers et al., 2005; Lloyd-Smith et al., 2005; Kiss et al., 2006), prior work has shown that even simple stratified compartmental models can be effective (Wallinga et al., 2006; Jarvis et al., 2020; Xia et al., 2013; Liu et al., 2020; Zhang et al., 2020) at predicting epidemic growth. These models partition the population into groups and are specified by the average level of contact between groups, rather than between individuals; a common choice is to assign groups to age ranges: e.g., the POLYMOD study (Mossong et al., 2008), or in the context of COVID-19, compartmental age models of the UK (Jarvis et al., 2020) and seven Chinese cities (Liu et al., 2020; Zhang et al., 2020).

Here, we use a compartmental model with only two groups: people staying at home (‘low contact’ or LC) and essential workers (‘high contact’ or HC). Transmission under our HC/LC model is determined by a two-by-two contact matrix *M*, where *M*_{ij} is the number of contacts made by a member of group *j* ∈ {*LC, HC*} with members of group *I* ∈ {*LC, HC*}. Since we are not aware of high-quality data measuring these quantities directly during lockdowns, we instead obtain coarse estimates from available survey data. Using self-reported contact data (Rothwell, 2020; Jarvis et al., 2020), we obtain point estimates for the contact matrix in the United States (see Figure 6; methodology, sensitivity analysis, and an analagous estimate based on data from the UK are shown in the Appendix).

A common assumption for respiratory diseases such as SARS-CoV-2 is that exposure levels are proportional to contact levels (Edmunds et al., 1997; Wallinga et al., 1999; 2006). An additional implicit assumption of compartmental models is that infection rates between two groups are independent of the groups themselves. A more accurate model would account for disparities in infection rates (e.g., Zhang et al. (2020) find that social contacts tend to have a higher infection rate than workplace contacts)—this is also true of age-compartmental models, where the relative infectiousness of children and adults is not yet known (Jarvis et al., 2020). Under this assumption, the dominant eigenvalue of the contact matrix is a constant multiple of the reproductive number (more precisely, the contact matrix is a constant multiple of the “next-generation matrix” (Diekmann et al., 1990; den Driessche and Watmough, 1992), whose dominant eigenvalue is *R*). We can therefore estimate how changes in contact affect disease transmission, by looking at how each entry of the contact matrix affects the dominant eigenvalue (Caswell, 2006; Klepac et al., 2020).

Since the two groups in our model are easily identifiable, policymakers can target interventions to specific modes of interaction (e.g. by enforcing more stringent physical distancing in the workplace to reduce HC-HC contact, or providing PPE to workers with public-facing occupations to reduce HC-LC contact). To forecast the effect of such interventions, we consider the decrease in reproduction number *R* caused by a 10% decrease in each type of contact (Figure 7). The results predict that reducing contact between high-contact individuals will have disproportionate effect on reducing overall transmission. Specifically, our point estimate predicts a 10% reduction in contact between high-contact individuals (HC-HC contact) being 35x more effective than a 10% reduction in LC-LC contact, and 8x more effective than a 10% reduction in HC-LC contact.

These predictions are based on rather crude parameter estimates, and could be greatly improved by more direct measurements of contact structures, and by incorporating heterogeneity from other sources (e.g. by also stratifying the model on age or contact location (Jarvis et al., 2020; Liu et al., 2020; Zhang et al., 2020)).

## Data Availability

Public data was used for the analysis. Contact tracing data, which was compiled by automatically aggregating data from an interactive dashboard, is available in a consolidated form in our public codebase.

## A Data Tables

### A.1 Case counts

### A.2 Intra-household reproduction number and SAR estimates

### A.3 Contact tracing estimates by cut-off date

### A.4 Reproduction number estimates by cut-off date

## B. Hierarchical Model for Literature Estimates

We use the following Bayesian Hierarchical Model:

### Data

*N*_{i} − number of household contacts considered in each study

*C*_{i} − number of confirmed cases

**1**_{AR}*i* − indicator; 0 - the study tested asymptomatics, 1 - otherwise

**1**_{FNR}*i* − indicator; 0 - the study corrected for false negatives, 1 - otherwise

### Priors

*FNR*_{i} ∼ *Uniform*(0.15, 0.35)

*AR* ∼ *Uniform*(0.18, 0.43)

*SAR*_{i} ∼ *Beta*(*α, β*)

*α, β* ∼ *Half Flat*()

### Likelihood

*p*_{i} := *SAR*_{i}(1 − *AR* · **1**_{AR}*i*)(1 − *FNR*_{i} · **1**_{FNR}*i*)

equivalently:
*C*_{i}|*SAR*_{i}, *FNR*_{i}, *AR* ∼ *Binomial*(*N*_{i}, *SAR*_{i}(1 − *AR* · **1**_{AR}*i*)(1 − *FNR*_{i} · **1**_{FNR}*i*))

We perform inference via MCMC sampling using PyMC3^{4} with the the built-in NUTS Hoffman and Gelman (2014) sampler. We use 4 chains with 6000 iterations each. The burn-in period is 2000.

Figure 10 below contains histograms corresponding to posterior samples of *SAR* and *R*_{h} as well as the posterior samples of the mean of *SAR* and *R*_{h} respectively.

## C. Contact Matrix Estimation

We thus focus on a simple model analagous to the age stratification model, but with just two groups: high-contact individuals (HC) and low-contact individuals (LC). To simplify matters, we will treat the HC cluster as consisting of essential workers only, with the rest of the population belonging to the LC cluster. If we make assumptions analagous to those made in the age stratification model^{5}, then once again the contact matrix *M* determines disease transmission via its dominant eigenvalue and eigenvector (Caswell, 2006). This observation leaves us with four relevant quantities to estimate: *M*_{HC,HC}, *M*_{HC,LC}, *M*_{LC,HC}, and *M*_{LC,LC}. Using self-reported contact data from the US (Rothwell, 2020) we provide estimates for each of these parameters below.

### Total contacts

The most readily available statistic from the available data is the number of total contacts reported by members of each cluster. In the US, Rothwell (2020) find that those who are working report 13.9 contacts per day, whereas those not working report 4.0 contacts per day.

### Population proportions

Allen et al. (2020) suggest that essential workers represent 40% of the US workforce, which as of 2018 represents 63% of the US population^{6}. Since 29% of the US workforce is reportedly equipped to work from home^{7}, we estimate that 18% of the population is doing essential work.

### Estimating *M*_{HC,HC}

This parameter corresponds to the number of essential workers that an essential worker will have close contact with, on average. This is equal to the number of contacts an essential worker has with co-workers, added to the average number of contacts an essential worker will have with other essential workers in the general population.

For coworker interactions, Rothwell (2020) report that 9.9 of the 13.9 interactions had by essential workers were at their places of work, or 71% of all contacts. This is likely to be an overestimate of coworker interaction, since it includes interactions with low-contact individuals at the essential worker’s workplace (e.g. clients at a bank or shoppers at a grocery store). To get a reasonable lower bound, we use the number of workplace contacts reported by essential workers in sectors that are not public-facing, namely construction, agriculture, and communications. Workplace contacts for these sectors range from 6.1-7.4, and so we take 6.8 as a central estimate, or 49% of all contacts. This is likely an underestimate, since it does not take into account high-contact jobs such as those in food processing or assembly lines.

To get *M*_{HC,HC}, we have to add the number of interactions with coworkers to the number of interactions with essential workers in the general population. To estimate the latter, we take the remaining number of total essential worker contacts (e.g. if we estimate 9.9 coworker contacts, 4.0 remain) and use population proportions to estimate how many of them are with other essential workers (e.g. 0.18 · 4.0 = 0.7).

In the end, we give a central estimate of 9.35 for *M*_{HC,HC} based on the range [8.1, 10.6].

### Estimating *M*_{HC,LC}, *M*_{LC,HC}, and *M*_{LC,LC}

Once we obtain an estimate of *M*_{HC,HC}, the remaining parameters can be straightforwardly deduced. First, *M*_{LC,HC} is the number of total essential worker contacts minus *M*_{HC,HC}. This entry uniquely determines *M*_{HC,LC}, since contact is symmetric and thus *w*_{HC} · *M*_{LC,HC} = (1 − *w*_{HC}) · *M*_{HC,LC}, where *w*_{HC} is the proportion of essential workers in the population. After solving for *M*_{HC,LC}, the final entry *M*_{LC,LC} is obtained by subtracting *M*_{HC,LC} from the total number of contacts reported by individuals in the LC cluster.

## Footnotes

1 https://github.com/andrewilyas/covid-household-transmission

↵

^{1}https://github.com/andrewilyas/covid-household-transmission↵

^{2}More accurately, 2.1 was the mean number of individuals in each household who were tested.↵

^{3}Regions were selected from a pool of all US states and over 100 other regions, and were included based on (a) availability of lockdown dates, (b) having enough pre-lockdown data to estimate*R*reliably. The exact calculations are given in our code release.↵

^{5}Concretely, that transmission depends only on the number of cluster-specific contacts, the infection probability of each contact, and the contact duration↵

^{6}https://www.bls.gov/emp/tables/civilian-labor-force-participation-rate.htm