## Abstract

We combine COVID-19 case data with demographic and mobility data to estimate a modified susceptible-infected-recovered (SIR) model for the spread of this disease in the United States. We find that the incidence of infectious COVID-19 individuals has a concave effect on contagion, as would be expected if people have interrelated social networks. We also demonstrate that social distancing and population density have large effects on the rate of contagion. The social distancing in late March and April substantially reduced the number of COVID-19 cases. However, the concave contagion pattern means that when social distancing measures are lifted, the growth rate is considerable but will not be exponential as predicted by standard SIR models. Furthermore, counties with the lowest population density could likely avoid high levels of contagion even with no social distancing. We forecast rates of new cases for COVID-19 under different social distancing norms and find that if social distancing is eliminated there will be a massive increase in the cases of COVID-19, about double what would occur if the US returned only 50% of the way to normalcy.

## Introduction

As COVID-19 spreads across the world and the United States, governments and individuals have worked to slow the growth of the disease by reducing the extent to which people leave their homes. In the United States, these actions have largely been taken by households who have voluntarily stayed home unless they needed to travel, but they have also been bolstered by orders of local and state governments. These orders have occurred over different time periods and have taken many different forms, but they share common features: limiting gathering sizes, closing schools, and shutting down non-essential businesses or shifting their operations to a contact-free experience. That said, many areas of the country never had any form of stay-at-home order.

Ultimately, the purpose of the stay-at-home orders is to reduce the amount of contact between people in order to slow the growth of COVID-19, which is thought to be spread primarily through droplets that require being within a relatively small distance of an infected person. In this paper, we first measure the extent to which social distancing reduces the speed at which COVID-19 spreads. We then run simulations of how COVID-19 will spread over time under different policy regimes.

We find that COVID-19 spreads less than proportionately with the number of contagious individuals. We also observe that social distancing during late March and April significantly reduced the spread of the disease. Higher population density also leads to an increased spread of COVID-19.

Our model gives good out-of-sample forecasts of the disease for the two and a half weeks after the end of our mobility data, assuming that the country continues the nearly 50% return to normalcy observed at the end of April (as compared to the observed peak social distancing levels). We forecast that completely opening up the country to 100% of the pre-shutdown levels of social interaction will lead to 4 million additional COVID-19 cases (officially diagnosed) by the end of September 2020, corresponding to a doubling of the cases we would expect if the country continued on the path of 50% return to normalcy we observe at the end of April. However, there is great heterogeneity among counties, and according to our simulations 44% of the counties could open up while still experiencing a low infection rate of less than 0.1% over a 3-month period. These counties all have low population densities.

## Model

The model we estimate is a simplified version of a susceptible-infected-recovered (SIR) model. We assume that

$$y_{i,t} = R_{i,t}\, S_{i,t}\, \left(Y_{i,t-2} - Y_{i,t-8}\right)^{\omega} \qquad (1)$$

where $y_{i,t}$ is the number of individuals who are infected in county $i$ on day $t$, $R_{i,t}$ is the rate at which infectious individuals in the county transmit the disease, $S_{i,t}$ is the percentage of the county population that is susceptible to COVID-19 (i.e., the share of people who have not yet had COVID-19), and $Y_{i,t}$ is the cumulative number of individuals who have been infected by day $t$. The $Y_{i,t-2} - Y_{i,t-8}$ term reflects our assumption that infected individuals are contagious from the second day after they catch the virus through the seventh day, leading to a serial interval of 4.5 days ^{1}. This treatment of the infectious population is an approximation to the standard SIR models, where the infectious population is typically modeled as a stock that has an outflow at a constant rate. This assumption makes the estimation much easier with the large number of fixed effects we include in our model, and as a practical matter this assumption only has a minimal impact on our estimates of the contagion of COVID-19. In the supplemental appendix we show that we get extremely similar results if we take the time of contagiousness to be 14 days ($Y_{i,t-2} - Y_{i,t-16}$) instead of 6 days.
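As a minimal sketch of the contagious-window term, assuming a county's cumulative case series is stored as a plain list indexed by day, the stock of contagious individuals is a difference of two lagged cumulative counts. The function names and the sample numbers below are illustrative and do not come from the paper.

```python
def contagious_stock(cumulative, t, start_lag=2, end_lag=8):
    """Y[t - start_lag] - Y[t - end_lag], with days before the series counted as zero."""
    def at(day):
        return cumulative[day] if day >= 0 else 0
    return at(t - start_lag) - at(t - end_lag)


def mean_serial_interval(first_day=2, last_day=7):
    """Average day of transmission if contagion is uniform over the window."""
    days = range(first_day, last_day + 1)
    return sum(days) / len(days)


# Hypothetical cumulative counts for one county (illustrative numbers only).
Y = [1, 2, 4, 7, 11, 16, 22, 29, 37, 46, 56]
print(contagious_stock(Y, 10))   # 33, i.e. Y[8] - Y[2]
print(mean_serial_interval())    # 4.5
```

The window covers people infected between two and seven days earlier, which is why averaging days 2 through 7 reproduces the 4.5-day serial interval.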

The main difference between this model and a standard SIR model is that a standard SIR model constrains *ω* = 1. We show in the supplemental appendix that the estimated model with this constraint does not perform well out of sample. We instead find that *ω* < 1. This shows that the marginal impact of one more sick person diminishes as more and more people are sick. There are several reasons why this may be expected, with the greatest reason being that contagious individuals may end up endangering many of the same group of unexposed individuals. One might expect this to be the case if people often have the same or overlapping groups of friends or acquaintances. We see some of this directly when, for example, cases are clustered within households, nursing homes, or places of work. In the accompanying supplemental appendix we present a networking model and show that we would get *ω* < 1 if people have interconnected networks of contact.

In order to better understand the variation of the rate of contagion, we allow *R _{i,t}* to vary according to a number of factors instead of treating it as a constant parameter. Thus, we model
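One functional form consistent with the description that follows is sketched here. It assumes an exponential link, so that taking logarithms yields a linear regression, and it uses $\gamma$ and $\delta$ as placeholder names for the coefficients on social distancing and temperature; neither assumption is confirmed by the surrounding text.

$$R_{i,t} = \exp\left(\alpha_i + \beta_t + \gamma\, d_{i,t} + \delta\, h_{i,t} + \varepsilon_{i,t}\right)$$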

This specification implies that transmission rates can differ across counties (the county fixed effects $\alpha_i$ reflect different population densities and also different demographic compositions), time periods (date fixed effects $\beta_t$ are included mostly to accommodate different rates of testing and also the different rates of reporting that happen on weekdays vs. weekends), levels of social distancing $d_{i,t}$, and different temperatures, $h_{i,t}$, the impact of which has been debated ^{2}, ^{3}, ^{4}. The social distancing measure, $d_{i,t}$, is based on cellphone GPS location data that are provided by SafeGraph, and are available for free to researchers studying COVID-19. We measure social distancing as the fraction of phones that stay exclusively at home during a given day.

The $\varepsilon_{i,t}$ term is our statistical error term. Equation (1) is estimated by taking the logarithm of both sides, with the details in the appendix. Note that the social distancing level by individuals, as well as social distancing regulations, are not determined in a vacuum. Rather, we observe that people social distance more in areas that are harder hit by COVID-19. Thus, the $\varepsilon_{i,t}$ term may be correlated with the social distancing measures, causing a biased, underestimated impact of social distancing on slowing the spread of the disease. We control for this endogeneity bias by estimating the model using an Instrumental Variables (IV) technique, where we use the amount of rain as an instrument for social distancing. Specifically, we assume that rain directly shifts the level of social distancing, but is not correlated with $\varepsilon_{i,t}$.

## Results

To estimate the model parameters, we use county-level officially confirmed COVID-19 daily case data of 2,704 US counties or county-equivalents from February 3 to April 28. We omit the data from New York, New Jersey and Connecticut, as explained in the technical appendix. We append the data with daily county-level weather data as well as cellphone mobility data provided by SafeGraph. The results are presented in Table 1. We find that social distancing indeed decreases the growth rate of COVID-19: Moving from the observed mean pre-COVID level of social distancing (0.25) to the post-COVID peak level (0.40), the magnitude of *R* is reduced by 56%.
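The 56% figure can be related to an underlying coefficient with a short calculation. The sketch below assumes that $R$ depends on social distancing through an exponential term $\exp(\gamma d)$, where $\gamma$ is a placeholder name for the social distancing coefficient; the function name is likewise illustrative.

```python
import math


def implied_distancing_coefficient(reduction, d_before, d_after):
    """Solve exp(gamma * (d_after - d_before)) = 1 - reduction for gamma."""
    return math.log(1.0 - reduction) / (d_after - d_before)


gamma = implied_distancing_coefficient(0.56, 0.25, 0.40)
print(round(gamma, 2))   # -5.47
```

Under these assumptions the implied coefficient is roughly -5.5 per unit of the stay-at-home fraction. The estimate reported in Table 1 remains the authoritative value.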

We also find that the exponent on the number of contagious people, *ω*, is 0.47. It is significantly lower than 1, the exponent assumed in a standard SIR model. This shows that there is a strongly concave relationship between the number of infected people and the rate at which the disease spreads. This level of concavity also implies that while initial outbreaks of COVID-19 will expand exponentially, they will quickly turn to a slower rate of growth. The growth looks linear or even plateauing when plotted cumulatively, although the disease will persist for a long period of time and continue building a substantial number of cases. This may explain why the recent growth rate of COVID-19 cases has slowed considerably after a quick take-off, and yet this growth has persisted. We also find that higher temperatures may slow the spread of the virus, but with a much smaller impact.
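To illustrate how the exponent shapes the epidemic curve, the sketch below iterates the contagion equation for a single hypothetical county. The transmission rates (0.3 and 5.0), the population of one million, and the single seed case are arbitrary choices made for illustration; they are not estimates from our data.

```python
def simulate_daily_cases(omega, rate, days, population=1_000_000.0, seed=1.0):
    """Iterate y_t = rate * S_t * (Y_{t-2} - Y_{t-8}) ** omega for one hypothetical county."""
    daily = [seed]        # new cases on day 0
    cumulative = [seed]   # running total Y_t

    def cum(day):
        # Cumulative cases on `day`, taken as zero before the outbreak starts.
        return cumulative[day] if day >= 0 else 0.0

    for t in range(1, days):
        contagious = cum(t - 2) - cum(t - 8)
        susceptible_share = 1.0 - cumulative[-1] / population
        new_cases = rate * susceptible_share * contagious ** omega
        # Never infect more people than remain uninfected.
        new_cases = min(new_cases, population - cumulative[-1])
        daily.append(new_cases)
        cumulative.append(cumulative[-1] + new_cases)
    return daily


standard = simulate_daily_cases(omega=1.0, rate=0.3, days=101)
concave = simulate_daily_cases(omega=0.47, rate=5.0, days=101)
for day in (30, 60, 100):
    print(day, round(standard[day], 1), round(concave[day], 1))
```

With the exponent fixed at 1, daily cases keep multiplying until the susceptible pool is depleted. With an exponent of 0.47, daily cases rise quickly and then settle near a steady level, so cumulative cases grow roughly linearly.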

Most of the variation of the contagion rates $R_{i,t}$ is captured by our county-level fixed effects, $\alpha_i$, in Equation (2). In order to understand the drivers of the contagion rates, we run a regression of the county fixed effects $\alpha_i$ against county-level demographics. It has been shown in ^{5} that when the number of fixed effects is large, these coefficients can be treated as data for the purpose of statistical inference. The results are reported in Table 2. We find that population density is a crucial factor influencing the spread of the disease. In fact, COVID-19 would be expected to never flare up beyond a very small base level in some areas with sufficiently low population densities, as we discuss below. We also observe that greater concentrations of Black and Hispanic residents and public transit commuters are associated with higher contagion rates. Interestingly, higher median incomes are related to greater contagion. We are uncertain what drives this result: one possibility is that people with higher incomes may interact more with nearby cities that have more outbreaks. We also include the share of seniors (age 70+) and children (below age 18) in the population. Seniors are marginally more likely to spread the disease, but children show no sign of having a lower rate of infecting people, confirming the finding in ^{6}.

## Prediction and Forecasting

Using our model, we simulate future cases beyond our sample period. First, to examine how our model performs, we predict the out-of-sample case numbers from the end of our data period up to May 16, 2020, under different social distancing assumptions. We start by forecasting the cumulative COVID-19 cases if each county continued the social distancing at the levels observed at the end of April. Taking the observed February level as normalcy, the end-of-April level is halfway between the peak lock-down level and normalcy (so we say that such a level is at 50% of normalcy). We also implement the same exercise under several other social distancing benchmarks. These benchmarks are defined specifically in the supplemental appendix. The results appear in Figure 1. We observe that our model forecasts the pattern of disease contagion well if the level of social distancing in early May remained at the level seen at the end of April (50% return to normalcy).

Finally, we forecast how the disease will evolve up to September 30, 2020 under different reopening strategies. The cumulative and daily cases appear in Figures 2 and 3, respectively. The cyclicality observed in Figure 3 reflects the variation we observe in the weekly data, which may reflect different reporting delays or different social-distancing behavior due to the day-of-the-week effect. The forecast shows that social distancing matters, but that the impact of increased mobility becomes higher as we move closer to normalcy. If social distancing is eliminated, we observe that the largest effects will be felt in the first two months. This occurs because of the shrinkage of the uninfected population in each county. Note that cases will be elevated to almost double the daily rate that we would observe under a 50%-return-to-normalcy even into September 2020, when cases are likely to reach an almost steady weekly level. Our estimates suggest that it will be difficult to return to school and normalcy in the fall of 2020 without sparking a large outburst of COVID-19. Ultimately, moving from the 50%-return-to-normalcy to a full return to normalcy will lead to 4 million additional confirmed cases. If we assume that confirmed cases are only 10% of actual cases, and that COVID-19 has a 0.75% infection-fatality rate (IFR), as we justify in the supplemental appendix, we would expect 300,000 to 600,000 deaths by the end of September 2020 if social distancing occurs at 50% to 100% levels of normalcy, respectively. We note, however, based on our forecast that 44% of the counties in the sample (1,196) could completely reopen and still experience a confirmed case rate lower than 0.1% from June to August, 2020. These counties are less populated and account for less than 15% of the population in our sample. We also note that our analyses do not consider the positive effects of alternative preventive protocols such as wearing face masks and better hand washing. Such protocols may help slow the contagion process.
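The death figures follow from the two stated assumptions by simple arithmetic. The sketch below uses an illustrative function name and takes the confirmed-case share (10%) and the IFR (0.75%) from the text; the 4 million and 8 million inputs are the confirmed-case totals implied by the 300,000 and 600,000 death figures.

```python
def expected_deaths(confirmed_cases, confirmed_share=0.10, ifr=0.0075):
    """Scale confirmed cases up to estimated infections, then apply the IFR."""
    infections = confirmed_cases / confirmed_share
    return infections * ifr


for confirmed in (4_000_000, 8_000_000):
    print(confirmed, round(expected_deaths(confirmed)))
```

Each additional 4 million confirmed cases therefore corresponds to about 300,000 additional deaths under these assumptions.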

## Conclusion

Taken together, we demonstrate that the rate of spread of COVID-19 in the United States is concave in the number of contagious individuals. This explains why the growth rate of COVID-19 cases has been slower than expected given the initial exponential growth, above and beyond the effect from social distancing. We empirically identify the substantial impact of social-distancing on combating the pandemic. We also forecast how COVID-19 will evolve in the future, and the timing over which different parts of the country will reach their peaks and how the patterns may affect our reopening strategies.

## Data Availability

Most of the data we use are publicly available from multiple sources, and we lay out our sources and where researchers can access these data. Cell phone stay-at-home rates are from SafeGraph, which makes them available for free to all researchers studying COVID-19. We cannot share these data, but they are widely available, so the paper is auditable.

## Author contributions statement

All authors formulated the empirical model. M.L. and S.Y. analyzed the data. All authors drafted and reviewed the manuscript.

## Acknowledgments

We thank SafeGraph Inc. for providing the mobility data analyzed in this study.

## Technical Appendix: Forecasting the Spread of COVID-19 under Different Reopening Strategies

### Introduction

In this online appendix, we first present our data. We then discuss our model and assumptions. Finally, we present our assumptions for the simulations.

### Data

Our data come from a multitude of sources. We lay out the sources for each of these in turn.

#### Positive Cases

Data of positive cases are based on the COVID-19 data published by the New York Times (https://github.com/nytimes/covid-19-data, accessed on May 17, 2020). The data contain the daily confirmed case counts in the U.S. at the county or county-equivalents level. We exclude cases in the states of New York, New Jersey, and Connecticut due to the large outbreak there and the complicated relationship between New York City (which is the seat of 5 counties) and the surrounding counties. We drop 3 of the remaining counties because we do not have social distancing data for 2 of them, and we cannot match the demographic data for a third (Oglala Lakota County, SD). This reduces the number of counties to 2,778. In addition, we remove 74 counties that had no confirmed cases in the entire sample period. The remaining 2,704 counties and county-equivalents constitute our main sample. These counties account for 89.6% of the total U.S. population and 55.2% of the total U.S. confirmed cases as of April 23, 2020.

There are a few days on which negative case counts are reported. These are generally corrections to previous over-reporting. Thus, we clean the negative case counts by subtracting the absolute value of the negative count from the preceding day. If that leads to a negative count for the preceding day, we iterate again.
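The cleaning rule can be sketched as follows, assuming each county's daily new-case series is a list ordered by date. The function name is illustrative, and a negative value that reaches the first day of the series is left in place because the text does not say how such cases are handled.

```python
def clean_negative_counts(daily_cases):
    """Push negative daily counts back onto earlier days until none remain."""
    cleaned = list(daily_cases)
    # Walk backwards so that a correction can cascade over several earlier days.
    for day in range(len(cleaned) - 1, 0, -1):
        if cleaned[day] < 0:
            cleaned[day - 1] += cleaned[day]   # subtract |negative| from the day before
            cleaned[day] = 0
    return cleaned


print(clean_negative_counts([5, 3, -4, 6]))   # [4, 0, 0, 6]
```

The total number of cases is unchanged by the procedure; only their timing is shifted to earlier days.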

COVID-19 also has an incubation period of approximately 5 days ^{1,2}. Because of this lag between when a person gets sick and when they are diagnosed with COVID-19, we assume that the cases reported on a particular date actually measure the COVID-19 infections from 5 days earlier. We also assume that the true number of cases is approximately 10 times the number of diagnosed cases. We get this number by assuming that the Infection Fatality Rate (IFR) is 0.75% ^{3}. We also assume that any deaths that occur happen 14 days after the confirmed test result. On May 16, 2020, the last day of our confirmed case data, there were 88,660 deaths in the US. On May 2, 2020, there were 1,138,961 officially diagnosed cases. We hence obtain the factor as (88,660/0.0075)/1,138,961 = 10.4. We round this number to 10. Our estimates are not sensitive to the specific factor we use.
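The scaling factor and the date shift can be reproduced with a few lines. The function and constant names below are illustrative; the inputs are the figures quoted above.

```python
from datetime import date, timedelta

IFR = 0.0075               # assumed infection fatality rate
REPORTING_LAG_DAYS = 5     # assumed lag from infection to reported case


def underreporting_factor(deaths, confirmed_cases_two_weeks_earlier, ifr=IFR):
    """Ratio of infections implied by deaths to officially diagnosed cases."""
    implied_infections = deaths / ifr
    return implied_infections / confirmed_cases_two_weeks_earlier


def infection_date(report_date, lag_days=REPORTING_LAG_DAYS):
    """Date on which cases reported on `report_date` are assumed to have occurred."""
    return report_date - timedelta(days=lag_days)


factor = underreporting_factor(88_660, 1_138_961)
print(round(factor, 1))                    # 10.4
print(infection_date(date(2020, 2, 3)))    # 2020-01-29
```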

#### Social Distancing

We use social distancing data from the company SafeGraph, which collects cellphone GPS data from U.S. residents, and has made them available for free to academics studying COVID-19. These data are collected through a series of pings that the company receives for all users who have installed a number of smartphone apps. The list of apps that collect this information is kept as a trade secret. For each county, we use the fraction of cellphones that stayed near home for the whole day as our measure of social distancing. The SafeGraph data are published at the Census Block Group level. To accommodate other data sources which are available at a less granular level, we aggregate this variable to the county level by taking the weighted median, using the number of cellphones in each Census Block Group as the weight.
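A weighted median can be computed as in the sketch below. Several conventions exist for ties; this version returns the smallest value at which the cumulative weight reaches half of the total, which is an assumption on our part rather than a documented detail of the aggregation. The function name and the sample values are illustrative.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("no observations supplied")


# Hypothetical stay-at-home fractions and device counts for three block groups.
print(weighted_median([0.30, 0.45, 0.38], [100, 50, 250]))   # 0.38
```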

#### Demographic data

We obtain the demographic data from the Census Bureau’s 2014-2018 American Community Survey (ACS), which contains information on each county’s population, ethnicity, age, median income, and commuting patterns. The ACS, however, does not report population densities. SafeGraph, the company that provides us with the social distancing data, also maintains a dataset of the land area of each Census Block Group in the US. We aggregate the land areas to the county level. Together with the county population information from the Census Bureau, we are able to construct the population density data of each county.

#### Weather data

We gathered historical daily rain and temperature data from the National Oceanic and Atmospheric Administration (NOAA) (source: https://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.ncdc:C00861/html, accessed on May 21, 2020). The raw weather data are at the weather station level, and we match weather stations to the counties they are in. We use the average values across weather stations within the same county to construct the weather variables for that county. For a small number of counties where there are no associated weather stations, we use the daily state averages as proxies.

#### Voting data

We obtained county-level voting data from https://public.opendatasoft.com/explore/dataset/usa-2016-presidential-election-by-county/table/?disjunctive.state (accessed May 20, 2020). This data is explained at https://github.com/Deleetdk/USA.county.data (accessed May 20, 2020), and the election vote totals originally came from the New York Times.

#### Shelter-in-place orders data

Shelter-in-place (SIP) orders data are compiled by Keystone, a strategy and economics consulting firm. The company collects the SIP data and distributes them for free to researchers studying COVID-19 (https://www.keystonestrategy.com/coronavirus-covid19-intervention-dataset-model/, accessed on May 21, 2020).

#### Putting it all together

Our sample is an unbalanced panel because counties start to have a positive number of confirmed cases on different dates. The earliest date we observe in the sample is Jan 29, 2020, and the last day is April 23, 2020. Note that we construct actual cases using reported cases 5 days later, and thus the corresponding sample period based on reported cases is Feb 3, 2020 to April 28, 2020.

Summary statistics of all of the variables we use in the estimation are presented in Table A1. Note that our case data proceed past the dates used for estimating the model and are up to May 16, 2020. We use those data for validating the model. Those data are publicly available, but we are happy to supply summary statistics for this hold-out sample upon request.

### Empirical Analysis

#### Detail of Empirical Specification

In this subsection, we detail the assumptions we make and the estimation procedure. As noted in the main paper, the model we estimate is a modified version of the standard susceptible-infected-recovered (SIR) model:

$$y_{i,t} = R_{i,t}\, S_{i,t}\, \left(Y_{i,t-2} - Y_{i,t-8}\right)^{\omega}$$

where $y_{i,t}$ is the number of individuals who are infected in county $i$ on day $t$, $R_{i,t}$ is the rate at which infectious individuals in the county transmit the disease, $S_{i,t}$ is the percentage of the population that has not yet had COVID-19 and is thus susceptible to it, and $Y_{i,t}$ is the cumulative number of individuals who have been infected up until day $t$. This model differs from the standard SIR in two key ways. First, the standard SIR model constrains $\omega = 1$. We discuss in the paper that there are theoretical reasons to believe that the correct model of transmission involves $\omega < 1$. As an example, we present a network model below that demonstrates that $\omega < 1$ is possible even using the conventional transmission mechanism. Second, the standard SIR model does not specify a discrete time frame over which the infected individuals are contagious, but rather builds a stock of infected individuals and assumes that these individuals exit their infected period at a fixed rate. We view our model as an approximation of this process, which greatly eases our estimation and allows us to easily add important variables to explain the contagion process in our analysis. We use a 6-day infectious window and an assumed latent period of 2 days ($Y_{i,t-2} - Y_{i,t-8}$). This gives a mean serial interval of 4.5 days, which is close to several estimates ^{4}. In this appendix, we also present the results where we use a 14-day window ($Y_{i,t-2} - Y_{i,t-16}$), and show that results are similar.

We assume that the rate of transmission, $R_{i,t}$, varies with a set of factors, which include county-level fixed effects, date fixed effects, the measure of social distancing, and daily average temperature. The county fixed effects account for differences in demographics across counties, such as the demographics shown in Table 2 of the main paper as well as other unobservable county-specific factors. The date fixed effects account for both day-of-the-week differences in the patterns of travel for people (e.g., the time away from the house to go to work or to go to the park, which may lead to different exposures to the disease) as well as differences in the rate of testing and reporting that occur across time. We assume that the errors $\varepsilon_{i,t}$ are uncorrelated across counties. We further assume that $\varepsilon_{i,t}$ is uncorrelated across time, although we cluster the standard errors by county.

We estimate the model by taking the logarithm of both sides. After rearranging we get:
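Assuming that $R_{i,t}$ is an exponential function of the county and date fixed effects, social distancing, temperature, and the error term, with $\gamma$ and $\delta$ as placeholder names for the coefficients on social distancing $d_{i,t}$ and temperature $h_{i,t}$, the log-linear estimating equation would take a form such as:

$$\ln y_{i,t} - \ln S_{i,t} = \alpha_i + \beta_t + \gamma\, d_{i,t} + \delta\, h_{i,t} + \omega \ln\left(Y_{i,t-2} - Y_{i,t-8}\right) + \varepsilon_{i,t}$$

The susceptible share enters with a coefficient fixed at one, so it moves to the left-hand side, and $\omega$ becomes the slope on the log of the contagious stock.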

Note that $y_{i,t}$, the diagnosed case number, is 0 for some counties on some dates. Therefore, we adjust this formula slightly by adding 1 to $y_{i,t}$, so that the dependent variable uses $\ln(y_{i,t} + 1)$ and the logarithmic values are always well-defined.

In some counties, $Y_{i,t-2} - Y_{i,t-8}$ is 0 for some periods. We do not use those observations for estimation. Note that because this is a lagged variable, this is a selection based on independent variables and not based on dependent variables, and hence it does not bias our estimation.

One concern that can arise in estimating this model is that the amount of social distancing is likely to be correlated with the error terms, $\varepsilon_{i,t}$, in the regression. We address this concern using an Instrumental Variable (IV) approach, which requires that we find a variable that affects social distancing but is not correlated with $\varepsilon_{i,t}$. We use the amount of rain (measured in mm) as a shifter of social distancing that does not directly cause COVID-19 to spread. We run a first-stage regression of social distancing to test whether this instrument has much power. The $F$-statistic for this test is 435.43, indicating that this is a strong instrument. The main estimation results are presented in Table 1 of the paper and replicated in column 1 of Table A2 in this appendix.
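The logic of the IV correction can be illustrated on synthetic data. The sketch below is a deliberately stripped-down version with one regressor, one instrument, and no fixed effects, so it is not our estimation procedure; all variable names and data-generating parameters (including the true slope of -5) are invented for the illustration.

```python
import random


def covariance(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def ols_slope(y, x):
    return covariance(x, y) / covariance(x, x)


def iv_slope(y, x, z):
    """Just-identified IV: equals 2SLS with a single instrument and a constant."""
    return covariance(z, y) / covariance(z, x)


rng = random.Random(0)
n = 20_000
rain = [rng.expovariate(1.0) for _ in range(n)]
severity = [rng.gauss(0.0, 1.0) for _ in range(n)]   # unobserved local outbreak shock
# People stay home more when it rains and when the local outbreak is worse.
distancing = [0.25 + 0.03 * r + 0.02 * s + rng.gauss(0.0, 0.01)
              for r, s in zip(rain, severity)]
# True effect of distancing on log growth is -5; severity also raises growth.
log_growth = [1.0 - 5.0 * d + 0.5 * s + rng.gauss(0.0, 0.1)
              for d, s in zip(distancing, severity)]

print("OLS slope:", round(ols_slope(log_growth, distancing), 2))
print("IV slope: ", round(iv_slope(log_growth, distancing, rain), 2))
```

Because the unobserved shock raises both distancing and case growth, the ordinary least squares slope is pulled away from the true negative effect, while the rain-based IV estimate recovers a value close to -5.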

Research on COVID-19 is nascent, and there are different views of how long infected individuals stay contagious. Suppose that such individuals are contagious for 14 days instead of 6 days. Then the model becomes:

$$y_{i,t} = R_{i,t}\, S_{i,t}\, \left(Y_{i,t-2} - Y_{i,t-16}\right)^{\omega}$$

We present the estimation results of this model in column 2 of Table A2. Note that this regression has more observations because there are fewer instances where we observe no cases in a county for a 14-day window than for a 6-day window. The results are largely unchanged. The coefficient on social distancing levels is slightly lower, but well within one standard error of the corresponding coefficient in column 1. The exponent on the infectious individuals is 0.435. That is somewhat smaller (but statistically different) than the 0.470 we observe with the shorter 6-day window, but overall the curvature shape is similar to what we have observed with the 6-day window. The effect of temperature is slightly smaller but similar. As we will discuss below in Section **Simulation**, both specifications give similar long-run forecasting results.

All county-level demographic factors remain constant over time in our analysis. While our main regression gives many insights, impacts of these demographic factors on the spread of the virus are captured by the county fixed effects. In order to better understand how these factors affect the contagion rate, we next regress the county fixed effects on several demographic variables. The coefficients from this regression should be thought of as the impact of these demographics on the rate of contagion. The results from the model are reported in column 1 of Table A3 (replication of Table 2 in the paper). In column 2 we present the results we would obtain if we instead modeled the contagious period to be 14 days. Similar to Table A2, the results are again very similar under this alternative specification. The only statistically significant difference is that the coefficient on log(population density) is slightly smaller, although the effect is of a very similar magnitude. We also observe that the *R*^{2} of the 14-day contagious period model is slightly lower than the *R*^{2} of the 6-day contagious period model.

#### Explaining Social Distancing

While the focus of this paper is on measuring the contagion effect, we consider an exploratory regression to see what is correlated with the rate at which individuals choose to social distance. The results are shown in Table A4.

The results indicate that the levels of social distancing are driven much more by the number of national cases than the number of local cases. The results also indicate that the fraction of voters who voted for Trump in the 2016 presidential election explains much of the observed levels of social distancing. Going from a solidly Clinton-voting to a solidly Trump-voting county (i.e., going from the minimum to the maximum of Trump vote shares) represents a swing of almost 9 percentage points of social distancing. To put the change into perspective, the social distancing measures range from 11% to 65%, a 54 percentage-point variation.

Demographically, higher population densities lead to lower social distancing, but increased compliance with sheltering-in-place orders. More affluent counties have less social distancing. Counties with a higher fraction of Black population are less likely to social distance, perhaps due to work demands. A greater Hispanic population and a greater public transportation population are both associated with more social distancing. Supporting the idea that much of the traveling is for work, we see that people are less likely to practice social distancing on weekdays than on weekends.

In terms of policy, the closing of schools is the policy that is most-indicative of how much social distancing will occur. We observe that closing public venues is correlated with reduced social distancing, and that limiting gatherings (we treat all gathering size limits of 500-people or less the same, but most of these limits are for far smaller groups than 500 people) has a statistically significant effect on the amount of social distancing.

### Simulation

After estimating our model, we forecast the number of cases that would emerge under different social distancing regimes. For this exercise, we first divide our model’s predicted numbers of “true” cases by 10, which gives us the prediction of diagnosed cases (as described in Section **Positive Cases**). Next, because the 2,704 counties in our sample are a subset of the whole nation, on a given date the predicted diagnosed case number is a fraction of the total cases in the US. While this fraction changes on a daily basis, we use an approximation by taking the median of the daily ratios between diagnosed cases observed in our sample and in the whole nation during our sample period. The median of the daily ratios is 0.63. Accordingly, we divide our predicted diagnosed cases by 0.63 to obtain the national number of diagnosed cases.
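The two rescaling steps can be sketched as follows. The function names and the sample inputs are illustrative; the factor of 10 and the sample share of 0.63 are the values stated above.

```python
from statistics import median


def sample_share(sample_daily, national_daily):
    """Median of daily ratios between in-sample and national diagnosed cases."""
    return median(s / n for s, n in zip(sample_daily, national_daily) if n > 0)


def national_diagnosed(predicted_true_cases, true_per_diagnosed=10.0, share=0.63):
    """Convert predicted 'true' in-sample cases to national diagnosed cases."""
    diagnosed_in_sample = predicted_true_cases / true_per_diagnosed
    return diagnosed_in_sample / share


# Hypothetical prediction of 630,000 true cases in the sample counties.
print(round(national_diagnosed(630_000)))   # 100000
```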

The first step of our simulation involves validating the model: we predict how many cases would emerge in the weeks after our data (April 29 to May 16, 2020). The results are shown in Figure 1 in the original paper. We observe that we are well able to predict the number of observed cases if the social distancing in early May represented a 50% return to normalcy, which is defined as

$$\mathit{FractionStayAtHome}_{i,t} = 0.5 \times \mathit{FractionStayAtHomePeak}_{i} + 0.5 \times \mathit{FractionStayAtHomeBeforeSD}_{i}.$$

$\mathit{FractionStayAtHomePeak}_{i}$ represents the fraction of devices staying home at the peak of social distancing in our data (April 5-April 11, 2020). This variable is county-specific, hence the county subscript $i$. $\mathit{FractionStayAtHomeBeforeSD}_{i}$ represents the observed lowest level at which devices stayed home in county $i$ in February, and $\mathit{FractionStayAtHome}_{i,t}$ represents the fraction of devices staying home in county $i$ on date $t$. We compute the different levels of social distancing accordingly: for example, a 25% return towards normalcy represents social distancing at the level of 0.25 × (minimum social distancing) + 0.75 × (maximum social distancing). We find that the levels of social distancing at the end of April were approximately at the 50% return to normalcy levels. Overall, our model predicts the national cases well.
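The benchmark levels can be generated with a single interpolation. The sketch below adopts the convention that 0% corresponds to the peak stay-at-home fraction and 100% to the February baseline; the function name is illustrative, and the example uses the mean pre-COVID (0.25) and peak (0.40) levels quoted in the main paper rather than any county's actual values.

```python
def distancing_level(share_of_normalcy, stay_home_peak, stay_home_before):
    """Stay-at-home fraction after moving `share_of_normalcy` of the way
    from the peak level back to the February baseline."""
    return ((1.0 - share_of_normalcy) * stay_home_peak
            + share_of_normalcy * stay_home_before)


for share in (0.0, 0.5, 1.0):
    print(share, round(distancing_level(share, 0.40, 0.25), 3))
```

At the 50% benchmark the two endpoints receive equal weight, so the result is the midpoint of the peak and baseline levels.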

We next forecast the cumulative and daily cases of COVID-19 through the end of September at different levels of social distancing. Those forecasts appear in Figures 2 and 3 in the original paper. In Figure A1 below, we replicate Figure 2 in the paper but further add the confidence intervals. To avoid clutter, we only depict the 50% and 100% return-to-normalcy levels in Figure A1.

As a robustness check regarding the 6-day contagion window specification, we also consider forecasting US daily cases under the specification where the contagion window is 14 days. Figure A2 shows the evolutions of daily cases till September 30, 2020 under 50% and 100% return-to-normalcy regimes. We overlay the forecasts of both 14-day and 6-day specifications for easy comparison. From the figure, we may see the forecasts of 14-day and 6-day contagion window specifications are fairly close under the 50% return-to-normalcy level. For the 100% normalcy level, the two specifications differ at the beginning but converge quickly. In the long run, the two specifications give similar forecasts for daily cases.

### Concavity of SIR model and Network Dynamics

One unique feature of our model is that we estimate an exponent on the number of contagious cases. We include this flexibility because such a model fits the data much better, and also leads to forecasts that have more limited growth after an initial take-off of COVID-19 cases, as is commonly observed. To demonstrate our model’s better fit, we compare the prediction of our model against an alternative model with the exponent fixed at 1 as the standard SIR model in Figure A3. At the 50%-return-to-normalcy level that is observed at the end of April, the standard SIR model would predict much higher cumulative case numbers than the actual numbers. In contrast, our model’s prediction fits much better.

We next illustrate that the concave relationship we estimate for the number of contagious individuals on the number of new cases can come from social networks between people. We seek to demonstrate the theoretical feasibility of our results rather than the necessary or sufficient conditions under which the nonlinearity will arise. Thus, we simulate a very simplified model of networks and disease process.

To do this, we simulate a network with the following process: We take 10,000 individuals. We create a network by first randomly assigning that any two individuals will be joined with a common node with probability 0.001999 (corresponding to each person getting almost 20 friends on average). Call these connections “round-1 friends.” We then expand this network by assigning each node to have an edge with each of the round-1 friends of their friends with a probability of 0.6.

We assume that the disease spreads with the following process. We seed 4 individuals to have the disease in period 0. Then in each period we assume that any connected individual will get sick with probability 7/(number of connections), where the number of connections is specific to the individual, and varies due to the random assignment of the people who are connected (This probability is capped at 1 in case someone is randomly assigned fewer than 7 connections, which is very unlikely).

After simulating this process, we then regress $\ln(y_t) - \ln(S_t) = c + \omega \ln(y_{t-1}) + e_t$. We run this simulation 10 times. Across these runs, the estimates of $\omega$ have a range of (0.44, 0.62). This shows the plausibility of network effects leading to an estimate in the range that we have estimated in our main model.