## Abstract

The SARS-CoV-2 pandemic has caused significant mortality and morbidity worldwide, sparing almost no community. As the disease will likely remain a threat for years to come, an understanding of the precise influences of human demographics and settlement, as well as the dynamic factors of climate, susceptible depletion, and intervention, on the spread of localized epidemics will be vital for mounting an effective response. We consider the entire set of local epidemics in the United States; a broad selection of demographic, population density, and climate factors; and local mobility data, tracking social distancing interventions, to determine the key factors driving the spread and containment of the virus. Assuming first a linear model for the rate of exponential growth (or decay) in cases/mortality, we find that population-weighted density, humidity, and median age dominate the dynamics of growth and decline, once interventions are accounted for. A focus on distinct metropolitan areas suggests that some locales benefited from the timing of a nearly simultaneous nationwide shutdown, and/or the regional climate conditions in mid-March; while others suffered significant outbreaks prior to intervention. Using a first-principles model of the infection spread, we then develop predictions for the impact of the relaxation of social distancing and local climate conditions. A few regions, where a significant fraction of the population was infected, show evidence that the epidemic has partially resolved via depletion of the susceptible population (i.e., “herd immunity”), while most regions in the United States remain overwhelmingly susceptible. These results will be important for optimal management of intervention strategies, which can be facilitated using our online dashboard.

## Introduction

The new human coronavirus SARS-CoV-2 emerged in Wuhan Province, China in December 2019 (Chen et al., 2020; Li et al., 2020), reaching 10,000 confirmed cases and 200 deaths due to the disease (known as COVID-19) by the end of January this year. Although travel from China was halted by late-January, dozens of known introductions of the virus to North America occurred prior to that (Holshue et al., 2020; Kucharski et al., 2020), and dozens more known cases were imported to the US and Canada during February from Europe, the Middle East, and elsewhere. Community transmission of unknown origin was first detected in California on February 26, followed quickly by Washington State (Chu et al., 2020b), Illinois and Florida, but only on March 7 in New York City. Retrospective genomic analyses have demonstrated that case-tracing and self-quarantine efforts were effective in preventing most known imported cases from propagating (Ladner et al., 2020; Gonzalez-Reiche et al., 2020; Worobey et al., 2020), but that the eventual outbreaks on the West Coast (Worobey et al., 2020; Chu et al., 2020b; Deng et al., 2020) and New York (Gonzalez-Reiche et al., 2020) were likely seeded by unknown imports in mid-February. By early March, cross-country spread was primarily due to interstate travel rather than international imports (Fauver et al., 2020).

In mid-March 2020, nearly every region of the country saw a period of uniform exponential growth in daily confirmed cases — signifying robust community transmission — followed by a plateau in late March, likely due to social mobility reduction. The same qualitative dynamics were seen in COVID-19 mortality counts, delayed by approximately one week. Although the qualitative picture was similar across locales, the quantitative aspects of localized epidemics — including initial rate of growth, infections/deaths per capita, duration of plateau, and rapidity of resolution —were quite diverse across the country. Understanding the origins of this diversity will be key to predicting how the relaxation of social distancing, annual changes in weather, and static local demographic/population characteristics will affect the resolution of the first wave of cases, and will drive coming waves, prior to the availability of a vaccine.

The exponential growth rate of a spreading epidemic is dependent on the biological features of the virus-host ecosystem — including the incubation time, Susceptibility of target cells to infection, and persistence of the virus particle outside of the host — but, through its dependence on the transmission rate between hosts, it is also a function of external factors such as population density, air humidity, and the fraction of hosts that are susceptible. Initial studies have shown that SARS-CoV-2 has a larger rate of exponential growth (or, alternatively, a lower doubling time of cases^{1}) than many other circulating human viruses (Park et al., 2020). For comparison, the pandemic influenza of 2009, which also met a largely immunologically-naive population, had a doubling time of 5–10d (Yu et al., 2012; Storms et al., 2013), while that of SARS-CoV-2 has been estimated at 2–5d (Sanche et al., 2020; Oliveiros et al., 2020) (growth rates of ∼0.10d^{−1} vs. ∼0.25d^{−1}). It is not yet understood which factors contribute to this high level of infectiousness.

While the dynamics of an epidemic (e.g., cases over time) must be described by numerical solutions to nonlinear models, the exponential growth rate, *λ*, usually has a simpler dependence on external factors. Unlike case or mortality incidence numbers, the growth rate does not scale with population size. It is a directly measurable quantity from the available incidence data, unlike, e.g., the reproduction number, which requires knowledge of the serial interval distribution (Wallinga and Lipsitch, 2007; Roberts and Heesterbeek, 2007; Dushoff and Park, 2020), something that is difficult to determine empirically (Champredon and Dushoff, 2015; Nishiura, 2010). Yet, the growth rate contains the same threshold as the reproduction number (*λ* = 0 vs. *R*_{0} = 1), between a spreading epidemic (or an unstable uninfected equilibrium) and a contracting one (or an equilibrium that is resistant to flare-ups). Thus, the growth rate is an informative direct measure on that space of underlying parameters.

In this work, we leverage the enormous data set of epidemics across the United States to evaluate the impact of demographics, population density and structure, weather, and non-pharmaceutical interventions (i.e., mobility restrictions) on the exponential rate of growth of COVID-19. Following a brief analysis of the initial spread in metropolitan regions, we expand the meaning of the exponential rate to encompass all aspects of a local epidemic — including growth, plateau and decline — and use it as a tracer of the dynamics, where its time dependence and geographic variation are dictated solely by these external variables and per capita cumulative mortality. Finally, we use the results of that linear analysis to calibrate a new nonlinear model — a renewal equation that utilizes the excursion probability of a random walk to determine the incubation period — from which we develop local predictions about the impact of social mobility relaxation, the level of herd immunity, and the potential of rebound epidemics in the Summer and Fall.

## Results

### Initial growth of cases in metropolitan regions is exponential with rate depending on mobility, population, demographics, and humidity

As an initial look at COVID-19’s arrival in the United States, we considered the ∼100 most populous metropolitan regions — using maps of population density to select compact sets of counties representing each region (see Supplementary Material) — and estimated the initial exponential growth rate of cases in each region. We performed a linear regression to a large set of demographic (sex, age, race) and population variables, along with weather and social mobility (Fitzpatrick and Karen, 2020) preceding the period of growth (Figure 1). In the best fit model (*R*^{2} = 0.75, BIC = −183), the baseline value of the initial growth rate was *λ* = 0.21d^{−1} (doubling time of 3.3d), with average mobility two weeks prior to growth being the most significant factor (Figure 1B). Of all variables considered, only four others were significant: population density (including both *population-weighted density* (PWD) — also called the “lived population density” because it estimates the density for the average individual (Craig, 1985)— and *population sparsity, γ*, a measure of the difference between PWD and standard population density, see Methods), *p* < 0.001 and *p* = 0.006; specific humidity two weeks prior to growth, *p* = 0.001; and median age, *p* = 0.04.

While mobility reduction certainly caused the “flattening” of case incidence in every region by late-March, our results show (Figure 1C) that it likely played a key role in reducing the *rate* of growth in Boston, Washington, DC, and Los Angeles, but was too late, with respect to the sudden appearance of the epidemic, to have such an effect in, e.g., Detroit and Cleveland. In the most extreme example, Grand Rapids, MI, seems to have benefited from a late arriving epidemic, such that its growth (with a long doubling time of 7d) occurred almost entirely post-lockdown.

Specific humidity, a measure of absolute humidity, has been previously shown to be inversely correlated with respiratory virus transmission (Lowen et al., 2007; Shaman and Kohn, 2009; Shaman, Goldstein, and Lipsitch, 2011; Kudo et al., 2019). Here, we found it to be a significant factor, but weaker than population density and mobility (Figure 1C). It could be argued that Dallas, Los Angeles, and Atlanta saw a small benefit from higher humidity at the time of the epidemic’s arrival, while the dry late-winter conditions in the Midwest and Northeast were more favorable to rapid transmission of SARS-CoV-2.

### Exponential growth rate of mortality as a dynamical, pan-epidemic, measure

In the remainder of this report, we consider the exponential rate of growth (or decay) in local confirmed deaths due to COVID-19. The statistics of mortality is poorer compared to reported cases, but it is much less dependent on unknown factors such as the criteria for testing, local policies, test kit availability, and asymptomatic individuals (Pearce et al., 2020). Although there is clear evidence that a large fraction of COVID-19 mortality is missed in the official counts (e.g., Leon et al., 2020; Modi et al., 2020), mortality is likely less susceptible to rapid changes in reporting, and, as long as the number of reported deaths is a monotonic function of the actual number of deaths (e.g., a constant fraction, say 50%), the sign of the exponential growth rate will be unchanged, which is the crucial measure of the success in pandemic management.

To minimize the impact of weekly changes, such as weekend reporting lulls, data dumps, and mobility changes from working days to weekends, we calculate the regression of ln [Mortality] over a 14-day interval, and assign this value, *λ*_{14}(*t*), and its standard error to the last day of the interval. Since only the data for distinct 2-week periods are independent, we multiply the regression errors by to account for correlations between the daily estimates. Together with a “rolling average” of the mortality, this time-dependent measure of the exponential growth rate provides, at any day, the most up-to-date information on the progression of the epidemic (Figure 2).

In the following section, we consider a linear fit to *λ*_{14}, to determine the statistically-significant external (non-biological) factors influencing the dynamics of local exponential growth and decline of the epidemic. We then develop a first-principles model for *λ*_{14} that allows for extrapolation of these dependencies to predict the impact of future changes in social mobility and climate.

### Epidemic mortality data explained by mobility, population, demographics, depletion of susceptible population and weather, throughout the first wave of COVID-19

We considered a spatio-temporal dataset containing 3933 estimates of the exponential growth measure, *λ*_{14}, covering the three month period of 8 March 2020 – 8 June 2020 in the 187 US counties for which information on COVID-19 mortality and all potential driving factors, below, were available (the main barrier was social mobility information, which limited us to a set of counties that included 69% of US mortality). A joint, simultaneous, linear fit of these data to 12 potential driving factors (Table 1) revealed only 7 factors with *independent* statistical significance. Re-fitting only to these variables returned the optimal fit for the considered factors (BIC = −5951; *R*^{2} = 0.674).

We found, not surprisingly, that higher population density, median age, and social mobility correlated with positive exponential growth, while population sparsity, specific humidity, and susceptible depletion correlated with exponentially declining mortality. Notably the coefficients for each of these quantities was in the 95% confidence intervals of those found in the analysis of metropolitan regions (and vice versa). Possibly the most surprising dependency was the negative correlation, at ≃ −3.7*σ* between *λ*_{14} and the *total* number of annual deaths in the county. In fact, this correlation was marginally more significant than a correlation with log(population), which was −3.3*σ*. One possible interpretation of this negative correlation is that the number of annual death is a proxy for the number of potential outbreak clusters. The larger the number of clusters, the longer it might take for the epidemic to spread across their network, which would (at least initially) slow down the onset of the epidemic.

### Nonlinear model

To obtain more predictive results, we developed a mechanistic nonlinear model for infection (see Supplementary Material for details). We followed the standard analogy to chemical reaction kinetics (infection rate is proportional to the product of susceptible and infectious densities), but defined the generation interval (approximately the incubation period) through the excursion probability in a 1D random walk, modulated by an exponential rate of exit from the infected class. This approach resulted in a *renewal equation* (Heesterbeek and Dietz, 1996; Champredon and Dushoff, 2015; Champredon, Dushoff, and Earn, 2018), with a distribution of generation intervals that is more realistic than that of standard SIR/SEIR models, and which could be solved formally (in terms of the Lambert W function) for the growth rate in terms of the infection parameters:

The model has four key dependencies, which we describe here, along with our assumptions about their own dependence on population, demographic, and climate variables. As mortality (on which our estimate of growth rate is based) lags infection (on which the renewal equation is based), we imposed a fixed time shift of Δ*t* for time-dependent variables:

We assumed that the susceptible population, which feeds new infections and drives the growth, is actually a sub-population of the community, consisting of highly-mobile and frequently interacting individuals, and that most deaths occurred in separate sub-population of largely immobile non-interacting individuals. Under these assumptions, we found (see Supp. Mat.) that the susceptible density,

*S*(*t*), could be estimated from the cumulative per capita death fraction,*f*_{D}, as: where*D*_{tot}is the cumulative mortality count,*N*is the initial population, and the initial density is*S*(0) =*k*PWD.We assumed that the logarithm of the “rate constant” for infection,

*β*, depended linearly on social mobility,*m*, specific humidity,*h*, population sparsity,*γ*, and total annual death,*A*_{D}, as: where a barred variable represents the (population-weighted) average value over all US counties, and where the mobility and humidity factors were time-shifted with respect to the growth rate estimation window: ℳ =*m*(*t*− Δ*t*) and ℋ =*h*(*t*− Δ*t*).The characteristic time scale to infectiousness,

*τ*, is intrinsic to the biology and therefore we assumed it would depend only on the median age of the population,*A*. We assumed a power law dependence: where we fixed the pivot age,*A*_{0}, to minimize the error in*τ*_{0}.The exponential rate of exit from the infected class,

*d*, was assumed constant, since we found no significant dependence for it on other factors in our analysis of US mortality. From the properties of the Lambert W function, when the infection rate or susceptibility density approach zero — through mobility restrictions or susceptible depletion — the growth rate will tend to*λ ≈*−*d*, its minimum value.

With these parameterizations, we performed a nonlinear regression to *λ*_{14}(*t*) using the entire set of US county mortality incidence time series (Table 2). Compared to the linear model of the previous section (Table 1b), the fit improved by 7.6*σ* (BIC = − 6008; *R*^{2} = 0.724), despite both having 9 free parameters. Through the estimated parameter values, the model makes predictions for an individual’s probability of becoming infectious, and the distributions of incubation period and generation interval, all as a function of the median age of the population (see Supplementary Material).

The model was very well fit to the mortality growth rate measurements for counties with a high mortality (Figure 3). More quantitatively, the scatter of measured growth rates around the best-fit model predictions was (on average) only 13% larger than the measurement errors, independent of the population of the county ^{2}.

Importantly, when the model was calibrated on only a subset of the data — e.g., all but the final month for which mobility data is available — its 68% confidence prediction for the remaining data was accurate (Figure 4) given the known mobility and weather data for that final month. This suggests that the model, once calibrated on the first wave of COVID-19 infections, can make reliable predictions about the ongoing epidemic, and future waves, in the United States.

### Predictions for relaxed mobility restrictions, the onset of summer, and the potential second wave

Possibly the most pressing question for the management of COVID-19 in a particular community is the combination of circumstances at which the virus fails to propagate, i.e., at which the growth rate, estimated here by *λ*_{14}, becomes negative (or, equivalently, the reproduction number *R*_{t} falls below one). In the absence of mobility restrictions this is informally called the threshold for “herd immunity,” which is usually achieved by mass vaccination (e.g., John and Samuel, 2000; Fine, Eames, and Heymann, 2011). Without a vaccine, however, ongoing infections and death will deplete the susceptible population and thus decrease transmission. Varying the parameters of the nonlinear model individually about their Spring 2020 population-weighted mean values (Figure 5) suggests that this threshold will be very much dependent on the specific demographics, geography, and weather in the community, but it also shows that reductions in social mobility can significantly reduce transmission prior the onset of herd immunity.

To determine the threshold for herd immunity in the absence or presence of social mobility restrictions, we considered the “average US county” (i.e., a region with population-weighted average characteristics), and examined the dependence of the growth rate on the cumulative mortality. We found that in the absence of social distancing, a COVID-19 mortality rate of 0.13% (or 1300 per million population) would bring the growth rate to zero. However, changing the population density of this average county shows that the threshold can vary widely (Figure 5).

Examination of specific counties showed that the mortality level corresponding to herd immunity varies from 10 to 2500 per million people (Figure 9). At the current levels of reported COVID-19 mortality, we found that, as of June 22nd, 2020, only 128 *±* 55 out of 3142 counties (inhabiting 9.4 *±* 2.1% of US population) have surpassed this threshold at 68% confidence level (Figure 8). Notably, New York City, with the highest reported per capita mortality (2700 per million) has achieved mobility-independent herd immunity at the 10*σ* confidence level, according to the model (Figure 6). A few other large-population counties in New England, New Jersey, Michigan, Louisiana, Georgia and Mississippi that have been hard hit by the pandemic also appear to be at or close to the herd immunity threshold. This is not the case for most of the United States, however (Figure 8). Nationwide, we predict that COVID-19 herd immunity would only occur after a death toll of 340, 000 *±* 61, 000, or 1058 *±* 190 per million of population.

We found that the approach to the herd immunity threshold is not direct, and that social mobility restrictions and other non-pharmaceutical interventions must be applied carefully to avoid excess mortality beyond the threshold. In the absence of social distancing interventions, a typical epidemic will “overshoot” the herd immunity limit (e.g., Handel, Longini Jr, and Antia, 2007; Fung, Antia, and Handel, 2012) by up to 300%, due to ongoing infections (Figure 6). At the other extreme, a very strict “shelter in place” order would simply delay the onset of the epidemic; but if lifted (see Figures 6 and 7), the epidemic would again overshoot the herd immunity threshold. A modest level of social distancing, however — e.g., a 33% mobility reduction for the average US county — could lead to fatalities “only” at the level of herd immunity. Naturally, communities with higher population density or other risk factors (see Figure 5), would require more extreme measures to achieve the same.

Avoiding the level of mortality required for herd immunity will require long-lasting and effective non-pharmaceutical options, until a vaccine is available. The universal use of face masks has been suggested for reducing the transmission of SARS-CoV-2, with a recent meta-analysis (Chu et al., 2020a) suggesting that masks can suppress the rate of infection by a factor of 0.07–0.34 (95% CI), or equivalently Δ ln(transmission) = −1.9 *±* 0.4 (at 1*σ*). Using our model’s dependence of the infection rate constant on mobility, this would correspond to an equivalent social mobility reduction of . Warmer, more humid weather has also considered a factor that could slow the epidemic (e.g., Wang et al., 2020a; Notari, 2020; Xu et al., 2020a). Annual changes in specific humidity are (Figure 10b in Supplementary Material), which can be translated in our model to an effective mobility decrease of . Combining these two effects could, in this simple analysis, yield a modestly effective defense for the summer months:. Therefore, this could be a reasonable strategy for most communities to manage the COVID-19 epidemic at the aforementioned −33% level of mobility needed to arrive at herd immunity with the least excess death. More stringent measures would be required to keep mortality below that level. Of course, this general prescription would need to be fine-tuned for the specific conditions of each community.

## Discussion and Conclusions

By simultaneously considering the time series of mortality incidence in every US county, and controlling for the time-varying effects of local social distancing interventions, we demonstrated for the first time a dependence of the epidemic growth of COVID-19 on population density, as well as other climate, demographic, and population factors. We further constructed a realistic, but simple, first-principles model of infection transmission that allowed us to extend our heuristic linear model of the dataset into a predictive nonlinear model, which provided a better fit to the data (with the same number of parameters), and which also accurately predicted late-time data after training on only an earlier portion of the data set. This suggests that the model is well-calibrated to predict future incidence of COVID-19, given realistic predictions/assumptions of future intervention and climate factors. We summarized some of these predictions in the final section of Results, notably that only a small fraction of US counties (with less than ten percent of the population) seem to have reached the level of herd immunity, and that relaxation of mobility restrictions without counter-measures (e.g., universal mask usage) will likely lead to increased daily mortality rates, beyond that seen in the Spring of 2020.

In any epidemiological model, the infection rate of a disease is assumed proportional to population density (Jong, Diekmann, and Heesterbeek, 1995), but, to our knowledge, its explicit effect in a real-world respiratory virus epidemic has not been demonstrated. The universal reach of the COVID-19 pandemic, and the diversity of communities affected have provided an opportunity to verify this dependence. Indeed, as we show here, it must be accounted for to see the effects of weaker drivers, such as weather and demographics. A recent study of COVID-19 in the United States, working with a similar dataset, saw no significant effect due to population density (Hamidi, Sabouri, and Ewing, 2020), but our analysis differs in a number of important ways. First, we have taken a dynamic approach, evaluating the time-dependence of the growth rate of mortality incidence, rather than a single static measure for each county, which allowed us to account for the changing effects of weather, mobility, and the density of susceptible individuals. Second, we have included an explicit and real-time measurement of social mobility, i.e., cell phone mobility data provided by Google (Fitzpatrick and Karen, 2020), allowing us to control for the dominant effect of intervention. Finally, and perhaps most importantly, we calculate for each county an estimate of the “lived” population density, called the population-weighted population density (PWD) (Craig, 1985), which is more meaningful than the standard population per political area. As with any population-scale measure, this serves as a proxy — here, for estimating the average rate of encounters between infectious and susceptible people — but we believe that PWD is a better proxy than standard population density, and it is becoming more prevalent, e.g., in census work (Dorling and Atkins, 1995; Wilson, 2012).

We also found a significant dependence of the mortality growth rate on specific humidity (although since temperature and humidity were highly correlated, a replacement with temperature was approximately equivalent), indicating that the disease spread more rapidly in drier (cooler) regions. There is a large body of research on the effects of temperature and humidity on the transmission of other respiratory viruses (Moriyama, Hugen-tobler, and Iwasaki, 2020; Kudo et al., 2019), specifically influenza (Barreca and Shimshack, 2012). Influenza was found to transmit more efficiently between guinea pigs in low relative-humidity and temperature conditions (Lowen et al., 2007), although re-analysis of this work pointed to absolute humidity (e.g., specific humidity) as the ultimate controller of transmission (Shaman and Kohn, 2009). Although the mechanistic origin of humidity’s role has not been completely clarified, theory and experiments have suggested a snowballing effect on small respiratory droplets that cause them to drop more quickly in high-humidity conditions (Tellier, 2009; Noti et al., 2013; Marr et al., 2019), along with a role for evaporation and the environmental stability of virus particles (Morawska, 2005; Marr et al., 2019). It has also been shown that the onset of the influenza season (Shaman et al., 2010; Shaman, Goldstein, and Lipsitch, 2011) — which generally occurs between late-Fall and early-Spring, but is usually quite sharply peaked for a given strain (H1N1, H3N2, or Influenza B) — and its mortality (Barreca and Shimshack, 2012) are linked to drops in absolute humidity. It is thought that humidity or temperature could be the annual periodic driver in the resonance effect causing these acute seasonal out-breaks of influenza (Dushoff et al., 2004; Tamerius et al., 2011), although other influences, such as school openings/closings have also been implicated (Earn et al., 2012). While little is yet known about the transmission of SARS-CoV-2 specifically, other coronaviruses are known to be seasonal (Moriyama, Hugentobler, and Iwasaki, 2020; Neher et al., 2020), and there have been some preliminary reports of a dependence on weather factors (Xu et al., 2020b; Schell et al., 2020). We believe that our results represent the most definitive evidence yet for the role of weather, but emphasize that it is a weak, secondary driver, especially in the early stages of this pandemic where the susceptible fraction of the population remains large (Baker et al., 2020). Indeed, the current early-summer rebound of COVID-19 in the relatively dry and hot regions of the Southwest suggests that the disease spread will not soon be controlled by seasonality.

We developed a new model of infection in the framework of a renewal equation (see, e.g., Champredon, Dushoff, and Earn, 2018 and references therein), which we could formally solve for the exponential growth rate. The incubation period in the model was determined by a random walk through the stages of infection, yielding a non-exponential distribution of the generation interval, thus imposing more realistic delays to infectiousness than, e.g., the standard SEIR model. In this formulation, we did not make the standard compartmental model assumption that the infection of an individual induces an autonomous, sequential passage from exposure, to infectiousness, to recovery or death; indeed, the model does not explicitly account for recovered or dead individuals. This freedom allows for, e.g., a back passage from infectious to noninfectious (via the underlying random walk) and a variable rate of recovery or death. We assumed only that the exponential growth in mortality incidence matched (with delay) that of the infected incidence — the primary dynamical quantity in the renewal approach — and we let the cumulative dead count predict susceptible density — the second dynamical variable in the renewal approach — under the assumption that deaths arise from a distinct subset of the population, with lower mobility behavior than those that drive infection (see Supplementary Material). Therefore, we fitted the model to the (rolling two-week estimates of the) COVID-19 mortality incidence growth rate values, *λ*_{14}, for all counties and all times, and used the per capita mortality averaged over that period, *f*_{D}, to determine susceptible density. Regression to this nonlinear model was much improved over linear regression, and, once calibrated on an early portion of the county mortality incidence time series, the model accurately predicted the remaining incidence.

Because we accounted for the precise effects of social mobility in fitting our model to the actual epidemic growth and decline, we were then able to, on a county-by-county basis, “turn off” mobility restrictions and estimate the level of cumulative mortality at which SARS-CoV-2 would fail to spread even without social distancing measures, i.e., we estimated the threshold for “herd immunity.” Meeting this threshold prior to the distribution of a vaccine should not be a goal of any community, because it implies substantial mortality, but the threshold is a useful benchmark to evaluate the potential for local outbreaks following the first wave of COVID-19 in Spring 2020. We found that a few counties in the United States have indeed reached herd immunity in this estimation — i.e., their predicted mortality growth rate, assuming baseline mobility, was negative — including counties in the immediate vicinity of New York City, Detroit, New Orleans, and Albany, Georgia. A number of other counties were found to be at or close to the threshold, including much of the greater New York City and Boston areas, and the Four Corners, Navajo Nation, region in the Southwest. All other regions were found to be far from the threshold for herd immunity, and therefore are susceptible to ongoing or restarted outbreaks. These determinations should be taken with caution, however. In this analysis, we estimated that the remaining fraction of susceptible individuals in the counties at or near the herd immunity threshold was in the range of 0.001% to 5% (see Supplementary Materials). This is in strong tension with initial seroprevalence studies (Rosenberg et al., 2020; Havers et al., 2020a) which placed the fraction of immune individuals in New York City at 7% in late March and 20% in late April, implying that perhaps 75% of that population remains susceptible today. We hypothesize that the pool of susceptible individuals driving the epidemic in our model is a subset of the total population — likely those with the highest mobility and geographic reach — while a different subset, with very low baseline mobility, contributes most of the mortality (see Supplementary Material). Thus, the near total depletion of the susceptible pool we see associated with herd immunity corresponds to the highly-mobile subset, while the low-mobility sub-set could remain largely susceptible. One could explicitly consider such factors of population heterogeneity in a model — e.g., implementing a saturation of infectivity as a proxy for a clustering effect (Capasso and Serio, 1978; Mollison, 1985; De Boer, 2007; Farrell et al., 2019) — but we found (in results not shown) that the introduction of additional of parameters left portions of the model unidentifiable. Despite these cautions, it is interesting to note that the epidemic curves (mortality incidence over time) for those counties that we have predicted an approach to herd immunity are qualitatively different than those we have not. Specifically, the exponential rise in these counties is followed by a peak and a sharp decline — rather than the flattening seen in most regions — which is a typical feature of epidemic resolution by susceptible depletion.

At the time of this writing, in early Summer 2020, confirmed cases are again rising sharply in many locations across the United States — particularly in areas of the South and West that were spared significant mortality in the Spring wave. The horizon for an effective and fully-deployed vaccine still appears to be at least a year away. Initial studies of neutralizing antibodies in recovered COVID-19 patients, however, suggest a waning immune response after only 2–3 months, with 40% of those that were asymptomatic becoming seronegative in that time period (Long et al., 2020). Although the antiviral remdesivir (Beigel et al., 2020; Grein et al., 2020; Wang et al., 2020b) and the steroid Dexametha-sone (Horby et al., 2020) have shown some promise in treating COVID-19 patients, the action of remdesivir is quite weak, and high-dose steroids can only be utilized for the most critical cases. Therefore, the management of this pandemic will likely require non-pharmaceutical intervention — including universal social distancing and mask-wearing, along with targeted closures of businesses and community gathering places — for years in the future. The analysis and prescriptive guidance we have presented here should help to target these approaches to local communities, based on their particular demographic, geographic, and climate characteristics, and can be facilitated through our online simulator dashboard. Finally, although we have focused our analysis on the United States, due to the convenience of a diverse and voluminous data set, the method and results should be applicable to any community worldwide, and we intend to extend our analysis in forthcoming work.

## Data Availability

All the data used in this work have been collected from the following publicly available repositories: https://www.wolfram.com/covid-19-resources/ https://www.google.com/covid19/mobility/ https://github.com/nytimes/covid-19-data https://github.com/CSSEGISandData/COVID-19 https://www2.census.gov/programs-surveys/popest/datasets/ https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html ftp.ncdc.noaa.gov https://ghsl.jrc.ec.europa.eu/ghs_pop2019.php https://www.census.gov/data/tables/time-series/demo/popest/

## Supplementary Material

### Data & Methods

#### Datasets, Resources, Definitions

All data for cases and mortality, demographics, mobility, and weather were incorporated into the publicly-available Wolfram COVID-19 Dataset and the Wolfram|Alpha Knowledgebase (*COVID-19 Data & Resources* 2020). COVID-19 confirmed case and mortality data were obtained from both the NYTimes and Johns Hopkins University Github repositories (Johns Hopkins University, CSSE, 2020; The New York Times, 2020); the former was used for the analysis initial case data in metropolitan regions, while the average of the two data sets was used for all other analyses. In each case, daily confirmed counts were utilized. Demographic data by county, including people per household, estimated 2019 population, annual births, and annual deaths were obtained from the US Census 2019 *County Population Estimates* data set (United States Census, 2019a). Median ages were determined from the US Census 2018 *County Characteristics Resident Population Estimates* data set (United States Census, 2018). For the Median Age, Wolfram|Alpha has curated the raw data from *United States Census Bureau, American Community Survey 5-Year Estimates: B01002, the Median Age By Sex, American FactFinder*; for the People per Household and Annual Death, the source of curated data is *United States Census Bureau, State & County QuickFact*s. County outline polygons were obtained from the US Census 2019 *TIGER/Line shapefiles* database (United States Census, 2019c). Local weather data (Figure 10 was obtained from the NOAO *Global Surface Summary of the Day* (GSOD) database (National Oceanic and Atmospheric Observatory, 2020). The nearest WBAN station with daily dew point and pressure values (for calculation of specific humidity), and daily average temperature was chosen for each county or metropolitan region. Weather data was averaged over a two-week period for *λ*_{14}, and over a window equal to the growth period for metropolitan regions. Google’s *COVID-19 Community Mobility Reports* dataset (Fitzpatrick and Karen, 2020), specifically “Workplace mobility,” was used to estimate the human social mobility in each county (Figure 10).

Population-weighted population density (or, population weighted density, PWD) (Craig, 1985; Wilson, 2012; Dorling and Atkins, 1995), was calculated using the Global Human Settlement Population raster dataset (European Commission Joint Research Centre, 2019), which contains 250 m-resolution population values worldwide, taken from census data. The value of PWD for a county — or for a set of counties, in the metropolitan region analysis — was calculated as the population-weighted average of density over all (250 m)^{2}-area pixels contained within the region, i.e.,
where *p*_{j} is the value (i.e., the population) of the *j*th pixel, *a*_{j} = 0.0625 *k*m^{2} is the area of each pixel (the GHS-POP image uses the equal-area Molleweide projection), and ∑_{i}*p _{i}* is the total population of the region. This measure has also been called the

*lived population density*because it is the population density experienced by the average person.

In high density counties, the population weighted density PWD is close to the mean density of the county *D* = Pop*/*Area, suggesting a uniform distribution of population (see Figure 11). However, in lower density counties, the mean density is much lower than the population weighted density, due to heterogeneous dense pockets of population amidst vast empty spaces outlined by political boundaries. To represent the degree to which the population density changes across the region (county or metropolitan region) we define the *population sparsity index, γ*. Assuming that the population-weighted population density declines approximately as a power law with “pixel” area, PWD , we define:

In other words we estimate the assumed power-law decline using two data points. The distribution of *γ* and its correlation with county population and population density are shown in Figure (12). We can see that *γ* ranges from 0.09 (i.e., very uniform) for the most populous/dense counties to 0.88 (i.e. very sparse) for least populated/dense counties. For reference, the value of *γ* for New York City is 0.14.

#### Initial growth of confirmed cases for metropolitan regions

For each of the top 100 metropolitan regions (United States Census, 2019b), a logarithmic-scale population heat map, windowed from the full GHS-POP raster image, was used to select a minimal connected set of US counties enclosing the region of population enhancement. In this process, overlap and merger reduced the total number of metropolitan regions under consideration to 89.

As is discussed in the main text, nearly every metropolitan region saw, in mid-March 2020, an exponential increase of daily confirmed cases, followed by a flat/plateau period of nearly constant daily confirmed cases. In a few cases, the second phase — primarily caused by the country-wide lockdown — lasted only days or weeks (possibly signaling a depletion of the susceptible population, see discussion in main text), but for most metropolitan regions the plateau persisted for months (indeed, persists or is again increasing at the time of this writing). Thus, the initial value of the exponential growth rate, *λ*, of daily confirmed cases could be reliably and automatically estimated by fitting the case numbers to a logistic function
where *t*_{0} represents the transition time from exponential growth to a constant, *f*_{max} is the plateau value in case numbers, and *f*_{logistic}(*t*) *∝* exp[*λt*] for *t ≪ t*_{0}. Fits were performed on the logarithm of the case numbers, yielding the maximum likelihood estimation of parameters under the assumption lognormally-distributed errors (an analysis of the fit residuals, not shown, confirmed this assumption: case number fluctuations exhibit a variance far in excess of Poisson noise, but are well modeled by a log-normal probability density function with constant width), and associated estimates of the variance in each parameter. To avoid polluting the exponential growth phase with singular early cases, a “detection limit” of 3 was imposed, and all daily case values less than or equal to that limit were ignored in fitting. The only manual intervention required for this fit was the specification of the upper limit of its range, i.e., the *end* of the plateau region, for each data set.

To analyze the effect of demographic, population, mobility, and weather variables on this initial growth rate, we perform a weighted linear regression to the lambda values (and their standard errors) of the 89 metropolitan regions.

To choose representative cities for the visual examples in Figures 1A and 1C, we performed an additional logistic fit to the mortality incidence data of each region and retained for Figure 1C only those that had (1) less than 15% error in both growth rates, and (2) |*λ*_{case} −*λ*_{death} |< 0.15*d*^{−1}. This was done in an effort to specifically comment on or highlight only those cities for which the growth rate was accurately determined, and was well correlated with the more reliable measure, mortality growth, that we used for the remainder of the analysis.

#### Linear Dynamical Model of Mortality Data

A standard weighted least squares analysis was performed on the measured exponential growth rate, *λ*_{14}, as a function of demographic, mobility, population and weather variables, with weights equal to inverse root of the estimated variance.

#### Nonlinear Model

We construct a model where, in the standard analogy to chemical reaction kinetics, the incidence of infections per unit area at time, *i*(*t*), is proportional to the product of the density of susceptible individuals, *S*(*t*), and the density of infected individuals, *I*(*t*). But, we allow for the rate constant for infection^{3} in the encounter, *β*, to depend on the infected individual’s “stage” of infection, *C*, with *C* = 0 immediately following infection. The incidence then has the form:
where ℐ (*C, t*) is the density of infected per stage at time *t*, and the first equality expresses that we neglect changes

to the susceptible population by all means other than infection. The density of infected individuals is found by integration over the stages of infection,

If the rate constant were taken to be independent of stage, i.e., , we would obtain the familiar expression . We will assume spatial homogeneity and that the total density of individuals is constant and equal to *S*(0) for a particular region, but, that the density could vary when comparing different regions.

We assume that an infected individual’s evolution through the stages of infection, *C*, follows a Gaussian random walk in time, but modulated by an exponential rate, *d*, of death or recovery. Therefore we have
where *a* is the “age” of an infection (time since infection), and the probability density function for the stage at a given age is given by
where *τ* is the characteristic time scale of the random walk^{4}. Integrating the expression for ℐ (*C, t*) over all stages and taking the derivative with respect to time yields the familiar expression *İ*(*t*) = *i*(*t*) - *dI*(*t*), showing that the model reduces to the SIR model if a stage-independent rate constant, , is assumed.

As we show here, using the random walk to specify the dependence of infection stage on time allows for both a non-exponential distribution of delays to infectiousness (which is more realistic than that of the simplest model with incubation, the SEIR model) and a formal solution for the exponential growth rate. Inserting the expression for ℐ (*t, C*) into the incidence equation yields
which is in the form of a *renewal equation* (Heesterbeek and Dietz, 1996; Champredon and Dushoff, 2015; Champredon, Dushoff, and Earn, 2018), with the bracketed expression being the expected infectivity of an individual with infection age *a*. To obtain the simplest nontrivial incubation period, we assume that — where Θ(*x*) is the Heaviside step function — meaning that an infected individual is only infectious once they reach stage *C* = 1, and the infection rate constant is otherwise unchanging. This implies that the incidence is
where
is the probability that an individual infected *a* time units ago is infectious.

If we now assume that the density of susceptibles is constant over some interval of time, and that the incidence grows (or decays) exponentially in that interval, *i*(*t*) = *A*e^{λt}, we find
which, assuming *λ* + *d >* 0 (i.e., the exponential growth rate cannot go below −*d*), can be integrated to obtain

This expression for (*λ* + *d*) has a formal solution in terms of the *Lambert W-function*, with simple asymptotic forms:

For the early stages of the epidemic, when we can assume that the population of susceptibles is approximately constant and large, we see that the growth rate depends approximately linearly on the square of the logarithm of the density. In later stages, when either the base contact rate declines due to social distancing interventions, or the population of susceptibles decreases, we see that the exponential rate takes the value *λ ≈* −*d*.

In practice, we utilize the exact Lambert W-function expression as our “nonlinear model” for fitting *λ*_{14}, where we parameterize *β* and *τ* by the demographic, population, and weather variables (see main text). To estimate the susceptible density, , in this procedure we must use the reported mortality statistics. Thus far we have not specified the dynamics of death. We now make the assumption that the probability of death increases proportionally to the number of exposures an individual experiences. As we prove in a separate section, below, this implies that the susceptible density is related to the fraction of dead in the community, *f*_{D} = *D*_{tot}*/N* (where *D*_{tot} is the cumulative mortality and *N* is the total population), by *S*(*t*) = *S*(0) exp [−*C*_{D} *f*_{D}].

The *basic reproduction number, R*_{0}, and the distribution of *generation intervals, g*(*t*_{g}), are defined (Champredon and Dushoff, 2015; Nishiura, 2010) through the function *F*(*a*):

The generation interval (or, generation time), *t*_{g}, is the time between infections of an infector-infectee pair, and is often estimated from clinical data by the *serial interval*, which is the time between the start of symptoms (Britton and Scalia Tomba, 2019), and the basic reproduction number is the average number of infectees produced by a single infected individual, assuming a completely susceptible population. These quantities can be calculated exactly for our model, as
and
where the expected value and variance of the generation interval are then:

## Extended Results and Analysis

### Relation between the remaining susceptible density, *S*(*t*) and the death fraction, *f*_{D}(*t*)

In epidemic models the infection of susceptible individuals is typically determined by
where *I*_{*} is the density of infectious (contagious) individuals, and for our model, *βSI*_{*} is the right-hand side of Eqn. 7. This can be solved, formally, as:

Alternatively, the susceptible density can be expressed in terms of the cumulative number of infected individuals, *I*_{tot}, i.e.,
where *f*_{I} (*t*) = *I*_{tot}*/N*, with *N* the total population. When fitting the exponential growth rate of mortality, *λ*_{14}(*t*) to our nonlinear model, Eqn. (16) (see main text), we must estimate the value of the susceptible density driving growth at that time. Without any reliable information about the true infected or infectious populations, we must do so using the mortality statistics. We show here how the previous two equations can be used, along with reasonable assumptions about distinct sub-populations driving infection and death, to determine a relationship between the reported cumulative mortality (per capita) and the remaining susceptible density.

Our basic assumption is that there are two different categories of susceptible individuals underlying the dynamics of the epidemic: (A) highly mobile individuals with a large geographic reach that frequently interact with other individuals (in particular, infectious individuals) and thus drive the dynamics of infection (these could be termed “super-spreaders” (Liu, Eggo, and Kucharski, 2020)); and (B) essentially non-mobile individuals that have quite rare contacts with infectious individuals, but have a much higher probability of death once infected, and therefore make up the majority of the mortality burden. The dynamics of each susceptible population is governed by an equation of the form in Eqn. (21), with a common density of infectious individuals, *I*_{*}, but with different rate constants, *β*_{A} *≫ β*_{B}. From Eqn. (22), we see that the susceptible densities of the two populations are then related, at any time, by:

Expressing the non-mobile population in terms of the cumulative fraction infected, we have
and, assuming that the infection fatality rate (IFR) is a constant factor, *f*_{D}(*t*) = IFR × *f*_{I} (*t* −Δ*t*), where Δ*t* is the delay from infection to death, we can write:

Finally, having assumed that the ratio of infection rates is large, we can approximate this as:

The “A” category of individuals, as defined above, are exactly those individuals driving the infection in our model (and, presumably, in the real world), and, therefore, the susceptible density *S*_{A} is exactly that which must be estimated for Eqn. (16). On the other side, with people aged 65 and over accounting for ∼ 80% of COVID-19 deaths, and with approximately ∼45% of deaths linked to nursing homes, the mortality statistics are clearly tracing individuals similar to category “B.” Therefore, we use this relationship,
to estimate the susceptible density in terms of the reported per capita mortality, where we assume *S*(0) is proportional to the population weighted density (PWD).

We also considered the standard approach, in which the population is a single homogeneous group. In that case, the susceptible density could be estimated as

In testing both models, we found that the two-component population scenario was preferred by the data at the ∼10*σ* confidence level, with the homogeneous population model failing to capture the observed dependence of the growth rate on the per capita mortality (Figure 13).

The broader implications of our assumption of two populations is that the required proportion of individuals with immunity for “herd immunity” to take effect, is lower than population with homogeneous mobility characteristics, i.e., the epidemic will slow as a significant proportion of the “super-spreader” category have been infected (category A, above), regardless of the level of infection and immunity in the rest of the population. Indeed, the effect of population heterogeneity on lowering the “herd immunity” threshold for COVID-19 was recently noted (Britton, Ball, and Trapman, 2020), and will be important in interpreting the results of randomized serology tests across the entire population (Havers et al., 2020b).

### Incubation Time

The nonlinear epidemic model described above posits that the incubation of SARS-COV2 virus within an infected individual can be modelled by a stochastic random walk starting at zero, with excursions beyond *±* 1 corresponding to episode(s) of infectiousness. This makes our model distinct from the standard SE^{m}I^{n}R compartmental models (see, e.g., (Champredon, Dushoff, and Earn, 2018)), where the progress of the disease is only in one direction — *E*^{1} → *E*^{2} →… →*I*^{1} →*I*^{2} →… — while in our model (Figure 14), the individual can jump back and forth between different stages (with the obvious exception of Death), with a constant exit rate of *d* for quarantine, recovery, or death. This can be described using a (leaky) diffusion equation:

Based on this picture, and the best-fit parameters to the US county mortality data (Table 2), we can infer the probabilities associated with the different stages of the disease. For example, by looking at the steady-state solutions of Equation (29), we can compute the probability that an exposed individual (who starts at *C* = 0) will ever become infectious (i.e., make it beyond *C* = *C*_{inf} = 1):

This is plotted as a function of the median age of the community in Figure (15a). For example, for the median age of all US counties, *A* = 37.4-yr, we get:

i.e., less than 12% of exposed individuals will ever be able to infect others, although this fraction increases in older communities.

Next, we can compute the distribution of times for the onset of infectiousness, i.e., the incubation period. This can be done by using a first crossing probability of a random walk, which we did by solving Equation (29) using a discrete Fourier series in the (0, +1) interval. The resulting probability density is shown in Figure (15b), again showing a shorter incubation period in older communities.

Finally, we can compute the probability density function for the generation interval, *g*(*t*_{g}), defined in Equation (19) (Figure 15c). This shows a similar qualitative dependence on age as the incubation period, but the median incubation period is, as expected, shorter than the generation interval for each age group. Using Eqn. (20) and our parameterization of *τ*, we find a mean generation interval of

This estimate is much longer than those found by tracking the serial interval (time from between the start of symptoms for an infector-infectee pair) in COVID-19 patients (Ganyani et al., 2020; Nishiura, Linton, and Akhmetzhanov, 2020), which are on the order of 5–10 d. It is possible that the long tail of these distributions, generated by the slow asymptotic exponential decay at rate *d ≈* 0.06*d*^{−1}, raises the mean generation interval, while a clinical study, is necessarily biased toward shorter serial intervals.

### Error Diagnostics and Forecasting COVID-19 Mortality

One of the most pressing questions in any exercise in physical modelling is whether we have a good understanding of the uncertainty in the predictions of the model. While we have an estimate of the measurement uncertainties for the mortality growth rates, *λ*_{14}, which we discussed in the main text, we also should characterize whether the deviation of the best-fit model from the measurements are consistent with statistical errors. To evaluate this, we can look at the average of the ratio of the variance of the model residuals to that of the measurement errors, otherwise known as reduced *χ*^{2}, or . This is shown in Figure (16), demonstrating that we see no systematic error in model that is significantly bigger than statistical errors, across counties with different populations.

As another consistency check, Table (3) examines whether the parameters of the model change significantly from urban counties with large, uniform populations, to rural counties with a small and more sparse population (Figures 11-12). From counties with enough COVID mortality data, roughly those with population ≳10^{6} inhabit half of the total population, which we chose as our threshold, separating large from small counties. We notice no statistically significant difference, and Table (3) even suggests that Fisher errors quoted here might be overestimating the true errors. This comparison brings further confidence in the universality of the nonlinear model across geography and demography.

On average, we find that (either county-weighted or population-weighted) , suggesting that the model errors are only 13% bigger than statistical errors.

We further compare the model-prediction vs measured mortality growth rate in Figure (17) for all our available data. We find that the 1-*σ* error in the model prediction (in excess measurement errors) is on average *σ*_{14} = 0.0180, i.e. 1.8% error in the daily mortality growth rate. This is shown in Figure (17) as the red region, which compares the model prediction with the observed mortality growth rates. We can also see that there appears to be no significant systematic deviation from the predictions, at least for *λ*_{14} < 0.23/day.

Given an understanding of the physical model and its uncertainties, we can provide realistic simulations to forecast the future of mortality in any community, similar to those provided in the main text (Figure 4), which can be made on-demand using our online dashboard: https://wolfr.am/COVID19Dash.

In order to perform these simulations, we follow these simple steps. To predict the daily mortality on day *t* + 1, *D*(*t* + 1), we use the prior 13 days of *D*(*t*), as well as the total mortality up to that point:

Use Equation (1), plugging in prior total mortality, county information, weather, mobility and parameters in Table 2 to find

*λ*_{14}. Every simulation uses a random realization of model parameters (from their posterior fits), which remain fixed through that simulation.Add the random model uncertainty to

*λ*_{14}using: where*g*_{t ′}s are random independent numbers drawn from a unit-variance normal distribution. This captures the model uncertainty mentioned above, while ensuring that it remains correlated across the 14 days that are used to define*λ*_{14}.Having fixed the logarithmic slope for daily mortality

*λ*_{14}, find the best-fit intercept and its standard error for ln[*D*(*t*′) + 1*/*2] for the preceding 13 days, i.e.*t*−12*≤ t ′ ≤t*, which can then be used to find a random realization for ln[*D*(*t*+ 1) + 1*/*2]Advance to next day and return to step 1.

## Acknowledgement

We are indebted to helpful comments and discussions by our colleagues, in particular Bruce Bassett, Ghazal Geshnizjani, David Spergel, and Lee Smolin. NA is partially supported by Perimeter Institute for Theoretical Physics. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Colleges and Universities.

## Footnotes

↵

^{1}The doubling time is ln 2 divided by the exponential growth rate.↵

^{2}See Supplementary Material for more detailed discussion of Error Diagnostics.↵

^{3}In a physical picture of collisions, the rate constant of infection is*β*(*C*) = ⟨*σv*⟩_{eff}(*C*), i.e., the scattering cross section of an encounter between a susceptible individual and an infectious individual in stage*C, σ*, multiplied by their relative velocity,*v*.↵

^{4}More precisely:*C*is the absolute value of the position of a 1D random walker, taking one step every Δ*t*, with step size drawn from a normal distribution with mean zero and variance Δ*t/τ*. The variance of the walker position at time*t*is then*t/τ*