Abstract
We apply a model developed by The COVID-19 Response Team [S. Flaxman, S. Mishra, A. Gandy, et al., “Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries,” tech. rep., Imperial College London, 2020.] to estimate the total number of SARS-CoV-2 infections in the United States. Across the United States we estimate as of 6 April 2020 the fraction of the population infected was 4.8% [3.2%, 8.0%], 39 times the portion of the population with a positive test result. Excluding New York state, which we estimate accounts for over half of infections in the United States, we estimate an infection rate of 2.0% [1.7%, 2.6%].
We include the timing of each state’s implementation of interventions including encouraging social distancing, closing schools, banning public events, and a lockdown / stay-at-home order. We assume fatalities are reported correctly and infer the number and timing of infections based on the infection fatality rate measured in populations that were tested universally for SARS-CoV-2. Underreporting of deaths would drive our estimates to be too low. This model does not include effects of herd immunity; in states where the estimated infection rate is very high - namely, New York - our estimates may be too high.
I. Introduction
The first case of SARS-CoV-2, the virus that causes COVID-19, diagnosed in the United States was discovered on January 21st, 2020 [1]. Since then, limited availability of SARS-CoV-2 tests has interfered with efforts to track the spread of the virus. As of April 6, 2020, according to Johns Hopkins University there were 396,200 reported cases in the United States [2]. The true number of cases could be far higher.
Europe has faced similar constraints on SARS-CoV-2 testing. The COVID-19 Response Team (CRT) at Imperial College London has responded to this challenge, in part, by developing a Bayesian hierarchical model to estimate the true number of infections and the impact of policies intended to slow the epidemic [3]. In this paper, we apply the CRT model to individual states in the United States, plus Puerto Rico and Washington, D.C. We remove states with fewer than 10 reported fatalities. We also reproduce the CRT estimates for 11 European countries with an additional week of data on reported fatalities.
We briefly summarize the CRT model here, but refer to [3] for full details. The model takes three inputs from each modeled population: 1) the age distribution, 2) the time series of known COVID-19 fatalities, and 3) the start dates of four mitigation policies (encouraging social distancing, closing schools and universities, banning public events, and a lockdown / stay-at-home order). It also relies on four values estimated from prior research: 1) the infection fatality ratio (IFR) by age, 2) the serial interval distribution, 3) the infection-to-onset distribution, and 4) the onset-to-death distribution. These values are not extracted from population data based on testing people who are already symptomatic. Instead, they are measured from several cases in which all members of a small population at high risk of infection were tested and observed, whether or not they showed symptoms. We discuss the IFR further in section III.
There are several possible sources of error we do not account for. First, we assume COVID-19 fatalities are reported correctly. If they are underreported, the infection estimates here will be underestimates. Second, the model extrapolates exponential growth rates without reference to the size of the population. If the estimated attack rate approaches the level of herd immunity, the model likely overestimates the total number of infections. Third, we assume the IFR is known exactly. Propagating the error from the IFR would increase the width of our confidence intervals.
The model estimates the baseline reproduction number of the virus without intervention, R_0, and the impact of the considered interventions. It fits one value for each of these variables across all populations simultaneously. The impact of any given intervention is the same for all populations. Most of the interventions were implemented in Europe before the United States. Therefore, the model estimates the impact of these interventions mostly based on the trajectory of the epidemic in Europe. Using the number of fatalities as of a given time, we can infer the number of infections in the past, and extrapolate them to the present using the fitted reproduction numbers under different policy regimes. The model does not use the reported number of infections.
This approach differs from the CRT’s model in [4], which is a detailed SIR model. Instead, this model more closely matches that of the UW IHME [5], which forecasts exponential growth rates after different policies are implemented based on historical growth rates in countries and states that previously implemented similar policies.
II. Results
As in Europe, we estimate infections in the United States are much higher than reported infections. As of 6 April 2020 we estimate the attack rate in the United States was 4.8% [3.2%, 8.0%], 39 times the reported infections of 396,200. The difference between estimated and reported infections varies substantially by state. In New York, there are an estimated 9,000,000 infections, 70 times the reported value of 130,000. Whereas, in Oregon, our estimate of infections is only 8.9 times the reported value.
In Figure 1 and Table 1 we report the estimated fraction of the population infected in each state. The estimated value in New York is so high that it is likely close to herd immunity. If the spread is already slowing in New York due to herd immunity, the model would overestimate the number of infections. The state with the second highest attack rate, New Jersey, at 15.60% [7.12%, 29.29%], is well below herd immunity.
In Table 2, we compare the weighted average attack rate in the United States to the same 11 European countries as in [3].
In each population, the model infers the time series of infections that most likely gave rise to the observed time series of fatalities. In the appendix, in Figure A1 through Figure A52, we show the estimated and reported time series of infections and fatalities. Model parameters are fit simultaneously across all populations. The second panel in each of these figures shows how well the model reproduces the observed fatalities. The gap between estimated and reported fatalities can be interpreted, in part, as a measure of how well the model reproduces the epidemic in each population.
III. Methods
We retrieved dates for policy interventions as follows. We took lockdown dates from the New York Times rounded to the nearest day (lockdowns going into effect past noon are recorded as the next day) [6]. School closure dates are from [7]. Following [3], if a school closure began on a Monday, we set the date of the school closure to the previous Saturday. We set social distancing and the ban of public events to Mar 16, 2020 for all states, when federal guidelines were put into place [8]. See the dates we assign for each region in each state in the Appendix (Table A1). Age distributions for each state come from the US Census Bureau [9]. The time series of reported fatalities and cases in the United States was crowd sourced from Wikipedia [1]. The time series of reported fatalities and cases in the 11 European countries comes from the European Centre of Disease Control (ECDC) [10]. We change the name of the country “Georgia” to “Georgia(country)” in the ECDC data set to avoid confusion with the state of Georgia. In the ECDC data we also delete the entries for Puerto Rico, for which we get duplicate data as a US territory. For the 11 European countries, IFR comes from [3]. All other IFR’s are age-adjusted from [11].
All of our code and input data are available at https://drive.google.com/drive/folders/1DDST4e3B875wxsVApJTKLn94seM5IZbx?usp=sharing.
Data Availability
All of our code and input data are available at the Google Drive link below.
https://drive.google.com/drive/folders/1DDST4e3B875wxsVApJTKLn94seM5IZbx?usp=sharing