ABSTRACT
The dynamic characterization of the COVID-19 outbreak is critical to implement effective actions for its control and eradication but the information available at a global scale is not sufficiently reliable to be used directly. Here, we integrate multiple data sources through dynamical constraints to quantify its temporal evolution and controllability around the world and within the United States. Overall, the numbers of actively infectious individuals have remained high beyond targeted controllability, with worldwide estimates of 10.24 million on November 24, 2020, totaling in 266.1 million cumulative infections growing at a rate of 11.12 million new infections per week. The actively infectious population reached a local maximum of 7.33 million on July 16, 2020 and remained virtually stagnant at a global scale, with growth rates for most countries around zero that compensated each other, until reverting to net growth on September 22, 2020. We validated the approach, contrasting with prevalence data and the effects of nonpharmaceutical interventions, and we identified general patterns of recession, stabilization, and resurgence. The diversity of dynamic behaviors of the outbreak across countries is paralleled by those of states and territories in the United States, converging to remarkably similar global states in both cases. Our results offer precise insights into the dynamics of the outbreak and an efficient avenue for the estimation of the prevalence rates over time.
INTRODUCTION
The global spread of the COVID-19 outbreak has had a major global impact. As of November 24, 2020, there have been 1.67 million reported deaths and 72.4 million confirmed cases [1]. Massive travel restrictions, imposed quarantines, lockdowns, and other nonpharmaceutical interventions (NPIs) around the world have been able to slow down the progression of the outbreak, but their gradual lifting has already led to its deadly resurgence in multiple areas [2]. Currently, it is still unclear how best to proceed to balance the economic, societal, ethical, and health trade-offs present. The most widespread quantification, based on the basic reproduction number, is insufficient to address the different trade-offs [3]. It indicates whether the outbreak is growing up or dying out, but it misses crucial information, including the timescales of the processes, the latent potential for a resurgence of the outbreak, and the possibility of actively controlling and tracing the infectious population.
Obtaining a detailed dynamic characterization of the outbreak, however, has proved to be particularly challenging and has been achieved only over small controlled populations. This characterization involved testing for the causative virus SARS-CoV-2, identifying contacts and the infection time, and following up the clinical evolution [4-7]. The available epidemiological information has not been sufficiently reliable to be used directly and the detailed characterization is not feasible at a global scale. Methods based on compartmental models [8] or Bayesian approaches [9] have relied on location-specific information that is not readily available at a global scale.
Here, we uncovered optimal dynamical constraints to integrate the available detailed clinical information with demographic data and epidemiological curves, and we applied them to the quantification of the key determinants of the temporal evolution and controllability of the COVID-19 outbreak, including the number of infectious individuals, the growth rate of the infectious population, and the size of the infected population. We considered explicitly countries in the world and states and territories within the United States (US) with at least 30 COVID-19 reported deaths, which covers 97.4% of the world and 99.9% of the US population.
RESULTS
Optimal dynamical constraints
The approach considers the dynamics of the infectious population, nI(t), at time t described through the expression which establishes the definition of the (per capita) growth rate kG(t). This expression is completely general and independent of the underlying dynamics of the infection.
The underlying infection dynamics dictates the relationship of kG(t) and nI(t) with the different epidemiological quantities. Based on an infection-age structured mathematical description [10, 11], we have developed an approach to uncover the optimal dynamic constraints for these relationships in terms of delays and scaling factors (Supplementary Information).
We find that nI(t) is optimally related to the rate of increase of the expected cumulative deaths, nD, at a later time t + τD according to where IFR is the infection fatality rate and τG is the average generation time. This equation explicitly takes into account that, on average, an infectious individual within nI(t) has been infected at time and potentially dies with probability IFR at a time τI + τOD after infection. Here, is the variance of the generation time, and τI and τOD are the incubation and symptom onset-to-death average times, respectively. The values of these characteristic times have been estimated in days as τG = 6.5, σG = 4.2, τI = 6.4, and τOD = 17.8 from precise follow up of specific groups of patients [5, 12, 13], which leads to (Supplementary Information).
A general assumption of the approach is that the IFR for each age group remains the same for all locations [5] and that the overall IFR for each location is obtained as the average over its specific population’s age distribution.
The growth rate is obtained directly from Eqs. (1) and (2) as which is related to the time-varying reproduction number, Rt, through the Euler–Lotka equation (Supplementary Information).
The expected cumulative number of infected individuals at a time t, nT(t), follows from which is obtained also from the dynamical constraints (Supplementary Information).
To compare with prevalence studies, we used the dynamical constraints (Supplementary Information) to obtain the relationship of the expected number of seropositive individuals, nSP(t), with the infected population, where τSP is the average seroconversion time after infection. We also obtained the relationship of the expected number of positive reverse transcription polymerase chain reaction (RT-PCR) testing individuals, nTP(t), with the infectious population, where τTP is the average time for testing positive after infection and ΔtTP is the average number of days of positive testing. The values of these additional characteristic times have been estimated in days as τSP = 13.4, τTP = 14.4, and ΔtTP = 20 from clinical studies [13-16] (Supplementary information).
Implementation
Eqs. (1) – (4) completely characterize the dynamics of the outbreak from the expected cumulative deaths, the age structure of the population, and the general clinical parameters of the infection. We used a workflow (Fig. 1A) that incorporates explicitly the death counts compiled by the Johns Hopkins University Center for Systems Science and Engineering [1], the age structure reported by the United Nations for countries [17] and the US Census for states and territories [18], previously estimated aged-stratified IFR [5], and other previously estimated clinical parameters [5, 12-16]. The expected number of daily deaths was inferred using density estimation from the death curves after preprocessing to minimize reporting artifacts (Supplementary Information). Eqs. (5) and (6) were used to set the appropriate scale and delay to compare with seroprevalence studies.
Validation against prevalence data and NPIs
To validate the approach, we contrasted the estimated infectious and infected populations with the results from available antibody seroprevalence and RT-PCR testing studies for countries in the world and locations in the US with at least 30 reported deaths (Fig. 1B). The observed and estimated values agree with each other within the 95% credibility intervals (CrI), with a 1.54-fold accuracy over almost a 1,000-fold variation and a correlation coefficient (ρ) on a logarithmic scale of ρ = 0.94.
We combined into a separate analysis (Fig. 1C) the data of two additional comprehensive antibody seroprevalence studies within the US, which included 49 [19] and 38 [20] states with non-zero prevalence values. For the states present in the two studies, we considered their average values. The estimated values agree with the observed prevalence with 1.62-fold accuracy and ρ = 0.81. The agreement for the combined data is better than for the data of each of the studies independently (1.65-fold accuracy and ρ = 0.74 for one study [19] and 1.74-fold accuracy and ρ = 0.77 for the other study [20]) and better than the agreement of both studies with each other (ρ = 0.55 for the 38 states common to both studies), indicating that the estimations of the approach fall within the observed variability of the prevalence studies. Indeed, the approach is effectively unbiased collectively, with the (geometric) average of the estimations being just a factor 1.12 (globally) and 1.14 (US) larger than that of the observations.
Overall, the validation of the approach shows that it can reproduce the available prevalence studies over a factor 615.0 between minimum and maximum values for countries and states with 72.0% (globally) and 79.6% (US) of the estimates within a factor 2 of the observed values, and with 100.0% (globally) and 93.9% (US) within a factor 3. There are no countries and only three states with values outside the factor 3 boundary. In the cases of Rhode Island and New Hampshire, the estimated infectious population is larger than the ones reported by one study [20], which is consistent with the observed age-specific seroprevalence biased to older populations. In the case of Oregon, the estimated infectious population is smaller than the ones reported [19, 20], which is consistent with the observed age-specific seroprevalence biased to younger populations [20]. In the case of Oregon, in addition, only 338 of the 1123 excess deaths during the outbreak before the collection of the specimens for analysis were attributed to COVID-19 [21].
We also validated the ability of the approach to capture the effects of NPIs (Fig. 1D). The inferred timings of the peak infectiousness (maximum infectious population) are concordant with the dates of the major early country-wide lockdowns in Europe [9], with an overall average deviation of 0.0 days between lockdown and peak dates and a mean absolute deviation of 2.4 days. This concordance also indicates that there is only a small variability in the average timing between infection and death among those countries.
Outbreak progression motifs
We analyzed explicitly the time evolution of the growth rate, the infectious population, the cumulative number of infections, and the relationship between growth rate and infectious population for all countries and states with at least 30 deaths (Supplementary Fig. S1 and Table 1).
The characterization of the dynamics in terms of the growth rate and infectious population (Supplementary Fig. S1) shows that there are prototypical types of behavior, or motifs (highlighted with representative examples in Fig. 2). Locations either contained (Figs. 2A-2C) or amplified (Figs. 2E-2F) the outbreak in its initial stages. In the case of initial containment, the behavior branched subsequently into contained sporadic resurgences (Fig. 2A), uncontrolled resurgence of the infectious population over the initial maximum infectiousness (Fig. 2B), and sustained decrease of the outbreak (Fig. 2C). The specific behavior depended on the success of the measures implemented, e.g. targeted control and moderate lockdowns, and their subsequent relaxation [22]. In the case of initial amplification, the dynamics proceeded in diverse ways, including an increasing infectious population converging to zero growth (Fig. 2D), a fast-evolving infectious population switching from positive to negative growth (Fig. 2E), and fast convergence to subsequent sustained residual growth (Fig. 2F). In general, the locations that reached a substantial negative growth rate are those that implemented long-term strict lockdowns, whereas zero or small positive growth rates correspond to intermediate measures with partial restrictions [22]. In many cases, lifting restrictions has led to fast switching from negative to positive sustained growth, as illustrated by United Kingdom, and Italy (Fig. 2E), indicating the inability of these locations to contain the outbreak even at low values of the infectious population.
Global dynamics
At a global scale, the outbreak is characterized by two early local maxima of the overall worldwide infectious population on January 25, 2020 and March 25, 2020, coincidental with the regression of the outbreak in China, initially, and in Europe, afterward, reaching the highest local maximum of 7.33 million active infections (Figs. 3A and 3B) on July 16, 2020 with a subsequent increasing trend since September 22, 2020. In the US, there has been an overall maximum of the infectious population on March 28, 2020 with 1.41 million active infections and a local maximum on July 15, 2020 (Figs. 3D and 3E). The estimated infected population is 266.1 million, growing at a rate of 11.12 million new infections per week, for the World (Fig. 3C) and 31.8 million, growing at a rate of 1.50 million new infections per week, for the US (Fig. 3F), which is a factor 3.7 for countries in the World and 2.5 for locations in the US higher than the corresponding reported cases [1].
State of the outbreak across locations and its controllability
At a local scale, the per capita infectious population of countries and states has only exceptionally crossed the 1% value (11 out of 153 countries and 11 out of 53 states) (Fig. 4). The infected populations have surpassed the 10% of the total population only for 20 countries and 18 states (Fig. 4), with just one country and one state reaching the 20% value. These results confirm that relying on herd immunity is not a realistic option. Controlling the outbreak by actively tracing and isolating newly and potentially infected individuals has been successfully implemented in South Korea, with the use of large resources and with occasional outbreaks that required short-term extended human-interaction restrictions, and almost successfully in Japan, with voluntary business closures and other restrictions [22]. These countries have always remained below 7 actively infectious cases per 100,000 individuals (0.007% of their population), with average values of 2.2 (South Korea) and 2.5 (Japan) since March 1, 2020. None of the states and 31 of the countries analyzed have had a per capita number of infectious individuals below the average value of South Korea since May 1, 2020, which indicates that 61.1% of the World (79.5% excluding China) and 99.9% of the US human population reside in countries or states that have not allowed targeted controllability of the outbreak with the resources used by South Korea. This inability to control the outbreak as soon as restrictions are lifted, even at low values of the infectious population but above the South Korea average value, is illustrated by United Kingdom and Italy (Fig. 2E).
From local to global dynamics
The collective properties of the individual local dynamics, as quantified by the distribution of growth rates across countries and states over time (Fig. 5), shows a progressive double stabilization of the outbreak before a subsequent growth period starting in late September 2020. At a local scale, growth rates for most countries, whether positive or negative, decreased in absolute value, leading to a slowdown of the dynamics. At a global scale, positive and negative values have converged towards statistically compensating each other, decreasing further the overall net growth of the infectious population. Specifically, the latest estimates of the growth rates on November 2, 2020 for countries in the world have a median value of 0.002/days, with 80% of the countries within the narrow range of values from −0.014/days to 0.022/days, which implies a slow local dynamics with half-lives and doubling times of 48.5 and 30.8 days, respectively. For locations in the US, the median value of the growth rates is 0.017/days, with 80% of locations ranging from 0.005/days (138.2 days doubling time) to 0.024/days (28.7 days doubling time).
DISCUSSION
The dynamical constraints we have obtained through a detailed infection-age mathematical description of the outbreak allowed us to find the optimal time delays and scaling factors to connect the evolution of the reported death counts over time with those of the infectious and infected populations. Overall, integrating these constraints through a workflow with essential preprocessing steps showed that the approach can consistently infer the precise timing of NPIs and estimate prevalence data across countries in the world and locations in the US. A prominent feature of the approach is its ability to provide reliable results even for low death counts, which overcomes the major limitations of choosing between unreliable infection case data (highly dependent on testing rates) or noisy death counts as input to the inference problem [9].
The approach assumes a general age-stratified IFR. In general, these quantities are expected to depend potentially on specific features of the population and the medical care facilities available. The available studies show a minimal variability among different countries and other locations that reported on prevalence [23]. It also assumes an age-uniform exposure (attack rate), which is consistent with data for other respiratory diseases [5] and holds to a large extent when there is information available for COVID-19 [19, 24, 25]. We have also assumed a constant generation interval typical of non-confinement locations, which has been observed to shorten in some cases by NPIs [26]. Prevalence studies can also depend on the diminishing antibody levels after infection [14, 27], collecting and processing specimens for analysis [28], and potential biases towards specific population groups [19]. In addition, there might be a degree of under-reporting of COVID-19 deaths, as suggested by excess mortality not attributable to other causes than COVID-19 [21]. Our results show that all these potential deviations on the assumptions, on the data, and on prevalence studies collectively have only a restricted impact on the approach, with 72.0% (globally) and 79.6% (US) of the estimates within a factor 2 of the observed values and 100.0% (globally) and 93.9% (US), within a factor 3. This accuracy of the estimations is highly remarkable in the context of the observed prevalence spread over a factor 615.0 between minimum and maximum values, ranging from 0.02% to 12.30%.
The analysis in terms of the growth rate–infectious population trajectories has revealed universal types of behavior of the outbreak for countries around the world and locations within the US. This information can be used to anticipate the response to enacting, modifying, or lifting NPIs. The most marked example is the response to strict lockdowns across countries in Europe (e.g. United Kingdom, Italy, Belgium, Spain, France, Germany, Austria, Netherlands, and Switzerland) and Northeast states in the US (e.g. New York, New Jersey, Massachusetts, Pennsylvania, and District of Columbia). They followed the same type of behavior (fast decrease of the growth rate from high to sustained negative values) until major restrictions were lifted in the European countries [22], turning the growth rate of their infectious populations into highly sustained positive values. Our results show that those European countries had small actively infectious populations but not as small as required for targeted controllability. They also show that most Northeast states in the US are in a similar resurgence path but at much earlier stages, with many of their NPIs still in place and with markedly smaller growth rates, which makes their reaching as deadly a resurgence as in Europe still avoidable.
At a global scale, the outbreak has reached a net growth rate fluctuating near zero values but with a high infectious population. A similar state has also been present in the US. This type of fluctuating states, with long stagnant overall infectious population periods and median growth rate close to zero, is expected of bounded unsynchronized fluctuating populations [29], such as those from uncoordinated locations aiming at just preventing an unrestricted expansion of the outbreak rather than at its eradication. This widespread feature is present for both countries in the world and locations within the US. Considering the NPIs implemented [22], our results show that there have been locations with interventions to move the growth rate towards zero values and that there have been locations switching on and off severe measures to decrease temporarily the active infectious population. Despite not growing substantially since reaching its highest local maximum of 7.33 million active infections on July 16, 2020, the high value of the global infectious population attained is currently leading to 11.12 million new infections per week that replace the same ballpark number of individuals that stop being infectious. This high turnover makes the control of any potential resurgence extremely costly.
At a local scale, our results show a highly variable temporal evolution of the infectious populations, both over time for each location and across locations. Having an up-to-date estimate of the infectiousness of populations would allow policymakers to better implement travel planning among locations. The approach has proven to accurately track the effects of local NPIs. We also expect it to play a fundamental role in evaluating the progress of vaccination efforts, especially considering the challenges present, such as waning immunity levels and pathogen evolution [30].
Data Availability
Details for accessing the data sources are provided in the manuscript.
SUPPLEMENTARY INFORMATION
Methods; Supplementary Table S1; Supplementary Fig. S1.
DATA AVAILABILITY
Datasets of the trajectories of the dynamic state of the COVID-19 outbreak for all the countries in the world and states and territories in the United States analyzed are available at https://github.com/Covid19Dynamics/trajectories.
SUPPLEMENTARY INFORMATION
MATHEMATICAL DESCRIPTION
Infection-age structured dynamics
We consider the infection-age structured dynamics of the number of individuals uI(t, τ) at time t who were infected at time t − τ given by with boundary condition Here, τ is the time elapsed after infection, referred to as infection age, and is the incidence, with kI(t, τ) being the rate of secondary transmissions per single primary case [10, 11].
The solution is obtained through the method of characteristics as for t ≥ τ and uI(t, τ) = 0 for t < τ. The resulting renewal equation, , is used as the basis for the definitions of the reproduction number and the probability density of the generation time .
The infectious population is given by which considers that an individual remains potentially infectious after a time τ from infection with probability Therefore, in terms of the incidence, we have Additionally, we consider the expected cumulative number of infections, nT(t), expressed in terms of the overall accumulated incidence as and the dynamics of the expected cumulative deaths, nD(t), which takes into account that deaths occur with probability given by the infection fatality rate, IFR, at times after infection given by the convolution of the probability density functions of the incubation, fI, and symptom onset-to-death, fOD, times.
Similarly, the variation of the expected number of seropositive individuals at a time t, nSP(t), is expressed as where fSP is the probability density function of the seroconversion time after infection, and the expected number of individuals with positive RT-PCR testing nPT(t), as where PTP(τ) is the probability that an infected individual would test positive at a time τ after infection.
Dynamical constraints
To obtain a closed set of equations from the expressions involving an integral of a function A with the incidence j, we perform a series expansion of the incidence around the infection-age time τA, with the value of τA chosen as The specific value of τA leads directly to a first-order approximation, because .
Using this approach, we obtain where τG and are the average and variance of the generation time, respectively, and where τI and τOD are the incubation and symptom onset-to-death average times, respectively, which leads to up to 𝒪(j″). These expressions are used to estimate the infectious population nI(t) from the daily deaths, , at time and the cumulative infected population nT(t) from the cumulative deaths, nD, at time t + τI + τOD, leading to Similarly, we obtain where τSP is the average seroconversion time after infection and ΔtTP is the average number of days an individual tests positive, which up to 𝒪(j″) leads to Combining Eqs. (22) and (23) with Eqs. (18) and (19) leads to which is used to validate the values of the estimated infectious population nI(t) from RT-PCR testing results, nTP, at time and the cumulative infected population nT(t) from seropositivity testing, nSP, at time t + τSP.
APPROACH WORKFLOW
Expected deaths
The raw cumulative death counts over time, nW(t), are obtained from the Johns Hopkins University Center for Systems Science and Engineering [1] for countries and for US locations.
The daily death counts ΔnW(t) = nW(t) – nW(t − 1) are considered to contain reporting artifacts if they are negative or if they are unrealistically large. This last condition is defined explicitly as larger than 4 times its previous 14-day average value plus 10 deaths, , from a non-sparse reporting schedule with at least 2 consecutive non-zero values before and after the time t, .
Reporting artifacts identified at time t are considered to be the result of previous miscounting. The excess or lack of deaths are imputed proportionally to previous death counts. Explicitly, death counts are updated as with for all i ≥ 0. In this way, ΔnW(t) is assigned its previous seven-day average value.
The expected daily deaths, ΔnD(t), are obtained through a density estimation multiscale functional, fde, as ΔnD(t) = fde(ΔnW(t)), which leads to the estimation of the expected cumulative deaths at time t as . Specifically, With where ma14(·) is a centered moving average with window size of 14 days and rgσ (·) is a centered rolling average through a Gaussian window with standard deviation σ.
Reporting delays
We consider an average delay of two days between reporting a death and its occurrence. This value is obtained by comparing the daily death counts reported for Spain [1] and their actual values [31] from February 15 to March 31, 2020. The values of the root-mean-squared deviation between reported and actual deaths shifted by 0, 1, 2, 3, and 4 days are 77.9, 58.4, 38.5, 58.7, and 88.6 deaths respectively.
Infection fatality rate (IFR)
The infection fatality rate is computed assuming homogeneous attack rate as where IFRa is the previously estimated IFR for the age group a [5] and ga is the population in the age group a as reported by the United Nations for countries [17] and the US Census for states [18].
Clinical parameters
We obtained the values of the average τG and standard deviation σG of the generation time from Ref. [12], of the averages of the incubation τI and symptom onset-to-death τOD times from Refs. [5, 13], and of the average number of days ΔtTP of positive testing by an infected individual from Refs. [14, 16]. The average time at which an individual tested positive after infection τTP was computed as τTP = τI − 2 + ΔtTP/2, where we have assumed that on average an individual started to test positive 2 days before symptom onset. The average seroconversion time after infection τSP was estimated as τI plus the 7 days of 50% seroconversion after symptom onset reported in Ref. [15].
Dynamical constraints implementation with discrete time
We implemented the dynamical constraints to compute the infectious and infected population as outlined in the main text and as detailed in the previous section of this document, using days as time units. Time delays were rounded to days to assign daily values.
The first derivative of the cumulative number of deaths is computed as with ΔnD(t) = nD(t) − nD(t − 1).
The growth rate was computed explicitly from the discrete time series as the centered 7-day difference The 7-day value was chosen to mitigate reporting artifacts.
Confidence and credibility intervals
Confidence intervals associated with death counts were computed using bootstrapping with 10,000 realizations [32]. These confidence intervals were combined with the credibility intervals of the IFR in infectious and infected populations assuming independence and additivity on a logarithmic scale.
Fold accuracy
The fold accuracy, FA, is explicitly computed as where |·| is the absolute value function, is the ith observation, is its corresponding estimation, and N is the total number of observations.
Inference and extrapolation
Because of the delay between infections and deaths, inference for the values of the growth rate and infectious populations ends on November 2, 2020 and for the values of the infected populations ends on October 29, 2020. Extrapolation to the current time (November 24, 2020) is carried out assuming the last growth rate computed.
Reproduction number
The quantities Rt and kG(t) are related to each other through the Euler–Lotka equation, , which considers in the renewal equation . Generation times can generally be described through a gamma distribution with and , which leads to Rt = (1 + kG(t)/β)α for kG(t) > −β and Rt = 0 for kG(t) ≤ −β. In the case of the exponentially distributed limit (α ≃ 1) or small values of kG(t)/β, it simplifies to Rt = 1 + kG(t)τG for kG(t) > −1/τG and Rt = 0 for kG(t) ≤ −1/τG.
SUPPLEMENTARY TABLE
SUPPLEMENTARY FIGURE
ACKNOWLEDGMENTS
J.M.G.V. acknowledges support from Ministerio de Ciencia e Innovación under grant PGC2018-101282-B-I00 (MCI/AEI/FEDER, UE). L.S. acknowledges support from the University of California, Davis.
REFERENCES
REFERENCES
References are supplied in the main text of the manuscript.