Abstract
The SIR epidemiological equations model new affected and removed cases as roughly proportional to the current number of infected cases. The present report adopts an alternative that has been considered in the literature, in which the number of new affected cases is proportional to the α ≤ 1 power of the number of infected cases. After arguing that α = 1 models exponential growth while α < 1 models polynomial growth, a simple method for parameter estimation in differential equations subject to noise, the random-time transformation RTT of Bassan, Meilijson, Marcus and Talpaz 1997, will be reviewed and applied in an attempt to settle the question as to the nature of Covid19.
1 Introduction
The SIR epidemiological model - Kermack & McKendrick [5]
Let S(t) = K − X(t) be the number of susceptible cases at time t, expressed in terms of the number of affected cases X(t) and the possibly unknown parameter K (that may or may not be the population size N usually substituted for K). Let R(t) be the removed cases at time t, dead or recovered.
Let I(t) = S(t) − R(t) be the number of infected cases at time t. The common formulation of the SIR (Susceptible, Infected, Removed) epidemiological model is the variant of the system of ODE where K = N and g(x) = x. The parameter β(t) is generally estimated by smooth local regression of X-increments with respect to the RHS of (1). The basic reproduction number is of primary importance, as its transition from above to below 1 indicates whether the epidemic is spreading or dwindling, whether the number of infected cases I increases or decreases with time.
Equations (2) and (3) as well as the linear appearance of the susceptible cases in (1) are quite straightforward under a stationary regime, but the role of I(t) in (1) needs some attention. The effective vicinity of a susceptible case need not be the entire infected cohort, just as the asymptote of affected cases need not be population size N.
This report will illustrate on the Covid19 2020 data that the SIR system of equations may accept an almost exact solution in which β (and not only γ) is constant, provided g() is modelled as g(x) = xα for some α < 1 as has been repeatedly suggested in the literature (see Bjørnstad, Finkenstädt and Grenfell in their various publications, such as [4] and the references therein).
K is the asymptote to which the number X of affected cases converges, the maximal sub-population size to become affected. This scenario-dependent random object K will be treated as a parameter.
Had the function g, modelling the impact of the current extent of infection on the emergence of new cases, been known, parameter estimation via these differential equations would have provided (unjustifiably) accurate estimates of the asymptote K. Once g(x) = x is rejected in favor of g(x) = xα for some α < 1, this shape parameter α has to be estimated together with (K, β, γ), and the overall effect on K is less clear-cut.
Exponential or asymptotically constant growth?
The cases α = 1 and α < 1 are not just variations on a mathematical abstraction. This dichotomy differentiates in principle between exponential and sub-exponential growth. If population is infinite,
The α = 1 equations i.e., dI(t) = (β − γ)I(t)dt, lead to I(t) = I(0) exp {(β − γ)t}, and then X(t) = X(0) exp {βt}, R(t) = R(0) exp {γt}. This is exponential growth.
The α < 1 equations i.e., admit a constant solution under which are parallel linear functions. I.e., new affected cases and new removed cases are invariant in time, with an invariant level of infected cases. This is asymptotically linear growth of the cumulative number of affected cases, with an initial convex polynomial behavior.
But population size is finite, and the target sub-population may be even smaller. If Equation (4) is replaced by Equation (1) with K re-incorporated and g(x) = xα, solutions invariably show the typical Covid19 behavior of empirical data, number of infected cases I(t) that rather than converge to I0 defined in (9), increase to a maximum (for which I0 is a sort of upper bound) and then decrease, towards zero hopefully.
A word of warning explaining “a sort of”, all parameters are estimated jointly under the given empirical data. It is not clear that these parameters, especially β, apply verbatim to other K-scenarios. Take into account also that there are other built-in assumptions throughout, that recovered cases become immune, and prevailing conditions don’t change. These are issues for further thought.
All countries analyzed in the sequel based on data until early June 2020, with the possible exception of those with insuficient data, display α values far below 1 (and K values far below N). The profile likelihood function of alpha (likelihood function maximized over all other parameters per value of alpha, Murphy and Van Der Vaart [7]) is generally quite flat, so α is somewhat noisily estimated.
At the end of May 2020 most countries and the world as a whole displayed number of infected cases increasing with time, making estimation more speculative than for countries where the basic reproduction number R0 crossed already below 1.
Figure 1, based roughly on Brazil data analysis, displays the logarithm of the theoretical solution to the SIR equations for 500 days with γ = 0.05 and arbitrarily chosen K = 500.000, starting with one affected and infected case. On the left α = 0 and β = 10000, in the middle α = 0.75 and β = 2, on the right α = 1 and β = 0.125. The values of β for α = 0 and α = 1 were calibrated so that the maximal number of infected cases will mimic that of the figure in the middle.
Figure 2 displays the logarithms of the empirical number of affected and infected cases in France, Germany and Italy. The observed data conform with the sub-exponential growth represented by 0 < α < 1.
Remark on the R0 transition below 1
It is quite popular these days to welcome this transition, wrongly referred to as leaving behind exponential growth in favor of a period of recovery. This transition occurs at the moment when the number of infected cases is at a maximum, where care should be perhaps strengthened rather than relaxed. All graphs in the sequel show that the recovery period until the number of infected cases is at a safer low level “in the current wave”, is long and slow. The USA is at the transition point in early June, but if conditions stay stationary, the current wave is predicted to be still at unsafe levels in November. Germany and Italy are predicted to experience unsafe levels at least until the end of August.
2 The role of α for finite populations
Consider the solution to the SIR equations for K = 500.000 and γ = 0.01. For each value of α from 0 to 1 (inclusive), at intervals of 0.05, β is determined so that the maximal value of Y is . These 21 solutions to the SIR equations are plotted in the left side of Figure 3. It is apparent that α has a strong effect on the time it takes for the disease to become noticeable. The right side of Figure 3 displays exactly the same data, shifted in time so as to start from the day when the number of affected cases X reaches . All 21 functions are similar to each other, and no wonder there is a relatively high variability in the estimation of α. This nuisance parameter is like a chemical catalyzer - important for the statistical analysis but not very relevant for the result.
Initial data, that could have been informative in estimating α, is generally unreliable. Hence, analysis is restricted to the data past a threshold on the number of affected cases X. The positive side is that the conclusions are then quite insensitive to this lack of knowledge about α. The introduction of this rather diffuse α prevents spurious as-if accuracy in the estimation of the other parameters.
In the benefit of focus, we refrain from studying further the possibilities opened by this section. Suffice to stay at this stage that log(β) comes out a perfectly linear function of α. A subject for further study.
3 Preliminary data handling and analysis
Equation (2) expresses the reasonable premise that new removed cases are proportional to the number of infected cases. I.e., the cumulative number R of removed cases should be proportional to the cumulative sum of currently infected cases. A linear regression with slope γ and zero intercept should manifest this relationship. After checking that this is roughly so, the empirically measured affected cases X are kept intact but its division into removed cases R and infected cases I is modified minimally so that the regression relation will hold. The method by which this pre-processing has been done can be found in the Appendix, together with an 8-country illustration of the effect of pre-processing. Figures 4-9 display on the right the raw and pre-processed data for a number of countries. This pre-processing regression provides interim estimates of γ very close to the MLE estimate derived from the RTT method to be described.
The data of Italy and USA show close agreement between raw and pre-processed versions, with the pre-processed version smoothing relatively sharp jumps in the raw data.
The data of Brazil show close agreement, but still at an early stage in the epidemic.
The data of Germany and Switzerland show some disagreement in the number of infected cases. Regression pre-processing can be a tool for reverse engineering. It is apparent from the data that both countries failed to recognize recovered cases until March 25, and from then on Switzerland updated the number of recovered cases every week or so.
A similar reporting problem prevents proper analysis of the Swedish data, that updates the number of recovered cases seven times in the four months of Corona monitoring. It would have been interesting to watch more closely at the epidemic development in a country that imposed no regulatory measures.
The data of Belgium is of special interest. It has been announced that Belgium recorded as affected cases also those under doubt. As a result, it appears as if it holds record high deaths per million, and number of infected cases that is still growing on June 7th, unlike all its neighbors. Pre-processing is in sharp disagreement with the raw data, placing Belgium in the same standard stage as its neighbors, well past the transition to R0 below 1, and suggesting lethality not above standard European levels.
The data of Chile and France in Figure (10) show drastic corrections in the number of recovered cases, rendering analysis difficult. Although Chilean raw and modified data show agreement when the last 10 days are omitted Figure (12), the later drastic corrections place accuracy in doubt.
The data of Iran (Figure (10) and Israel (Figure 11) show clear evidence of exit from steady conditions. However, Israel pre-processed data show (Figure 12) some disagreement with pre-processed data similarly to Germany and Switzerland, already before relaxing restrictions.
Equation (8) states that early in the epidemic for some t0. The parameter α can thus be initially estimated by the one that maximizes the correlation coefficient between (I(t))1−α and time, on a properly chosen time interval. If this correlation coefficient is close enough to 1, the α-model can be adopted, and the date t0 when the epidemic started, can be roughly assessed, together with α and β. Results are omitted in favor of the analysis in the sequel, but the maximal achieved correlation coefficient exceeded 0.98 in all nine countries tried.
Data for analysis consist of the empirical X data and the I and R data modified by pre-processing. The next section will introduce parameter estimation of K, α, β, γ under the SIR equations.
4 RTT applied to Covid19 2020
The paradigm to be adopted is that α is a (possibly location-dependent) characteristic of the condition, while K is a local-scenario parameter that reflects behavior and regulatory measures. It is important to estimate all parameters jointly in order to obtain as accurate an estimate of α, β, γ as feasible. This will allow the calculation of I0 in (9) as an estimated upper bound (corresponding to an infinite population) on the ongoing number of infected cases as well as of the slope of the linear functions in (10), the condition incidence per unit time.
The RTT method to solve differential equations with data subject to noise
The relatively novel theoretical contribution of this report is the RTT method to mimic systems of deterministic differential equations by stochastic counterparts, simpler than, and conceptually and practically different from, the more common stochastic differential equations based on Diffusion processes.
Parameter estimation (β, γ, K, α) will be performed by the Random Time Transformation (RTT) method developed by Bassan, Marcus, Meilijson and Talpaz [2], motivated by the notion of Skew Product in Ergodic Theory (see, e.g., Krengel [6]). Unlike Diffusion methods that place noise vertically, the RTT method adopts the solution to the deterministic system of differential equations, but considers it as evaluated at a random time process that advances on the average like chronological time. This random time is modelled in practice as a Gaussian process (Brownian Motion or Ornstein-Uhlenbeck) and this provides a likelihood model for the estimation of parameters inherent in the SIR (or otherwise) system of ODE. The differential terms (g(I(t)) max(0, 1 − X(t)/K) and I(t)) in equations (1,2) identify the Jacobian term in the likelihood function, for the application of MLE, including both point estimates and standard errors. As will be seen, for fixed K and α the RTT method identifies β and γ directly, without reference to the Gaussian part. The likelihood model to be described thus provides a profile likelihood in terms of K and α only.
Empirical data consist of (X1, R1), (X2, R2), …, (Xn, Rn), from which the infected case totals Yj = Xj − Rj can be inferred and then modified if needed, as described in Section 3. The sequences X and R are assumed or forced to be non-decreasing. The plan to be pursued is to solve the system above of ODE globally, as adequately as feasible, and if this plan succeeds and produces smooth, calculable, functions (x, r, i) that tightly fit the empirical data, it will provide a prediction of the maximal future damage K that X will sustain, as well as of the timing of the transition from increasing to decreasing number of infected cases, or warn that such a transition is not due to happen in the foreseeable future. Furthermore, these functions, that may involve newly defined parameters that epidemiologists have no classical interpretation for, may replace the sparse and noisy empirical data for standard analysis.
With this in mind, here is a detailed description of the RTT method. No attempt will be made to solve the SIR system analytically. Instead, a small increment of time is set, and the ODE is solved numerically as a difference equation. Interpreting as time the indices of the empirical data, M = 100 is a reasonable choice. Numerical methods other than the naïve choice (11) to solve differential equations may be substituted, of which the most obvious is to apply i and K − x in the RHS of (11) evaluated at instead of j − 1. It only makes a negligible difference. The overly exact choice M = 100 reduced the need to address this issue.
Fix the parameters β, γ, K and the function g, initiate functions x and r as X1 and R1 respectively, initiate i as X1 − R1 and proceed with the definition for j ≥ 2 Regular Least Squares essentially estimate parameters by minimizing a sum of squares of the vertical errors X(i) − x(Mi), R(i) − r(Mi), I(i) − i(Mi). Diffusion methods adopt the ODE formulation as drift and incorporate some model for the diffusion term. The Fokker-Planck equations provide transition densities, that define the likelihood function. The RTT idea works instead with horizontal errors:
Define the random time trajectory as starting as T1(1) = 1, T2(1) = 1. For m = 2, 3, …, n, let T1(m) be the smallest j for which x(j) ≥ Xj and let T2(m) be the smallest j for which r(j) ≥ Rj. Better yet, let T1(m) and T2(m) be the linear interpolants between j − 1 and j that achieve the values Xm and Rm exactly.
Now solve for β and γ so that T1(n) = T2(n) = n. That is, incremental time has average 1 in both equations. Define Δ1(m) = T1(m + 1) − T1(m) and Δ2(m) = T2(m + 1) − T2(m), for m = 1, 2, …, m − 1 as the (mean-1) increments of the T1 amd T2 processes.
If population size N was substituted for K and the identity function for g, work is over. For quality-of-fit sanity check, plot the (x, r, i) solution with the (X, R, I) data, that agree at both time endpoints.
Alternatively, consider K and α as free parameters, to be estimated from the data.
The likelihood function induced by the RTT method
View the incremental times (Δ1(m), Δ2(m)) as observations from a bivariate mean-zero Gaussian distribution, and let Σ be their empirical covariance matrix. Up to a multiplicative constant, the normal density evaluated at these data is .
Alternatively, view the processes T1(m) − m and T2(m) − m as bivariate Gaussian random walk bridges. These two models yield equivalent Gaussian density functions. As a result, the simpler as-if i.i.d. formulation is adopted. For details on the random walk bridge covariance function and its inverse, consult [3].
To obtain the likelihood function, the above density must be multiplied by the Jacobian of the transformation. This can be easily seen to be 1 over the product over the sample of the differential terms βg(Xm − Rm) max(0, 1 − Xm/K) and γ(Xm − Rm), perhaps evaluated at and instead of Xm and Rm. It doesn’t make a difference, as the main regularization role of the Jacobian is to penalize the Gaussian Least Squares into producing smooth solutions.
The parameters K and α are MLE-estimated by maximizing the logarithm of this profile likelihood function, and their standard errors (and correlation coefficient, if needed) are estimated as usual, via the empirical Fisher information.
5 Covid19 analysis in six countries
Section 3 hinged on the relation between raw and pre-processed data, as displayed by the right side of Figures 4-9. The left side extends the RTT solutions to SIR a few months into the future, and reports 95% pointwise confidence intervals.
The analysis of the Brazilian data leads to a high value of α, indicating fast, nearly exponential growth. Covid19 started relatively late, and analysis is brought as illustration only.
The analysis of the USA data indicates that the transition to R0 below 1 is around the end of monitoring for this report. Estimation on data until May 25th (Table 2) predicts maximal number of affected cases K = 2.7 million, but it turns out to be an optimistic assessment, revised June 7th (Table 1) to 3.37 million. According to these assessments, if status quo of conditions is maintained, the current wave of the epidemic would still be quite strong in December. Under the same conditions, Germany and Italy would see the epidemic nearly over by September.
The analysis of the Belgian data should be interpreted as a qualitative assessment that the epidemic progress in Belgium is essentially similar to that of its neighbors and casualties have been exaggerated, but it would not be reasonable to express quantitative conclusions.
6 A remark on incremental new cases
In principle, the increments X(i) − X(i − 1) should be independent, Poisson distributed with some local mean, provided by the differential equations. Poisson random variables Z, positive and with variance equal to the mean, should display when this mean is at least 5 or so, the remarkable property that has standard deviation very close to, and fast converging to . Whether or not this theoretical fact is satisfied empirically by the data can be checked without reference to the differential equations. Simply, choose a window size WS such as 5 or 10, do for each i linear regression of F (i − WS : i + WS) on time (i − WS : i + WS), and record the averages and slopes or correlation coefficients of these lines as well as the standard deviations of the residuals. These standard deviations are supposed to estimate , or else assist in data diagnosis or identify a batch-size of arrivals. Italy and USA display weekly seasonality in which significantly less cases are reported on weekends, and delayed reporting could be more the rule than the exception. In any case, the empirical standard deviations are so much bigger than that probabilistic modelling methods based on the Poisson hypothesis are not justified.
7 Appendix 1: Correction of the number of infected cases
Let B be the n by n matrix with zeros above the diagonal and ones on and below the diagonal. Let A be the 2n by n matrix that has γB in the first n rows and γB plus the identity matrix in the last n rows. Let V be a column vector with R in the top half and X in the bottom half. The regression equation AÎ ≈ V precisely expresses that R should be β times the cumulative sum of the I values, and X should be the latter vector (approximately R) plus I. The “regression coefficients” Î are a compromise to manifest this requirement. Once determined, the removed cases are re-defined as . The initial parts of the vectors and Î are further modified if necessary to prevent negative values or violation of monotonicity of .
Data Availability
The data analyzed in this work is taken from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
Acknowledgements
Thanks are due to Ilan Eshel from prompting this study and to Vahid Bokharaie, Amit Huppert, Eytan Ruppin, Laura Sacerdote and David Steinberg for helpful suggestions.
The data analyzed in this work is taken from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
Footnotes
E-mail: nitayalon{at}tauex.tau.ac.il