Data-Based Analysis, Modelling and Forecasting of the novel Coronavirus (2019-nCoV) outbreak ============================================================================================ * Cleo Anastassopoulou * Lucia Russo * Athanasios Tsakris * Constantinos Siettos ## Abstract Since the first suspected case of coronavirus diesease-2019 (COVID-19) on December 1st, 2019, in Wuhan, Hubei Province, China, a total of 40,235 confirmed cases and 909 deaths have been reported in China up to February 10, 2020, evoking fear locally and internationally. Here, based on the publicly available epidemiological data (WHO, CDC, ECDC, NHC and DXY) for Hubei from January 11 to February 10, 2020, we provide estimates of the main epidemiological parameters, i.e. the basic reproduction number (*R*) and the infection, recovery and mortality rates, along with their 90% confidence intervals. As the number of infected individuals, especially of those with asymptomatic or mild courses, is suspected to be much higher than the official numbers, which can be considered only as a sample of the actual numbers of infected and recovered cases in the population, we have repeated the calculations under a second scenario that considers twenty times the number of confirmed infected cases and forty times the number of recovered, leaving the number of deaths unchanged. Our computations and analysis were based on a mean field Susceptible-Infected-Recovered-Dead (SIRD) model. Thus, based on the SIRD simulations, the estimated average value of *R* in both scenarios was found to be 2.5, the one computed considering the period from the 11th of January until the 18th of January, using the official counts of the confirmed cases was found to be 4.6, while the one computed under the second scneario was found to be 3.2. Furthermore, on the estimated parameters from both scenarios, we provide tentative three-week forecasts of the evolution of the outbreak at the epicenter. Our forecasting flashes a note of caution for the presently unfolding outbreak in China. Based on the official counts for confirmed cases, the simulations suggest that the cumulative number of infected will surpass 80,000 (as a lower bound) and could reach 160,000 (with an upper bound of 330,000) by February 29. Regarding the number of deaths, simulations forecast that on the basis of the up to the 10th of February data and estimations of the actual number of infected and recovered in the population, the death toll might exceed 7,000 by February 29. However, our analysis further reveals a significant decline of the mortality rate to which various factors may have contributed, such as the severe control measures taken in Hubei, China (e.g. quarantine and hospitalization of infected individuals), but mainly it is due to the fact that the actual number of infected and recovered cases in the population is estimated to be more than twenty times higher than reported, thus resulting in a much lower mortality rate, which according to our computations is of the order of 0.15%. *K*eywords * Coronavirus * Covid-19 * SARS-CoV-2 * Mathematical Modeling * Data Analysis * Forecasting ## 1 Introduction An outbreak of “pneumonia of unknown etiology” in Wuhan, Hubei Province, China in early December 2019 has spiraled into an epidemic that is ravaging China and threatening to reach a pandemic state [1]. The causative agent soon proved to be a new betacoronavirus related to the Middle East Respiratory Syndrome virus (MERS-CoV) and the Severe Acute Respiratory Syndrome virus (SARS-CoV). The novel coronavirus SARS-CoV-2 disease has been named “COVID-19” by the World Health Organization (WHO) and on January 30, the COVID-19 outbreak was declared to constitute a Public Health Emergency of International Concern by the WHO Director-General [2]. Despite the lockdown of Wuhan and the suspension of all public transport, flights and trains on January 23, a total of 40,235 confirmed cases, including 6,484 (16.1%) with severe illness, and 909 deaths (2.2%) had been reported in China by the National Health Commission up to February 10, 2020; meanwhile, 319 cases and one death were reported outside of China, in 24 countries [3]. The origin of COVID-19 has not yet been determined although preliminary investigations are suggestive of a zoonotic, possibly of bat, origin [4, 5]. Similarly to SARS-CoV and MERS-CoV, the novel virus is transmitted from person to person principally by respiratory droplets, causing such symptoms as fever, cough, and shortness of breath after a period believed to range from 2 to 14 days following infection, according to the Centers for Disease Control and Prevention (CDC) [1, 6, 7]. Preliminary data suggest that older males with comorbidities may be at higher risk for severe illness from COVID-19 [6, 8, 9]. However, the precise virologic and epidemiologic characteristics, including transmissibility and mortality, of this third zoonotic human coronavirus are still unknown. Using the serial intervals (SI) of the two other well-known coronavirus diseases, MERS and SARS, as approximations for the true unknown SI, Zhao et al. estimated the mean basic reproduction number (*R*) of SARS-CoV-2 to range between 2.24 (95% CI: 1.96-2.55) and 3.58 (95% CI: 2.89-4.39) in the early phase of the outbreak [10]. Very similar estimates, 2.2 (95% CI: 1.4-3.9), were obtained for *R* at the early stages of the epidemic by Imai et al. 2.6 (95% CI: 1.5-3.5) [11], as well as by Li et al., who also reported a doubling in size every 7.4 days [1]. Wu et al. estimated the R0 at 2.68 (95% CI: 2.47–2.86) with a doubling time every 6.4 days (95% CI: 5.8–7.1) and the epidemic growing exponentially in multiple major Chinese cities with a lag time behind the Wuhan outbreak of about 1–2 weeks [12]. Amidst such an important ongoing public health crisis that also has severe economic repercussions, we reverted to mathematical modelling that can shed light to essential epidemiologic parameters that determine the fate of the epidemic [13]. Here, we present the results of the analysis of time series of epidemiological data available in the public domain (WHO, CDC, ECDC, NHC and DXY) from January 11 to February 10, 2020, and attempt a three-week forecast of the spreading dynamics of the emerged coronavirus epidemic in the epicenter in mainland China. ## 2 Methodology Our analysis was based on the publicly available data of the new confirmed daily cases reported for the Hubei province from the 11th of January until the 10th of February. Based on the released data, we attempted to estimate the mean values of the main epidemiological parameters, i.e. the basic reproduction number *R*, the infection (*α*), recovery (*β*) and mortality rate (*γ*), along with their 90% confidence intervals. However, as suggested [14], the number of infected, and consequently the number of recovered, people is likely to be much higher. Thus, in a second scenario, we have also derived results by taking twenty times the number of reported cases for the infected and forty times the number for the recovered cases, while keeping constant the number of deaths that is more likely to be closer to the real number. Based on these estimates we also provide tentative forecasts until the 29th of February. Our methodology follows a two-stage approach as described below. The basic reproduction number is one of the key values that can predict whether the infectious disease will spread into a population or die out. *R* represents the average number of secondary cases that result from the introduction of a single infectious case in a totally susceptible population during the infectiousness period. Based on the reported data of confirmed cases, we provide estimations of the *R* from the 16th up to the 20th of January in order to satisfy as much as possible the hypothesis of *S*≈*N* that is a necessary condition for the computation of *R*. We also provide estimations of *β, γ* over the entire period using a rolling window of one day from the 11th of January to the 16th of January to provide the very first estimations. Furthermore, we exploited the SIRD model to provide an estimation of the infection rate *α*. This is accomplished by starting with one infected person on the 16th of November, which has been suggested as a starting date of the epidemic [6], run the SIR model until the 10th of February and optimize *α* to fit the reported confirmed cases from the 11th of January to the 10th of February. Below, we describe analytically our approach. Let us start by denoting with *S*(*t*), *I*(*t*), *R*(*t*), *D*(*t*), the number of susceptible, infected, recovered and dead persons respectively at time *t* in the population of size *N*. For our analysis we assume that the total number of the population remains constant. Based on the demographic data for the province of Hubei *N* = 59*m*. Thus, the discrete SIRD model reads: ![Formula][1] ![Formula][2] ![Formula][3] ![Formula][4] The above system is defined in discrete time points *t* = 1, 2, …, with the corresponding initial condition at the very start of the epidemic: *S*(0) = *N* − 1, *I*(0) = 1, *R*(0) = *D*(0) = 0. Initially, when the spread of the epidemic starts, all the population is considered to be susceptible, i.e. *S* ≈ *N*. Based on this assumption, by Eq.(2),(3),(4), the basic reproduction number can be estimated by the parameters of the SIRD model as: ![Formula][5] A problem with the approximation of the epidemiological parameters *α, β* and *γ* and thus *R*, from real-world data based on the above expressions is that in general, for large scale epidemics, the actual number of infected *I*(*t*) persons in the whole population is unknown. Thus, the problem is characterized by high uncertainty. However, one can attempt to provide some coarse estimations of the epidemiological parameters based on the reported confirmed cases using the approach described next. ### 2.1 Identification of epidemiological parameters from the reported data of confirmed cases Lets us denote with Δ*I*(*t*) = *I*(*t*) −*I*(*t*− 1), Δ*R*(*t*) = *R*(*t*) −*R*(*t* −1), Δ*D*(*t*) = *D*(*t*) −*D*(*t*− 1), the reported new cases of infected, recovered and deaths at time *t*, with *C*Δ*I*(*t*), *C*Δ*R*(*t*), *C*Δ*D*(*t*) the cumulative numbers of confirmed cases at time *t*. Thus: ![Formula][6] where, *X* = *I, R, D*. Let us also denote by **CX**(*t*) = [*CX*(1), *CX*(2), …, *CX*(*t*)]*T*, the *t*×1 column vector containing the corresponding cumulative numbers up to time *t*. On the basis of Eqs.(2), (4), one can provide a coarse estimation of the parameters *R*, *β* and *γ* as follows. Starting with the estimation of *R*, we note that as the province of Hubei has a population of 59m, one can reasonably assume that for any practical means, at least at the beginning of the outbreak, *S* ≈ *N*. By making this assumption, one can then provide an approximation of the expected value of *R* using Eq.(5) and Eq.(2), Eq.(3), Eq.(4). In particular, substituting in Eq.(2), the terms *βI*(*t* − 1) and *γI*(*t* − 1) with Δ*R*(*t*) = *R*(*t*) − *R*(*t* − 1) from Eq.(3), and Δ*D*(*t*) = *D*(*t*) − *D*(*t* − 1) from Eq.(4) and bringing them into the left-hand side of Eq.(2), we get: ![Formula][7] Adding Eq.(3) and Eq.(4), we get: ![Formula][8] Finally, assuming that for any practical means at the beginning of the spread that *S*(*t* − 1) ≈ *N* and dividing Eq.(7) by Eq.(8) we get: ![Formula][9] Note that one can use directly Eq.(9) to compute *R* with regression, without the need to compute first the other parameters, i.e. *β, γ* and *α*. At this point, the regression can be done either by using the differences per se, or by using the corresponding cumulative functions (see that by summing up both sides of Eq.(7) and Eq.(8) over time one can use the cumulative functions instead of the differences for the calculation of *R* using Eq.(9)). Here, we have chosen the second option as a standard practice to reduce the noise included in the differences. Thus, based on the above, a coarse estimation of *R* and its corresponding confidence intervals can be provided by solving a linear regression problem using least-squares problem as: ![Formula][10] The prime ′ is for the transpose operation. The estimation of the mortality and recovery rates was based on the simple formula used by the National Health Commission (NHC) of the People’s Republic of China [15] (see also [16]), that is the ratio of the cumulative number of recovered/deaths and that of infected at time *t*. Thus, a coarse/simplistic estimation of the mortality can be calculated solving a linear regression problem for the corresponding cumulative function by least squares as: ![Formula][11] In a similar manner, the recovery rate can be computed by: ![Formula][12] As the reported data are just a sample of the actual number of infected and recovered cases including the asymptomatic and/or mild ones, we have repeated the above calculations by taking twenty times the reported number of infected and forty times the reported number of recovered in the population, while leaving the reported number of deaths the same considering that their cataloguing is close to the actual number of deaths due to COVID-19. ### 2.2 Estimation of the infection rate from the SIRD model Having estimated the expected values of the parameters *β* and *γ*, an approximation of the infected rate *α*, that is not biased by the assumption of *S* = *N* can be obtained by using the SIRD simulator. In particular, in the SIRD model we set ![Graphic][13] and ![Graphic][14], and set as initial conditions one infected person on the 16th of November and run the simulator until the last date for which there are available data (here up to the 10th of February). Then, the value of the infection rate *α* can be found by “wrapping” around the SIRD simulator an optimization algorithm (such as a nonlinear least-squares solver) to solve the problem: ![Formula][15] where ![Formula][16] where, ![Graphic][17] are the differences between the cumulative new cases of the infected, recovered and dead people resulting from the SIRD simulator and the reported (or, in the second scenario the five times the reported infected and recovered) real-world cumulative cases at time *t*; *w*1, *w*2, *w*3 correspond to scalars serving in the general case as weights to the relevant functions. For the solution of the above optimization problem we used the function “lsqnonlin” of matlab [17] using the Levenberg-Marquard algorithm. ## 3 Results As discussed, we have derived results using two different scenarios (see in Methodology). For each scenario, we first present the results obtained by solving the least squares problem as described in section 2.1 using a rolling window of an one-day step. The first time window was that from the 11th up to the 16th of January i.e. we used the first six days to provide the very first estimations of the epidemiological parameters. We then proceeded with the calculations by adding one day in the rolling window as described in the methodology until the 10th of February. We also report the corresponding 90% confidence intervals instead of the more standard 95% because of the small size of the data. For each window, we also report, the corresponding coefficients of determination (*R*2) representing the proportion of the variance in the dependent variable that is predictable from the independent variables, and the root mean square of error (RMSE). The estimation of *R* was based on the data until January 20, in order to satisfy as much as possible the hypothesis underlying its calculation by Eq.(9). Then, we used the SIRD model to provide an estimation of infection rate by “wrapping” around the SIRD simulator the optimization algorithm as described in section 2.2. Finally, we provide tentative forecasts for the evolution of the outbreak based on both scenarios until the end of February. ### 3.1 Scenario I: Results obtained using the exact numbers of the reported confirmed cases Figure1 depicts an estimation of *R* for the period January 16-January 20. Using the first six days from the 11th of January, ![Graphic][18] results in ∼ 4.80 (90% CI: 3.36-6.67); using the data until January 17, ![Graphic][19] results in ∼ 4.60 (90% CI: 3.56-5.65); using the data until January 18, ![Graphic][20] results in ∼ 5.14 (90%CI: 4.25-6.03); using the data until January 19, ![Graphic][21] results in ∼ 6.09 (90% CI: 5.02-7.16); and using the data until January 20, ![Graphic][22] results in ∼ 7.09 (90% CI: 5.84-8.35) ![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F1.medium.gif) [Figure 1:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F1) Figure 1: Scenario I. Estimated values of the basic reproduction number (*R*) as computed by least squares using a rolling window with initial date the 11th of January. The solid line corresponds to the mean value and dashed lines to lower and upper 90% confidence intervals. ![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F2.medium.gif) [Figure 2:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F2) Figure 2: Scenario I. Estimated values of the recovery ![Graphic][23] and mortality ![Graphic][24] rate as computed by least squares using a rolling window (see section 2.1). Solid lines correspond to the mean values and dashed lines to lower and upper 90% confidence intervals Figure2 depicts the estimated values of the recovery (*β*) and mortality (*γ*) rates for the period January 16 to February 10. The confidence intervals are also depicted with dashed lines. Note that the large variation in the estimated values of *β* and *γ* should be accounted to the small size of the data and data uncertainty. This is also reflected in the corresponding confidence intervals. As more data are taken into account, this variation is significantly reduced. Thus, using all the available data from the 11th of January until the 10th of February, the estimated value of the mortality rate *γ* is ∼2.94% (90% CI: 2.89%-3.0%) and that of the recovery rate is ∼0.05 (90% CI: 0.045-0.055) corresponding to ∼20 days (90% CI: 18-22). It is interesting to note that as the available data become more, the estimated recovery rate increases significantly from the 31th of January (see Fig.2). In Figures3,4,5, we show the coefficients of determination (*R*2) and the root of mean squared errors (*RMSE*), for ![Graphic][25] and ![Graphic][26], respectively. ![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F3.medium.gif) [Figure 3:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F3) Figure 3: Scenario I. Coefficient of determination (*R*2) and root mean square error (*RMSE*) resulting from the solution of the linear regression problem with least-squares for the basic reproduction number (*R*) ![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F4.medium.gif) [Figure 4:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F4) Figure 4: Scenario I. Coefficient of determination (*R*2) and root mean square error (*RMSE*) resulting from the solution of the linear regression problem with least-squares for the recovery rate (*β*) ![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F5.medium.gif) [Figure 5:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F5) Figure 5: Scenario I. Coefficient of determination (*R*2) and root mean square error (RMSE) resulting from the solution of the linear regression problem with least-squares for the mortality rate (*γ*) As described in the methodology, we have also used the SIRD simulator to provide an estimation of the infection rate by optimization with *w*1=1, *w*2=2, *w*3=2. Thus, we performed the simulations by setting *β*=0.05 and *γ*=0.0294, and as initial conditions one infected, zero recovered and zero deaths on November 16th 2019, and run until the 10th of February. The optimal, with respect to the reported confirmed cases from the 11th of January to the 10th of February, value of the infected rate (*α*) was ∼0.201 (90% CI: 0.199-0.203). This corresponds to a mean value of the basic reproduction number ![Graphic][27]. Note that this value is different compared to the value that was estimated using solely the reported data. Finally, using the derived values of the parameters *α, β, γ*, we have run the SIRD simulator until the end of February. The results of the simulations are given in Figures6,7,8. Solid lines depict the evolution, when using the expected (mean) estimations and dashed lines illustrate the corresponding lower and upper bounds as computed at the limits of the confidence intervals of the estimated parameters. ![Figure 6:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F6.medium.gif) [Figure 6:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F6) Figure 6: Scenario I. Simulations until the 29th of February of the cumulative number of infected as obtained using the SIRD model. Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters *α* = 0.201, *β* = 0.05, *γ* = 0.0294;dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters. ![Figure 7:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F7.medium.gif) [Figure 7:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F7) Figure 7: Scenario I. Simulations until the 29th of February of the cumulative number of recovered as obtained using the SIRD model. Dots correspond to the number of confirmed cases from the 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters *α* = 0.201, *β* = 0.05, *γ* = 0.0294;dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters. ![Figure 8:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F8.medium.gif) [Figure 8:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F8) Figure 8: Scenario I. Simulations until the 29th of February of the cumulative number of deaths as obtained using the SIRD model. Dots correspond to the number of confirmed cases from 16th of January to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters *α* = 0.201, *β* = 0.05, *γ* = 0.0294;dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters. As Figures 6,7 suggest, the forecast of the outbreak at the end of February, through the SIRD model is characterized by high uncertainty. In particular, simulations result in an expected number of ∼160,000 infected cases but with a high variation: the lower bound is at ∼84,000 infected cases while the upper bound is at 330,000 cases. Similarly for the recovered population, simulations result in an expected number of ∼70,000, while the lower and upper bounds are at ∼44,000 and ∼100,000, respectively. Finally, regarding the deaths, simulations result in an average number of 40,000, with lower and upper bounds, ∼23,000 and ∼72,000, respectively. However, as more data are released it appears that the mortality rate is much lower than the predicted with the current data and thus the death toll is expected, hopefully to be significantly less compared with the predictions. Furthermore, simulations reveal that the actual number of deaths is significantly lower than the lower bound of the simulations. This suggests that the mortality rate is considerably smaller than the estimated one based on the reported official data. Thus, it is expected that the actual numbers of the infected and consequently the recovered cases are considerably larger than the reported ones. Hence, we assessed the dynamics of the epidemics by considering a different scenario that we present in the following subsection. ### 3.2 Scenario II. Results obtained based by taking twenty times the number of infected and forty times the number of recovered people with respect to the confirmed cases For our illustrations, we assumed that the number of infected is twenty times the number of the confirmed infected and forty times the number of the confirmed recovered people. Figure9 depicts an estimation of *R* for the period January 16-January 20. Using the first six days from the 11th of January to 16th of January, ![Graphic][28] results in 3.25 (90% CI: 2.37-4.14); using the data until January 17, ![Graphic][29] results in 3.13 (90% CI: 2.50-3.76); using the data until January 18, ![Graphic][30] results in 3.42 (90% CI: 2.91-3.93); using the data until January 19, ![Graphic][31] results in 3.96 (90% CI: 3.36-4.57) and using the data until January 20, ![Graphic][32] results in 4.58 (90% CI: 3.81-5.35). It is interesting to note that the above estimation of *R* is close enough to the one reported in other studies (see in the Introduction for a review). ![Figure 9:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F9.medium.gif) [Figure 9:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F9) Figure 9: Scenario II. Estimated values of the basic reproduction number (*R*) as computed by least squares using a rolling window with initial date the 11th of January. The solid line corresponds to the mean value and dashed lines to lower and upper 90% confidence intervals. Figure10 depicts the estimated values of the recovery (*β*) and mortality (*γ*) rates for the period January 16 to February 10. The confidence intervals are also depicted with dashed lines. Note that the large variation in the estimated values of *β* and *γ* should be accounted to the small size of the data and data uncertainty. This is also reflected in the corresponding confidence intervals. As more data are taken into account, this variation is significantly reduced. Thus,using all the (scaled) data from the 11th of January until the 10th of February, the estimated value of the mortality rate *γ* now drops to ∼ 0.147% (90% CI: 0.144%-0.149%) while that of the recovery rate is ∼ 0.1 (90% CI: 0.091-0.11) corresponding to ∼ 10 days (90% CI: 9-11 days). It is interesting also to note, that as the available data become more, the estimated recovery rate increases slightly (see Fig.10), while the mortality rate seems to be stabilized at a rate of ∼ 0.147%. In Figures 11,12,13, we show the coefficients of determination (*R*2) and the root of mean squared errors (*RMSE*), for ![Graphic][33] and ![Graphic][34], respectively. ![Figure 10:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F10.medium.gif) [Figure 10:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F10) Figure 10: Scenario II. Estimated values of the recovery ![Graphic][35] rate and mortality ![Graphic][36] rate, as computed by least squares using a rolling window (see section 2.1). Solid lines correspond to the mean values and dashed lines to lower and upper 90% confidence intervals ![Figure 11:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F11.medium.gif) [Figure 11:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F11) Figure 11: Scenario II. Coefficient of determination (*R*2) and root mean square error (*RMSE*) resulting from the solution of the linear regression problem with least-squares for the basic reproduction number (*R*). ![Figure 12:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F12.medium.gif) [Figure 12:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F12) Figure 12: Scenario II. Coefficient of determination (*R*2) and root mean square error (*RMSE*) resulting from the solution of the linear regression problem with least-squares for the recovery rate (*β*). ![Figure 13:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F13.medium.gif) [Figure 13:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F13) Figure 13: Scenario II. Coefficient of determination (*R*2) and root mean square error (*RMSE*) resulting from the solution of the linear regression problem with least-squares for the mortality rate (*γ*). Again, we used the SIRD simulator to provide estimation of the infection rate by optimization setting *w*1 = 1, *w*2 = 400, *w*3 = 1 to balance the residuals of deaths with the scaled numbers of the infected and recovered cases. Thus, to find the optimal infection run, we used the SIRD simulations with *β* = 0.1, and *γ* = 0.00147 and as initial conditions one infected, zero recovered, zero deaths on November 16th 2019, and run until the 10th of February. The optimal, with respect to the reported confirmed cases from the 11th of January to the 10th of February value of the infected rate (*α*) was ∼0.25 (90% CI: 0.248-0.252). This corresponds to a mean value of the basic reproduction number ![Graphic][37]. Finally, using the derived values of the parameters *α, β, γ*, we have run the SIRD simulator until the end of February. The simulation results are given in Figures14,15,16. Solid lines depict the evolution, when using the expected (mean) estimations and dashed lines illustrate the corresponding lower and upper bounds as computed at the limits of the confidence intervals of the estimated parameters. ![Figure 14:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F14.medium.gif) [Figure 14:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F14) Figure 14: Scenario II. Simulations until the 29th of February of the cumulative number of infected as obtained using the SIRD model. Dots correspond to the number of confirmed cases from 16th of Jan to the 10th of February. The initial date of the simulations was the 16th of November with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters *α* = 0.25, *β* = 0.1, *γ* = 0.00147; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters. Again as Figures 15,16 suggest, the forecast of the outbreak at the end of February, through the SIRD model is characterized by high uncertainty. In particular, in Scenario II, by February 29, simulations result in an expected actual number of 1.9m infected cases in the whole population with a lower bound at 660,000 and an upper bound at ∼4.6m cases. Similarly, for the recovered population, simulations result in an expected actual number of 1.3m, while the lower and upper bounds are at 540,000 and 2.9m, respectively. Finally, regarding the deaths, simulations under this scenario result in an average number of 19,000, with lower and upper bounds, 7,000 and 47,000. Table 1 summarizes the above results for both scenarios. View this table: [Table 1:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/T1) Table 1: Model parameters, their computed values and forecasts for the Hubei province under two scenarios: (I) using the exact values of confirmed cases or (II) using estimations for infected and recovered (twenty and forty times the number of confirmed cases, respectively). ![Figure 15:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F15.medium.gif) [Figure 15:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F15) Figure 15: Scenario II. Simulations until the 29th of February of the cumulative number of recovered as obtained using the SIRD model. Dots correspond to the number of confirmed cases from 16th of January to the 10th of February. The initial date of the simulations was the 16th of November, with one infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters *α* = 0.25, *β* = 0.1, *γ* = 0.00147; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters. ![Figure 16:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/02/25/2020.02.11.20022186/F16.medium.gif) [Figure 16:](http://medrxiv.org/content/early/2020/02/25/2020.02.11.20022186/F16) Figure 16: Scenario II. Simulations until the 29th of February of the cumulative number of deaths as obtained using the SIRD model. Dots correspond to the number of confirmed cases from the 16th of November to the 10th of February. The initial date of the simulations was the 16th of November with zero infected, zero recovered and zero deaths. Solid lines correspond to the dynamics obtained using the estimated expected values of the epidemiological parameters *α* = 0.25, *β* = 0.1, *γ* = 0.00147; dashed lines correspond to the lower and upper bounds derived by performing simulations on the limits of the confidence intervals of the parameters. We note, that the results derived under the Scenario II seem to better reflect the actual situation as the reported number of deaths is within the average and lower limits of the SIRD simulations. In particular, the reported number of deaths on the 22th February was 2,346, while the lower bound of the forecast is 2,900. This indicates an even lower mortality rate than that of 0.147%, thus an even larger actual number of infected (and recovered) cases in the whole population. At this point we should make clear and underline that the above forecasts should not be considered by any means as actual/ real-world predictions for the evolution of the epidemic as they are based on yet incomplete knowledge of the characteristics of the epidemic, the high uncertainty regarding the actual number of already infected and recovered people and the effect of the measures already taken to combat the epidemic. ## 4 Discussion and Concluding Remarks In the digital and globalized world of today, new data and information on the novel coronavirus and the evolution of the outbreak become available at an unprecedented pace. Still, crucial questions remain unanswered and accurate answers for predicting the dynamics of the outbreak simply cannot be obtained at this stage. We emphatically underline the uncertainty of available official data, particularly pertaining to the true baseline number of infected (cases), that may lead to ambiguous results and inaccurate forecasts by orders of magnitude, as also pointed out by other investigators [1, 14, 18]. Of note, COVID-19, which is thought to be principally transmitted from person to person by respiratory droplets and fomites without excluding the possibility of the fecal-oral route [19] had been spreading for at least over a month and a half before the imposed lockdown and quarantine of Wuhan on January 23, having thus infected unknown numbers of people. The number of asymptomatic and mild cases with subclinical manifestations that probably did not present to hospitals for treatment may be substantial; these cases, which possibly represent the bulk of the COVID-19 infections, remain unrecognized, especially during the influenza season [18]. This highly likely gross under-detection and underreporting of mild or asymptomatic cases inevitably throws severe disease courses calculations and death rates out of context, distorting epidemiologic reality. Another important factor that should be taken into consideration pertains to the diagnostic criteria used to determine infection status and confirm cases. A positive PCR test was required to be considered a confirmed case by China’s Novel Coronavirus Pneumonia Diagnosis and Treatment program in the early phase of the outbreak [20]. However, the sensitivity of nucleic acid testing for this novel viral pathogen may only be 30-50%, thereby often resulting in false negatives, particularly early in the course of illness. To complicate matters further, the guidance changed in the recently-released fourth edition of the program on February 6 to allow for diagnosis based on clinical presentation, but only in Hubei province [20]. The swiftly growing epidemic seems to be overwhelming even for the highly efficient Chinese logistics that did manage to build two new hospitals in record time to treat infected patients. Supportive care with extracorporeal membrane oxygenation (ECMO) in intensive care units (ICUs) is critical for severe respiratory disease. Large-scale capacities for such level of medical care in Hubei province, or elsewhere in the world for that matter, amidst this public health emergency may prove particularly challenging. We hope that the results of our analysis contribute to the elucidation of critical aspects of this outbreak so as to contain the novel coronavirus as soon as possible and mitigate its effects regionally, in mainland China, and internationally. ## Data Availability All data referred to in the manuscript are available. [https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6](https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6) ## Footnotes * cleoa{at}med.uoa.gr * lucia.russo{at}cnr.it * atsakris{at}med.uoa.gr * constantinos.siettos{at}unina.it * Received February 11, 2020. * Revision received February 23, 2020. * Accepted February 25, 2020. * © 2020, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## References 1. [1]. Q Li, X Guan, P Wu, and et al. Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia, jan 2020. 2. [2].World Health Organization. WHO Statement Regarding Cluster of Pneumonia Cases in Wuhan, China, 2020. 3. [3].World Health Organization. Novel coronavirus(2019-nCoV). Situation report 21. Geneva, Switzerland: World Health Organization; 2020, 2020. 4. [4]. Roujian Lu, Xiang Zhao, Juan Li, Peihua Niu, Bo Yang, Honglong Wu, Wenling Wang, Hao Song, Baoying Huang, Na Zhu, Yuhai Bi, Xuejun Ma, Faxian Zhan, Liang Wang, Tao Hu, Hong Zhou, Zhenhong Hu, Weimin Zhou, Li Zhao, Jing Chen, Yao Meng, Ji Wang, Yang Lin, Jianying Yuan, Zhihao Xie, Jinmin Ma, William J Liu, Dayan Wang, Wenbo Xu, Edward C Holmes, George F Gao, Guizhen Wu, Weijun Chen, Weifeng Shi, and Wenjie Tan. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. The Lancet, jan 2020. 5. [5]. Peng Zhou, Xing-Lou Yang, Xian-Guang Wang, Ben Hu, Lei Zhang, Wei Zhang, Hao-Rui Si, Yan Zhu, Bei Li, Chao-Lin Huang, Hui-Dong Chen, Jing Chen, Yun Luo, Hua Guo, Ren-Di Jiang, Mei-Qin Liu, Ying Chen, Xu-Rui Shen, Xi Wang, Xiao-Shuang Zheng, Kai Zhao, Quan-Jiao Chen, Fei Deng, Lin-Lin Liu, Bing Yan, Fa-Xian Zhan, Yan-Yi Wang, Geng-Fu Xiao, and Zheng-Li Shi. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature, feb 2020. 6. [6]. Nanshan Chen, Min Zhou, Xuan Dong, Jieming Qu, Fengyun Gong, Yang Han, Yang Qiu, Jingli Wang, Ying Liu, Yuan Wei, Jiaan Xia, Ting Yu, Xinxin Zhang, and Li Zhang. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study. The Lancet, jan 2020. 7. [7]. A Patel, DB Jernigan, and 2019 nCoV CDC Response Team. Initial public health response and interim clinical guidance for the 2019 novel coronavirus outbreak - united states, december 31, 2019-february 4, 2020. MMWR Morb Mortal Wkly Rep, feb 2020. 8. [8]. C Hunag, Y Wang, X Li, and et al. Clinical features of patients infected with 2019 novel coronavirus in wuhan, china. Lancet, jan 2020. 9. [9]. Dawei Wang, Bo Hu, Chang Hu, Fangfang Zhu, Xing Liu, Jing Zhang, Binbin Wang, Hui Xiang, Zhenshun Cheng, Yong Xiong, Yan Zhao, Yirong Li, Xinghuan Wang, and Zhiyong Peng. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus–infected pneumonia in wuhan, china. JAMA, feb 2020. 10. [10]. Shi Zhao, Qianyin Lin, Jinjun Ran, Salihu S Musa, Guangpu Yang, Weiming Wang, Yijun Lou, Daozhou Gao, Lin Yang, Daihai He, and Maggie H Wang. Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in china, from 2019 to 2020: A data–driven analysis in the early phase of the outbreak. Int J Infect Dis, jan 2020. 11. [11]. N Imai, A Cori, I Dorigatti, and et al. Report 3: Transmissibility of 2019-ncov. Int J Infect Dis, dec 2019. 12. [12]. Joseph T Wu, Kathy Leung, and Gabriel M Leung. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in wuhan, china: a modelling study. The Lancet, jan 2020. 13. [13].Constantinos I. Siettos and Lucia Russo. Mathematical modeling of infectious disease dynamics. Virulence, 4(4):295–306, may 2013. 14. [14]. Peng Wu, Xinxin Hao, Eric H Y Lau, Jessica Y Wong, Kathy S M Leung, Joseph T Wu, Benjamin J Cowling, and Gabriel M Leung. Realtime tentative assessment of the epidemiological characteristics of novel coronavirus infections in wuhan, china, as at 22 january 2020. Eurosurveillance, 25(3), jan 2020. 15. [15].NHC. NHS Press Conference, Feb. 4 2020 - National Health Commission (NHC) of the People’s Republic of China, feb 2020. 16. [16]. A. C. Ghani, C. A. Donnelly, D. R. Cox, J. T. Griffin, C. Fraser, T. H. Lam, L. M. Ho, W. S. Chan, R. M. Anderson, A. J. Hedley, and G. M. Leung. Methods for Estimating the Case Fatality Ratio for a Novel, Emerging Infectious Disease. American Journal of Epidemiology, 162(5):479–486, 09 2005. 17. [17].The Mathworks, Inc., Natick, Massachusetts. MATLAB R2018b, 2018. 18. [18]. Manuel Battegay, Richard Kuehl, Sarah Tschudin-Sutter, Hans H. Hirsch, Andreas F. Widmer, and Richard A. Neher. 2019-novel coronavirus (2019-nCoV): estimating the case fatality rate – a word of caution. Swiss Medical Weekly, feb 2020. 19. [19]. J Gale. Coronavirus May Transmit Along Fecal-Oral Route, Xinhua Reports, feb 2020. 20. [20].Johns Hopkins Center for Health Security. Daily updates on the emerging novel coronavirus from the Johns Hopkins Center for Health Security. February 9, 2020, feb 2020. [1]: /embed/graphic-1.gif [2]: /embed/graphic-2.gif [3]: /embed/graphic-3.gif [4]: /embed/graphic-4.gif [5]: /embed/graphic-5.gif [6]: /embed/graphic-6.gif [7]: /embed/graphic-7.gif [8]: /embed/graphic-8.gif [9]: /embed/graphic-9.gif [10]: /embed/graphic-10.gif [11]: /embed/graphic-11.gif [12]: /embed/graphic-12.gif [13]: /embed/inline-graphic-1.gif [14]: /embed/inline-graphic-2.gif [15]: /embed/graphic-13.gif [16]: /embed/graphic-14.gif [17]: /embed/inline-graphic-3.gif [18]: /embed/inline-graphic-4.gif [19]: /embed/inline-graphic-5.gif [20]: /embed/inline-graphic-6.gif [21]: /embed/inline-graphic-7.gif [22]: /embed/inline-graphic-8.gif [23]: F2/embed/inline-graphic-9.gif [24]: F2/embed/inline-graphic-10.gif [25]: /embed/inline-graphic-11.gif [26]: /embed/inline-graphic-12.gif [27]: /embed/inline-graphic-13.gif [28]: /embed/inline-graphic-14.gif [29]: /embed/inline-graphic-15.gif [30]: /embed/inline-graphic-16.gif [31]: /embed/inline-graphic-17.gif [32]: /embed/inline-graphic-18.gif [33]: /embed/inline-graphic-19.gif [34]: /embed/inline-graphic-20.gif [35]: F10/embed/inline-graphic-21.gif [36]: F10/embed/inline-graphic-22.gif [37]: /embed/inline-graphic-23.gif