## Abstract

We propose a new epidemic model (SuEIR) for forecasting the spread of COVID-19, including numbers of confirmed and fatality cases at national and state levels in the United States. Specifically, the SuEIR model is a variant of the SEIR model by taking into account the untested/unreported cases of COVID-19, and trained by machine learning algorithms based on the reported historical data. Besides providing basic projections for confirmed and fatality cases, the proposed SuEIR model is also able to predict the peak date of active cases, and estimate the basic reproduction number (). In particular, the forecasts based on our model suggest that the peak date of the US, New York state, and California state are 06/01/2020, 05/10/2020, and 07/01/2020 respectively. In addition, the estimated of the US, New York state, and California state are 2.5, 3.6 and 2.2 respectively. The prediction results for all states in the US can be found on our project website: https://covid19.uclaml.org, which are updated on a weekly basis, and have been adopted by the Centers for Disease Control and Prevention (CDC) for COVID-19 death forecasts (https://www.cdc.gov/coronavirus/2019-ncov/covid-data/forecasting-us.html).

## 1 Introduction

The novel coronavirus disease (COVID-19), an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Chan et al., 2020; WHO, 2020b), has emerged into a global pandemic and lead to 233,839 death toll in the world as of April 30, 2020 (WHO, 2020a). The treatments for COVID-19 are still under investigation and in early stages. There are no specific vaccines or medicines that showed significant effectiveness on COVID-19 so far. As a consequence, one of the best ways to prevent the spread of COVID-19 in the short term is to follow the mitigation strategies such as social distancing, quarantine, and isolation. For example, the state of California in the US has issued mandatory stay-at-home order on March 19, shutting down all non-essential businesses. Only essential services, such as grocery stores, pharmacies, delivery restaurants, have remained open, and residents who need to leave home to take part in essential activities are advised to practice social distancing.

With the increasing availability of public data on COVID-19, more and more researches (Flaxman et al., 2020; Bendavid et al., 2020; Sutton et al., 2020; Altieri et al., 2020; Bertozzi et al., 2020; Murray et al., 2020) have been carried out to understand and prevent the spread of COVID-19 from different aspects. Among them, one important research direction is to model and forecast the spread of COVID-19, such as predicting the peak of the active cases on the virus and the size of the coronavirus outbreak. Such results can help government agencies better understand the overall impact of the disease and also facilitate policy makers in terms of pandemic preparedness and response such as allocating the medical resources.

One widely used method for modeling the spread of infectious disease is to use epidemic models such as Susceptible-Infected-Resistant (SIR) (Kermack and McKendrick, 1927) and Susceptible-Exposed-Infected-Removed (SEIR) (Hethcote, 2000). Such epidemic models are quite useful in describing the dynamics of transmission and are well-suited for predicting the peak of active cases on the virus. From the decision-making perspective, the peak prediction is able to forewarn the health system when to expect a surge in cases. Furthermore, the reproduction number (Fraser et al., 2009) estimated by the epidemic model can be directly used to measure the effectiveness of the intervention strategies such as social distancing and quarantine.

Several recent work used epidemic models such as the SIR and SEIR models (Imai et al., 2020; Li et al., 2020a; Wu et al., 2020; Kucharski et al., 2020; Read et al., 2020; Tang et al., 2020; Ferguson et al., 2020) to simulate the spread of COVID-19 in different regions and were able to forecast the size and severity of such epidemic outbreak. Some other work (Chinazzi et al., 2020; Kraemer et al., 2020; Dandekar and Barbastathis, 2020) also applied these epidemic models to study the role of quarantine controls such as travel restrictions in the spread of COVID-19. Most of the aforementioned studies consider classical epidemic models, e.g., the SIR and SEIR models, and base their analyses on the public reported data. However, it is often the case that the number of publicly reported cases (including confirmed cases and recovered cases) are much less than their real numbers as many infectious cases have not been tested due to test capability and asymptomatic patients, or even possibly under-reporting (Li et al., 2020b). As a result, classical epidemic models such as SIR and SEIR cannot accurately characterize the epidemic evolution of COVID-19 without taking such unreported cases into consideration. In addition, most existing work is focused on the nation-wide prediction. Nevertheless, it is also very important and beneficial to provide state and county level forecasts to assist local public health departments and governments to prevent the spread of COVID-19.

The goal of this paper is to make good use of the current public data on COVID-19 to better understand the spread of the coronavirus and to facilitate informed decisions by policy makers. In order to achieve this goal, we develop a new epidemic model, called the SuEIR model, to forecast the active cases and deaths of COVID-19 by taking the untested/unreported cases into consideration. In addition, we use machine learning based methods to train our model, which enables us to train the model efficiently. Based on the proposed model, we are able to make accurate predictions on the numbers of confirmed cases and fatality cases for nation, states and and counties. Moreover, our model can also predict the peak dates of active cases and estimate the basic reproduction number () of different states in the US.

### 1.1 The SIR and SEIR Models

In this subsection, we briefly introduce two classical epidemic models, i.e., the SIR (Kermack and McKendrick, 1927) and SEIR models (Hethcote, 2000), which have been adopted in many previous work to study the epidemic outbreaks such as SARS (Fang et al., 2006; Saito et al., 2013; Smirnova et al., 2019) and the ongoing COVID-19 (Read et al., 2020; Tang et al., 2020; Wu et al., 2020).

The SIR model. The SIR model is an epidemic model that shows the change of infection rate over time. More specifically, it characterizes the dynamic interplay among the susceptible individuals (S), infectious individuals (I) and removed individuals (R) (including recovered and deceased) in a certain place. In the SIR model, the susceptible individuals may become infectious individuals over time, which depends on the spread rate of the virus, often called the contact rate. Recovered individuals are assumed to be immune to the virus and thus cannot become susceptible again. To characterize this dynamics, we use *S _{t}, I_{t}, R_{t}* to represent the number of susceptible, infectious, and removed individuals at time

*t*, respectively. Suppose that the total population in a certain area is fixed as

*N*, then the evolving equations of the above variables over time are defined as follows: where

*β*is the contact rate between the susceptible and infectious groups, and

*γ*is the transition rate between the infectious and removed groups. The above ordinary differential equations indicate that at every time unit the total number of susceptible individuals will decrease by a quantity −

*βI*, which will transit into the infectious group. Apart from the increase from the transition of susceptible individuals, the size of the infectious group will also decrease by a factor of

_{t}S_{t}/N*γ*.

The SEIR model. For many diseases, there is often an incubation period during which individuals who have been exposed to the virus may not be as contagious as the infectious individuals. Therefore, it is necessary to separately model these cases as the “Exposed” group, and this gives rise to the SEIR model. The dynamics of the SEIR model introduces a new compartment *E _{t}*, which models the number of individuals that are exposed to the disease but have not developed obvious symptoms. Among all the exposed cases, there are only a

*σ*fraction of people who will develop observable symptoms in a time unit. Therefore, the dynamic of this model can be defined by the following ordinary differential equations:

Compared with the SIR model, the SEIR model has more elaborated model parameters. The parameters *σ, β* and *γ* can be learned from the reported data.

Reproduction number. An important quantity to characterize the dynamic of a pandemic is the basic reproduction number , which is the expected number of cases directly generated by one case in a population where all individuals are susceptible to infection (Fraser et al., 2009). The basic reproduction number in the SIR and SEIR models can be computed as .

## 2 Methods

In this section, we propose a new epidemic model and a machine learning method to train this model.

### 2.1 The SuEIR Model

It is observed that COVID-19 has an incubation period ranging from 2 to 14 days (Lauer et al., 2020). It has also been observed that individuals who have been exposed to the coronavirus can also infect the susceptible group during this period. In addition, it is often the case that the number of reported cases (including confirmed cases and recovered cases) are less than their real numbers as many exposed cases have not been tested, which will not pass to the next compartment. However, such important factors cannot be characterized by the classical epidemic models such as the SIR and SEIR models. We also observe that directly applying SIR or SEIR model to fit the reported data will lead to unreasonable predictions. Therefore, we proposed a new epidemic model that takes the untested/unreported cases as well as the “silent spreaders” into consideration. We call our model the SuEIR model and it is illustrated in Figure 1.

In particular, the compartment Exposed in our model is considered as the individuals that have already been infected and have not been tested. Therefore, they also have the capability to infect the susceptible individuals. Moreover, some of such individuals can receive a test and be further passed to the Infectious compartment (as well as reported to the public), while the others will recover but not appear in the publicly reported cases. Therefore, we introduce a new parameter 0 *< μ <* 1 in the evolution dynamics of *I _{t}* to characterize the ratio of the exposed cases that are confirmed and reported to the public, which we call it the

*discovery rate*. This discovery rate reflects the unreported/undiscovered cases, which is an important latent factor in the dynamics of the epidemic model. As a result, we propose to use the following ordinary differential equations to describe our proposed SuEIR model: where

*β*denotes the contact rate between the susceptible and “infected” groups (including both exposed and infectious compartments in Figure 1),

*σ*denotes the ratio of cases in the exposed compartments that are either confirmed as infectious or dead/recovered without confirmation,

*μ*is the discovery rate of the infected cases, and

*γ*denotes the transition rate between the compartments

*I*and

*R*.

### 2.2 Parameter Learning for the SuEIR Model

In this subsection, we will introduce our proposed machine learning method for training the SuEIR model. In addition, we will also present the detailed configurations used in our experiments.

Model training. As aforementioned in this paper, our model can be described by the ODE (1), which is determined by the parameters ** θ** = (

*β*,

*σ*,

*γ*,

*μ*). In particular, given the model parameters

**and initial quantities**

*θ**S*

_{0},

*E*

_{0},

*I*

_{0}, and

*R*

_{0}, we can compute the number of individuals in each group (i.e.,

*S*,

*E*,

*I*, and

*R*) at time

*t*, denoted by , , and , via applying standard numerical ODE solvers onto the ODE (1). Then we propose to learn the model parameter by minimizing the following logarithmic-type mean square error (MSE): where with

*I*and

_{t}*R*denote the reported numbers of infected cases and removed cases (including both recovered cases and fatality cases) at time t (i.e., date), and p is the smoothing parameter used to ensure numerical stability. Note that given

_{t}*S*

_{0},

*E*

_{0},

*I*

_{0}and

*R*

_{0}, and can be described as differentiable functions of the parameter

**. Then the model parameter can be learnt by applying standard gradient based optimizer (e.g., BFGS) onto the loss function (2) under the constraint that**

*θ**β, σ, γ, μ ∊[0,1]*.

Estimation of the number of removed cases *R _{t}*.

Note that *I _{t}* and

*R*in our model determine the number of “current” infectious cases (a.k.a., active cases) and the removed cases, i.e., the sum of recovered and fatality cases, respectively. However, most of the reported data only include the number of confirmed cases, i.e., sum of infected cases and removed cases

_{t}*I*+

_{t}*R*. In order to train the model, we need to get

_{t}*I*and

_{t}*R*separately. In addition, the SuEIR model can only predict the number of removed cases, while in many cases people are more interested in the number of fatality cases. Therefore, in order to enable the training of the SuEIR model, as well as provide the predictions for the number of fatality cases, we have to: (1) estimate the number of removed cases; (2) determine the number of active cases in the reported data by subtracting the estimated number of removed cases. In order to do so, we propose to use the following exponential function to model the ratio between the daily increased fatality cases and the removed cases, where

_{t}*a,b >*0 are parameters controlling the shape of the exponential function and

*t*denotes the number of days since the starting date. In order to demonstrate its effectiveness, we evaluate the approximation error based on the reported data in four countries: US, China, France, and Italy, which have separately reported fatality and recovered cases. More specifically, given parameters

*a, b*and the number of fatality cases, we are able to estimate the corresponding number of removed cases. Then the optimal parameters

*a*and

*b*are obtained by minimizing the MSE between the reported number of removed cases and the estimated one on different dates. The results are displayed in Figure 2, which clearly shows that the exponential functions can well describe the ratio between the daily increased numbers of fatality and removed cases. For each state and county in the US, we try several different choices of

*a*and

*b*around the optimal ones we obtained for the US, and pick the one with the smallest validation error.

Initialization. In terms of the initialization, we directly set and ^{1}. Additionally, one can typically set , where *N* is the total population of the region (which can be either a country or a state/county). However, since most of the states/counties in the US have already issued the stay-at-home offer, the actual total number of cases in the SuEIR model will be strictly less than *N*. Thus we set for some *N _{0} < N*. Moreover, it is worth noting that the initialization of

*E*, i.e., , is a bit tricky since we do not know the number of infected cases before testing them. It is not reasonable to set since generally there has already existed a large number of infected cases when the local governments began to test. Therefore, we propose to use a validation set to choose the optimal initial estimates

*N*

_{0}and when training our model.

Validation set. To determine the initial values of *N*_{0} and *E*_{0}, we first divide our data into the training data set and the validation data set. In detail, we choose the data in the most recent 7 days as the validation set, while treating the remaining as the training set. For example, suppose we have the data up to May 10, 2020, the data after May 3, 2020 will be used as the validation set, and the data up to May 3, 2020 will be used as the training set. We then do a grid search on different combinations of *N*_{0} and *E*_{0} and train different models on the training set accordingly. Finally, we choose the combination of N_{0} and E_{0} with the smallest validation loss (evaluated using the loss function (2)) along with the best model parameters (i.e., P, 7, a, *y*) to build the SuEIR model for prediction.

### 2.3 Confidence Interval

Given the initial quantities *S _{0}, E_{0}, I_{0}, R_{0}*, we can solve the optimization problem in (2) to obtain the model parameter . To assess the confidence of our estimator, we construct the confidence interval of

**following the previous work (Ma, 2020). More specifically, for a valid model parameter**

*θ***, we can compute the loss**

*θ**L*(

**) in (2), and construct the test statistic as , which represents the loglikelihood ratio between the point estimator and**

*θ***. Note that**

*θ***contains four free parameters (i.e.,**

*θ**β*,

*σ, γ*and

*μ*) while is fixed. By Wilks’s Theorem (Wilks, 1938), we know that follows distribution asymptotically. As a result, we can compare with the (1 −

*α*) quantile of the distribution and determine whether

**is in the confidence interval or not. In our experiment, we apply grid search on both sides of the point estimator to find the boundary of the confidence interval.**

*θ*### 2.4 Computation of the Basic Reproduction Number

We can also compute the basic reproduction number based on our proposed SuEIR model. Note that our model has a different dynamics from that of SIR and SEIR models. Thus we cannot directly apply the standard computation method of for the SIR or SEIR model to compute such number. Instead, we use the method proposed in Heffernan et al. (2005) to calculate based on the next-generation matrix. In specific, let x = (*x*_{1},…,*x*_{4})^{⊤} with *x _{i}* being the number of infected individuals in the compartment

*i*. Then we denote by function

*F*(x) the rate of new infections in compartment

_{i}*i*, and denote by and the rate of individuals transferred out of the compartment

*i*and the rate of individuals transferred into the compartment

*i*by all other means respectively. Let ,

*F*(x) = (

*F*

_{1}(x),…,

*F*

_{4}(x))

^{⊤}and

*V*(x) = (

*V*

_{1}(x),…,

*V*

_{4}(x))

^{⊤}The ODE (1) can be rewritten as

*dx/dt*=

*F*(

*x*) −

*V*(

*x*) with

Note that the disease-free equilibrium of our model is x* = (*N*, 0, 0, 0)^{⊤}. Let F and V be the partial Jacobian matrices of functions *F*(x) and *V* (x) with respect to the number of individuals in the “infective” compartments (both *E* and *I* compartments in the SuEIR model), i.e., *x*_{2} and *x*_{3}, i.e.,

Then the next-generation matrix **G** = **FV**^{−1} can be computed as follows:

Note that is given by the largest eigenvalue of next generation matrix **G** (Heffernan et al., 2005). Therefore, it is easy to show that the basic reproduction number of our proposed SuEIR model is

In contrast, the basic reproduction number for SIR and SEIR is .

## 3 Results

In this section, we present the forecast results, including confirmed cases and deaths, peak date and reproduction number, by our method.

Data collection. We use the data from the Johns Hopkins University Center for Systems Science and Engineering^{2} (Dong et al., 2020) to train our model for national-level forecasts. To train the state-level models, we use the data from The New York Times.^{3}In addition, we use the data from 03/22/2020 (most states have already issued the stay-at-home order by this date) to 05/10/2020 to train our models. More specifically, we use the reported data from 03/22/2020 to 05/03/2020 to train the SuEIR model while using the data from 05/04/2020 to 05/10/2020 for validation.

Prediction Results. For the interest of space, we present the forecast results of our models for the US and states with more than 40, 000 total cases, including New York, New Jersey, Illinois, Massachusetts, California, Pennsylvania, Michigan, Florida and Maryland. For more forecast results, please refer to our forecast website https://covid19.uclaml.org

Table 1 summarizes the projected death and its corresponding 95% confidence interval in the aforementioned regions from 05/12/2020 to 05/18/2020. The results show that our predictions are very close to the reported data, which suggests that our method performs very well in terms of the death forecasts. We also show the long term death forecasts by our method in Table 2. Our results suggest that by June 30, the projected death for the US is 123.4 × 10^{3} (95% CI 109.7× 10^{3}-140.4 × 10^{3}).

We demonstrate the projected number of confirmed cases by our approach along with its 95% confidence interval in Table 3. The results suggest that our predictions in terms of the number of confirmed cases are also reasonably accurate. For example, the reported number of confirmed cases in the US on 05/18/2020 is 1508 × 10^{3}, and our projected number is 1496 × 10^{3} (95% CI 1443 × 10^{3}-1572 × 10^{3}), which underestimates 12 × 10^{3} cases. In addition, the projected number of long term confirmed case is presented in Table 4. It shows that by 06/30/2020, the projected number of confirmed cases for the US is 1900 × 10^{3} (95% CI 1638 × 10^{3}-2362 × 10^{3}).

Our method can also forecast the peak date in different regions, i.e., the date with the largest number of active cases, as shown in Table 5. It can be seen that the projected peak date of the US is 06/01/2020, New York state is 05/10/2020, New Jersey is 05/19/2020, Illinois is 06/07/2020, Massachusetts is 05/23/2020, California is 07/01/2020, Pennsylvania is 05/20/2020, Michigan is 05/11/2020, Florida is 06/14/2020, Maryland is 05/27/2020.

Table 6 summarizes the basic reproduction number estimated by (4) in different regions, which characterizes the spread of the virus at the beginning of the epidemic. The results vary for different states, which are consistent with the severity of the coronavirus outbreak in these regions since mid March. For example the values of the states in the Northeastern US (e.g., NY: 3.6, NJ: 4.5, MA:4.2) are significantly higher than those of other states (e.g. CA: 2.2, MI: 2.1, FL: 2.4).

## 4 Discussion

We developed a novel epidemic model called SuEIR to infer the unreported cases of individuals contacting COVID-19. Based on this new model, we further develop a machine learning approach to forecast the numbers of confirmed and fatality cases in the US.

Our model can provide accurate short-term (daily ahead) projections for both confirmed cases and fatality cases at national and state levels, which demonstrates its effectiveness. In the long term, the prediction results by our model suggest that the numbers of confirmed cases and death will keep increasing rapidly within one month. In particular, at the end of June, our model forewarns that there will be approximately 2 millions confirmed infectious cases and 120K reported deaths in the United States.

Our model uses training data since 03/22/2020 at which most states have already issued stay-at-home order, and assumes that the contact rate maintains the same level during the training and prediction period. However, starting in May, many states have already lifted the restrictions of businesses and public spaces and considered reopening that allows people to go back to restaurants and offices and places of worship. It remains unclear how these reopening orders affect the contact rate as well as the spread of the virus and therefore our current model does not take this into consideration.

Moreover, we found that for most states, the learned discover rate (i.e., *μ*) is less than 0.1, which implies that a large fraction of “Exposed” individuals will finally recover/die without being tested and reported. This further suggests that the actual number of infected cases in the US may be more than 10 millions, while most of them are not counted. This result is consistent with the recent findings by the researchers from the University of Southern California (Sood et al., 2020), which show that 4.65% (CI: [2.8%, 5.6%]) of Los Angeles residents have already contracted the COVID-19 virus, which is approximately 23 times more than the official reported numbers.

## Footnotes

↵1 Here we omit the numbers of removed cases and recovered cases at the initialization by setting and to be the reported numbers of confirmed cases and fatality cases.