Forecasting COVID-19 Number of Cases by Implementing ARIMA and SARIMA with Grid Search in the United States

COVID-19 has surged in the United States since January 2020. Since then, social distancing and lockdowns have helped many people avoid infection. However, these measures did not prevent the upswing in the number of cases after the lockdowns ended. Modeling an infectious disease can help health care providers and governors plan ahead and obtain the needed resources. Precise short-term forecasts of the number of cases are therefore important to the healthcare system. Many models have been used since the pandemic started. In this paper we compare several time series models: Simple Moving Average, Exponentially Weighted Moving Average, Holt-Winters Double Exponential Smoothing Additive, ARIMA, and SARIMA. The first three are used to model the data, while ARIMA and SARIMA are used to predict the number of cases. A grid search was implemented to select the best combination of parameters for both forecasting models. Results show that, for modeling, the Holt-Winters Double Exponential model outperforms the Exponentially Weighted Moving Average and the Simple Moving Average, while for forecasting, ARIMA outperforms SARIMA.

Due to the mysterious nature of the virus, quarantine was the first and most persistent response used to prevent the spread of the disease. During the pandemic, several strategies were adopted to control the spread of the disease. Statistical and mathematical modeling have helped politicians and healthcare governors control and prepare for the outbreak and adopt various strategies. Data analytics has been used in various research fields such as medicine and finance [2]. Time-series techniques are prevalent in modeling and predicting data series indexed by time. One study used linear regression, the SIR model, and logistic regression to predict the number of infected individuals in four countries [3]. A study performed by Chen et al. used the ARIMA model to forecast property crime in China and showed that the ARIMA model prediction is very accurate [4]. SARIMA has been used by Szeto et al. to model and forecast traffic; the results showed only a ten percent error in the forecast [5]. Time-series analyses have been used in many publications to predict and forecast the number of individuals infected by an infectious disease. For example, Lai modeled the number of individuals infected by SARS through the Box-Jenkins model and ARIMA [6]. Martinez used SARIMA to model the incidence of dengue; the results indicate that the SARIMA model can help disease control predict the number of cases precisely [7]. In another work, Roy et al. used the ARIMA model to represent the COVID-19 spread and evaluated their model with RMSE and MAE [8]. Tandon et al. predicted the short-term rise in the number of cases using ARIMA time series [9]. Koyuncu et al. used SARIMA to investigate how COVID-19 has affected the maritime sector by forecasting RWI/ISL [10]. In another study, Ceylan used ARIMA and SARIMA models to investigate the prevalence of COVID-19 in three countries: Italy, France, and Spain [11].
ARIMA models have been used in many papers to predict the course of infectious diseases [12], [13], [14], [15], [16], [17], [18]. In this paper, we explore five time-series models. We use SMA, EWMA, and Holt-Winters Double Exponential Smoothing to describe the data and model the prevalence of confirmed cases and deceased cases. Afterward, we use ARIMA and SARIMA to model and predict the number of infected individuals. All analyses are performed on four states of the United States, selected from four different geographic regions to represent the spread of the pandemic across the country: Alabama, Massachusetts, California, and Washington. All analyses are performed for both the number of confirmed cases and the number of deceased cases. To evaluate the models' performance and predictions, RMSE (Root Mean Square Error) and MSE (Mean Square Error) indicate the precision of each prediction. The three smoothing models are compared on modeling, and SARIMA and ARIMA are compared on modeling and prediction. Because the number of reported cases is the number of individuals who tested positive, the data might not be accurate: many infected individuals do not take a test or do not seek medical treatment for the disease.

Data
To perform further analysis on the United States data set, we use the Johns Hopkins COVID-19 data repository, a widely used and reliable source of data on COVID-19 [19]. The time series data provided by the Johns Hopkins GitHub repository consists of the daily number of confirmed cases for each state. Due to the large volume of data, and because modeling all the states is beyond the scope of this paper, we present the modeling and forecasting for four states only: Alabama, Washington, California, and Massachusetts.
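As an illustration of how such data could be prepared, the following is a minimal sketch, not the authors' pipeline. It assumes, as we understand the repository's US time series files, a wide layout with a `Province_State` column, one row per county, and one column of cumulative counts per date. The sample values are made up, and the function names `state_cumulative` and `daily_new` are hypothetical. The paper does not state whether cumulative or daily counts were modeled, so both are produced.

```python
import csv
import io

# Toy excerpt in the assumed wide layout (one row per county, one column per date).
# The counts are invented for illustration; real files carry more columns.
SAMPLE = """Admin2,Province_State,1/22/20,1/23/20,1/24/20
Autauga,Alabama,0,1,3
Baldwin,Alabama,0,2,2
King,Washington,1,1,4
"""

def state_cumulative(csv_text, states):
    """Sum county rows into one cumulative series per selected state."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Date columns are the ones whose header starts with a digit, e.g. "1/22/20".
    date_cols = [c for c in reader.fieldnames if c[0].isdigit()]
    totals = {s: [0] * len(date_cols) for s in states}
    for row in reader:
        state = row["Province_State"]
        if state in totals:
            for i, col in enumerate(date_cols):
                totals[state][i] += int(row[col])
    return date_cols, totals

def daily_new(cumulative):
    """Convert a cumulative series into daily increments."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

dates, totals = state_cumulative(SAMPLE, ["Alabama", "Washington"])
alabama_daily = daily_new(totals["Alabama"])
```

With real data, the same aggregation would be run once per state before any smoothing or forecasting model is fitted.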

Simple Moving Average
As mentioned before, we apply the Simple Moving Average (SMA) technique to model the infected and deceased cases. The simple moving average is calculated by summing the case counts over the most recent periods and dividing by the number of periods. The formula is as follows:

$$\mathrm{SMA}_t = \frac{1}{n} \sum_{i=0}^{n-1} x_{t-i}$$

where $x_t$ is the number of cases on day $t$ and $n$ is the number of periods in the window. We model four states with the simple moving average using two window lengths, 10 days and 30 days. For confirmed cases, the fit degrades as the window length increases, which is expected because a longer window smooths more heavily and lags further behind the data. For short-term modeling, a simple moving average performs adequately. In all eight figures, we can see a decrease in the number of cases for late March.

Exponentially Weighted Moving Average
The primary difference between an EWMA and an SMA is the sensitivity of each one to the data. The SMA assigns uniform weight to all of the data in the window, while the EWMA gives more weight to recent data: the newest data affects the moving average more, while older data has less impact. In its recursive form, the EWMA is calculated as follows:

$$\mathrm{EWMA}_t = \alpha x_t + (1 - \alpha)\,\mathrm{EWMA}_{t-1}, \qquad \alpha = \frac{2}{\mathrm{span} + 1}$$

We use the 30 most recent days of data to calculate the EWMA. The following plots show the EWMA on the data of four states with a parameter of 30 days. To compare these models, we can see if the SMA parameter [...]
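The two smoothers can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the function names and the sample counts are hypothetical, the SMA uses a trailing window, and the EWMA uses the recursive form seeded with the first observation, with α = 2/(span + 1).

```python
def simple_moving_average(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out

def ewma(values, span):
    """Recursive EWMA with alpha = 2 / (span + 1), seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [float(values[0])]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

cases = [10, 12, 15, 20, 18, 25, 30]  # hypothetical daily counts
sma3 = simple_moving_average(cases, 3)
ewma3 = ewma(cases, 3)
```

A short window is used here only to keep the example readable; the paper's settings correspond to `window=10`, `window=30`, and `span=30`. Libraries that offer several EWMA weighting conventions may give slightly different values for the first few points.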

Holt Winters Double Exponential Smoothing Additive
Exponential smoothing is a procedure for continually revising a forecast in the light of more recent experience [20]. The Holt-Winters model has two variants: additive and multiplicative. The additive method is suitable when the seasonal variations are roughly constant through the series. The multiplicative method is suitable when the seasonal variations change in proportion to the level of the series. The choice depends on whether the seasonal variation is viewed as independent of the local mean level or as proportional to it [21]. Winters argues that some other factors can influence seasonal estimates. The paper mentions that seasonal factors will rise to fill the lack of [...] [22]. Double exponential models are used when the data has a trend. The Holt-Winters seasonal method contains three smoothing equations: the first updates the level with parameter α, the second updates the trend with parameter β, and the third updates the seasonality with parameter γ. Note that we use Holt-Winters Double Exponential Smoothing only for modeling, not for prediction. Results show that Holt-Winters Double Exponential Smoothing represents data with trend and seasonality well. Because the seasonal variations are roughly constant through the series, the additive model is expected to perform better than the multiplicative model; we nevertheless apply both. Based on the RMSE, we conclude that the additive model outperforms the multiplicative model on COVID-19 data. For both models, span = 30 days and α = 2/(span + 1) are selected. In the following figures, we can see that the HW model outperforms EWMA30.

ARIMA
[...] [23]. Using the data and the differences between time series values, ARIMA models can predict future values of the time series.
ARIMA models are used in situations where the data is non-stationary. The non-stationarity can be removed by applying differencing one or more times. AR stands for AutoRegressive, I for Integrated, and MA for Moving Average, which are the three components of ARIMA models. Here p is the order of the AR model, d the degree of differencing, and q the order of the MA model. ARIMA(p,d,q) is defined as:

$$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right) (1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \varepsilon_t$$

where $L$ is the lag operator, $\varphi_i$ are the AR coefficients, $\theta_i$ are the MA coefficients, and $\varepsilon_t$ is the error term. Here p and q give the order of each component: p appears on the left-hand side for the AR part and q on the right-hand side for the MA part. The factor $(1 - L)^d$ differences the data d times using the lag operator.

In the ARIMA model, choosing the parameters manually is not the best choice, since there may be human error in reading the autocorrelation plots; it is therefore better to use a grid search. To select the differencing terms, we used a test of stationarity, the augmented Dickey-Fuller test, and, for seasonal models, a test of seasonality, the Canova-Hansen test. The best combination of p, d, and q was chosen based on the AIC. RMSE (Root Mean Square Error) and MSE (Mean Square Error) are used in this study to assess the accuracy of the ARIMA model for different states and to compare different models. The model performs differently for different states. We set up the model iteratively to test different combinations of p, d, and q. This grid search provides the combination that minimizes the AIC. Each iteration fits an ARIMA model to the training data set, and the grid search keeps track of the lowest error found. The iteration stops after testing ten models that do not improve on the lowest error. The parameters are then selected, and the model is ready for forecasting. Forecasting is performed for 90 days. The model performs best for Alabama, followed by California, Washington, and Massachusetts.

SARIMA
[...] In the seasonal part of the model, P represents the seasonal autoregression, D represents the seasonal differencing, Q represents the seasonal moving average coefficient, and m represents the number of data points in each seasonal cycle. Again, the parameters could be determined from the ACF and PACF. In this paper, we use the grid search algorithm mentioned in section 2.5. The parameters used for each state are stated in the figures. Again, we use RMSE and MSE to measure the accuracy of the models.
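The AIC-driven order search described for ARIMA can be illustrated with a deliberately simplified sketch in plain Python. It is not the authors' implementation and rests on several assumptions: the MA order is fixed at q = 0 so that each candidate can be fitted by ordinary least squares, the differencing order d is supplied by the caller (in the paper it comes from a stationarity test, which is not implemented here), the early-stopping rule is omitted, and the data is synthetic. All function names are hypothetical.

```python
import math
import random

def difference(series, d):
    """Apply first differencing d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ar(z, p, max_p):
    """Least-squares AR(p) with intercept on series z; returns (coef, aic)."""
    # Every candidate starts at index max_p so all are scored on the same points.
    rows = [[1.0] + [z[t - k] for k in range(1, p + 1)] for t in range(max_p, len(z))]
    ys = z[max_p:]
    k = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    rss = sum((y - sum(c * x for c, x in zip(coef, r))) ** 2 for r, y in zip(rows, ys))
    n = len(ys)
    # Gaussian AIC up to a constant; k + 1 also counts the noise variance.
    return coef, n * math.log(rss / n) + 2 * (k + 1)

def select_order(series, d, max_p):
    """Return (p, coef) with the lowest AIC over p = 0..max_p, for fixed d."""
    z = difference(series, d)
    fits = [(fit_ar(z, p, max_p), p) for p in range(max_p + 1)]
    (coef, _aic), p = min(fits, key=lambda item: item[0][1])
    return p, coef

def forecast(series, d, coef, steps):
    """Forecast `steps` values ahead; this sketch supports d = 0 or d = 1 only."""
    if d not in (0, 1):
        raise ValueError("this sketch only integrates back for d = 0 or 1")
    z = difference(series, d)
    p = len(coef) - 1
    hist = list(z)
    for _ in range(steps):
        hist.append(coef[0] + sum(coef[k] * hist[-k] for k in range(1, p + 1)))
    preds = hist[len(z):]
    if d == 0:
        return preds
    level, out = series[-1], []
    for step in preds:  # undo the differencing by cumulative summation
        level += step
        out.append(level)
    return out

def mse_rmse(actual, predicted):
    """Mean square error and its square root."""
    mse = sum((a - f) ** 2 for a, f in zip(actual, predicted)) / len(actual)
    return mse, math.sqrt(mse)

# Synthetic series whose first differences follow an AR(1) process (assumption).
random.seed(0)
diffs = [0.0]
for _ in range(299):
    diffs.append(0.7 * diffs[-1] + random.gauss(0, 1))
level = [100.0]
for step in diffs:
    level.append(level[-1] + step)

train, test = level[:-30], level[-30:]
best_p, coef = select_order(train, d=1, max_p=4)
preds = forecast(train, 1, coef, steps=30)
mse, rmse = mse_rmse(test, preds)
```

All candidates are scored on the same observations so that their AIC values are comparable. A full search over p, d, q and the seasonal orders would normally rely on a statistical library that estimates the MA terms by maximum likelihood.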

Conclusion
We have also used the ARIMA and SARIMA techniques to model and predict the number of cases. These models can help governors and healthcare providers manage, plan, and prepare for the peaks of this and similar diseases. The results were evaluated using RMSE and MSE so that the performance of the methods in modeling and predicting the number of cases could be compared. Among the three smoothing models presented, the Holt-Winters Double Exponential Smoothing Additive model outperforms the other two for both infected and deceased cases. Also, MSE shows that ARIMA models are better forecasters of the number of infected individuals than SARIMA. As a limitation of this research, the data set contains a limited number of observations, covering only 16 months. To improve forecasting accuracy, deep learning methods could be applied to this data set.

Declaration of interest statement
The authors report no conflict of interest.

Funding
The authors did not receive support from any organization for this study.
medRxiv preprint doi: https://doi.org/10.1101/2021.05.29.21258041; this version posted June 1, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.