Evaluating Short-term Forecast among Different Epidemiological Models under a Bayesian Framework

Forecasting of COVID-19 daily confirmed cases has been one of the several challenges posed on the governments and health sectors on a global scale. To facilitate informed public health decisions, the concerned parties rely on short-term daily projections generated via predictive modeling. We calibrate stochastic variants of growth models and the standard SIR model into one Bayesian framework to evaluate their short-term forecasts. In summary, it was noted that none of the models proved to be golden standards across all the regions in their entirety, while all outperformed ARIMA in a predictive capacity as well as in terms of interpretability.


Introduction
COVID-19, a respiratory disease caused by the novel coronavirus SARS-CoV-2, has rapidly become an ongoing global pandemic. It has become the leading cause of death in the United States (U.S.) and is still spreading fast in almost every other country. Given the extent of the physical and economic suffering caused by the pandemic, there is an urgent public health need to better predict the spread of COVID-19 both locally and nationwide. Since the emergence of the COVID-19 outbreak, several predictive modeling approaches have been proposed to predict trends of the disease to come up with effective policies and measures to minimize casualties.
Broadly speaking, there are five types of approaches to forecasting the number of new cases variants of six different growth models (i.e. logistic, generalized logistic, Richards, generalized Richards, Bertalanffy, and Gompertz) and the standard SIR model, all of which can be specified by an ordinary differential equation(ODE), into one flexible Bayesian modeling framework. The main reason that why we limit our analysis to these two modeling approaches is that they can not only make both short and long-term forecasts, but also provide useful insights to understand the disease dynamics of COVID-19. The growth models provide an empirical approach without a specific basis on the mechanisms that give rise to the observed patterns in the cumulative infection data, while the compartmental models incorporate key mechanisms involved in the disease transmission dynamics in order to explain patterns in the observed data.
In our Bayesian modeling framework, the bottom-level is a negative binomial model that directly models infection count data and accounts for the over-dispersed observational errors. The top-level is from a choice of growth or compartmental models that characterizes a certain disease transmission dynamic through ODE(s). The Markov chain Monte Carlo (MCMC) algorithm is used to sample from the posterior distribution. The short-term forecasts are made from the resulting MCMC samples. We perform the rolling-origin cross-validation procedure to compare the prediction error of different stochastic models. In terms of regions, we consider the top 20 countries in terms of confirmed case numbers for a country-level analysis. It is observed that 1) as the models learned more and more data, the predictive performance improved in general for all the regions; 2) none of the models proved to be golden standards across all the regions in their entirety, while the ARIMA model underperformed all stochastic models proposed in the paper. Besides, we offer a graphical interface that allows users to interact with the future trends of COVID-19 at different geographic locations in the U.S. based on the real-time  data. This daily updated web portal is currently being used to inform local policy-makers and the general public.
The rest of the paper is organized as follows. Section 2 introduces a variety of stochastic epidemiological models under a unified Bayesian framework. The detailed MCMC algorithms, posterior inference, and forecasting procedures are described in Section 3. Section 4 mainly discusses an algorithmic approach to measure and compare the predictive performance of the proposed models and the benchmark ARIMA model. Besides, we analyze the COVID-19 data for the top 20 countries with the most confirmed cases, including the U.S. Finally, we conclude with remarks in Section 5 and provide information about implementations in Section 6.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101 In this section, we present a bi-level Bayesian framework for predicting new daily confirmed cases during a pandemic in a closed population. Section 2.2 introduces the bottom level, which directly models the observed counts while accounting for measurement errors. Section 2.3 and 2.4 describe two choices of the top level, characterizing the epidemic dynamics through growth curve or compartmental trajectories, respectively. Before introducing the main components, we summarize the possibly observable data as follows.

Data notations
Let C = (C 1 , . . . , C T ) be a sequence of cumulative confirmed case numbers observed at T successive equally spaced points in time (e.g. day) in a specific region, where each entry C t ∈ N for t = 1, . . . , T . let C 0 be the initial value andĊ = (Ċ 1 , . . . ,Ċ T ) be the lag one difference of C, whereĊ 1 = C 1 − C 0 and each following entryĊ t = C t − C t−1 , t = 2, . . . , T , i.e. the difference between two adjacent observations. In the analysis and modeling of a series of infectious disease daily report data, the time-series data could also be the cumulative death numbers, recovery case numbers, or their sums, denoted by D (Death), E (Recovery), and R (Removed), and their corresponding new case numbers, denoted byḊ,Ė, andṘ. Assuming a closed population with size N , the time-series data could also be the number of susceptible people, denoted by S, with each entry S t = N − C t . In reality, only confirmed cases and deaths are reported in most regions.
Recovery data are not available or suffer from under-reporting issues even if existing. Thus, our main goal is to make predictions of the future trend of an infectious diseases only based on the confirmed cases C.

Bottom-level: Time-series count generating process
We consider the new case number observed at time t, i.e.Ċ t , are sampled from a negative binomial (NB) model,Ċ t ∼ NB(g(C t−1 , Θ), φ), t = 2, . . . , T as it automatically accounts for measurement errors and uncertainties associated with the counts.
Here, we use NB(µ, φ), µ, φ > 0 to denote a NB distribution with expectation µ and dispersion 1/φ. We assume this stochastic process is a Markov process, where the present state (at time . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101 t) depends only upon its previous state (at time t − 1). Therefore, the NB mean is a function, denoted by g(·), of the case number observed at time t − 1, characterized by a set of interpretable/uninterpretable model parameters Θ. With this parameterization, the NB variance is µ + µ 2 /φ, indicating that φ controls the variance of measurement error. A small value leads to a large variance to mean ratio, while a large value approaching infinity reduces the NB model to a Poisson model with the same mean and variance. The probability mass function of a NB Thus, we can write the full data likelihood as (1) For the prior distribution of the dispersion parameter φ, we choose a gamma distribution, φ ∼ Ga(a φ , b φ ). We recommend small values, such as a φ = b φ = 0.001, for a non-informative setting (Gelman et al., 2006). Note that the proposed framework can be also viewed as a stochastic discrete-time state-space model with a negative binomial error structure. The proposed Bayesian models, on average, mimics the epidemic dynamics and is more flexible than those deterministic epidemiological models, as it accounts for measurement error and has the potential to incorporate existing information into the prior structure of Θ.
The underlying assumption is that the rate of growth of a population, organism, or infectious individuals eventually declines with size. The logistic curve (also known as sigmoid curve) is the simplest growth curve of continuous time u ∈ R. It is a non-negative symmetric 'S'-shaped curve with equation y(u) = K 1+exp(−λ(u−u 0 )) , where u 0 is the midpoint, K is the maximum value, and . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint λ reflects the steepness of the curve. It is clear to see that y(u) approaches K when u → ∞, while it converges to zero when u → −∞. In fact, the continuous curve y(u) is the solution of a first-order non-linear ODE, dy(u) du = λy(u) 1 − y(u) K with condition y(u 0 ) = K/2, where dy(u)/du can be interpreted as time-variant growth rate of the curve y. The above ODE reveals: 1) a non-negative growth rate, dy(u)/du > 0 as y(u) ∈ [0, K]; 2) an approximately exponential growth at the initial stage, y(u) ≈ exp(λu) as dy(u)/du ≈ λy(u) when y(u) → 0; 3) no growth at the final stage, y(u) dy(u)/du = 0 when y(u) → K; 4) a maximum growth rate of λK/4 occurred when y(u) = K/2, indicated by d 2 y(u)/du 2 = λdy(u)/du (1 − 2y(u)/K). Based on those curve characteristics, we can use the growth curve to characterize the trend of cumulative confirmed cases C.
In this paper, we mainly consider a family of growth curves that are derived from the generalized Richards curve (GRC), which is the solution to the following ODE, For those model-specific parameters in the context of epidemiology, K is the final epidemic size and should be an integer in the range of (0, N ], where N is the total population, λ ∈ R + is the infectious rate at early epidemic stage, p ∈ (0, 1) is known as scaling of growth, and α ∈ R + controls the curve symmetry. As our observed infectious disease data are usually counts collected at successive equally spaced discrete time points, we formulate the NB mean function g(·) based on the discrete version of (2), Table 1 provides a list of g(·)'s for growth curves with their characteristics. All the listed growth curves have been utilized and discussed in previous epidemiological studies. We include all of those choices in our framework excluding the last one, which is based on the generalized growth curve (GGC), because it lacks the final epidemic size K specification.
Without any existing information, we assume that K is from a discrete uniform distribution in its range and γ is from a gamma or a beta distribution, depending on the choice of growth curves. For instance, for both logistic and Gompertz curves, we assume γ ∼ Beta(a γ , b γ ), a natural modeling choice for parameter value restricted to the (0, 1) interval, and suggest to . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Top-level II: Compartmental model
The susceptible-infected-removed (SIR) model was developed to simplify the mathematical modeling of human-to-human infectious diseases by Kermack and McKendrick (1927). It is a fundamental compartmental model in epidemiology. At any given time u, each individual in a closed population with size N is assigned to three distinctive compartments with labels: susceptible (S), infectious (I), or removed (R, being either recovery or dead). The standard SIR model describes the flow of people from S to I and then from I to R by the following set of nonlinear where S(u), I(u), and R(u) are the population numbers of susceptible, infectious, and removed compartments measured in time u, subjecting to S(u) + I(u) + R(u) = N, ∀u. Another nature constraint is dS(u)/du + dI(u)/du + dR(u)/du = 0. Here, β ∈ R + is the disease transmission rate, γ ∈ R + is the removal rate, and their ratio R 0 = β/γ is defined as the basic reproductive is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint is a good indicator of the transmissibility of an infectious disease.
In this paper, we only consider the standard SIR model, although it is still feasible to design g(·)'s from its variations (see a comprehensive summary in Bailey et al. (1975)), such as the susceptible-infectious (SIS) model, the susceptible-infectious-recovered-deceases (SIRD) model, the susceptible-exposed-infectious-removed (SEIR) model, the susceptible-exposed-infectioussusceptible (SEIS) model, and their versions with the maternally-derived immunity compartment (Hethcote, 2000), as well as the recently developed extended-SIR (eSIR) model . For modeling discrete time-series data, we use the discrete-time version of the standard SIR model, The model has three trajectories, one for each compartment. The compositional nature of the three trajectories implies that we only need two of the three sequence data, e.g. S t = N − C t and R t for t = 1, . . . , n. However, recovery data only exist in few regions, and suffer from underreporting issue even if existing, which makes both model inference and predictions infeasible.
Alternatively, we consider both of the removed and actively infectious cases as missing data and mimic their relationship in spirit to some compartmental models in epidemiology. Specifically, we assume the number of new removed cases at time t, i.e.Ṙ t , is sampled from a Poisson distribution with mean γI t−1 , that is, should be specified. Such a strategy but with different error structure was also considered in some other compartmental models in epidemiology (see e.g. Siettos and Russo, 2013;Anastassopoulou et al., 2020;Wang et al., 2020). We can estimate the value of γ from publicly available highquality data where confirmed, deaths, and recovery cases are all well-documented, or from prior epidemic studies due to the same under-reporting issue in actual data. In this paper, we choose the removal rate γ = 0.1 as suggested by Pedersen and Meneghini (2020) and Weitz et al. (2020).
Based on this simplification, we rewrite the first equation in (4) as, . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint Thus, we formulate the NB mean function g(·) for the standard SIR model as, where R can be sequentially inferred from C.
Without any existing information, in our Bayesian framework we assume β from a gamma distribution with hyperparameters that makes both the mean and variance of the transformed variable R 0 = β/γ equal to 1, that is, β ∼ Ga(1, 1/γ).

Model Fitting
In this section, we briefly describe the MCMC algorithm for posterior inference and forecasting.
Our Bayesian inferential strategy allows us to simultaneously infer all model-specific parameters and quantify their uncertainties.

MCMC algorithms
We first describe how to update the dispersion parameter φ in the proposed Bayesian framework, as it does not depend on the choice of models. At each MCMC iteration, we perform the following step: Update of dispersion parameter φ: We update φ by using a random walk Metropolis-Hastings (RWMH) algorithm. We first propose a new φ * , of which logarithmic value is generated from N log φ, τ 2 φ and then accept the proposed value φ * with probability min(1, m MH ), where the Hastings ratio is Here we use J(· ← ·) to denote the proposal probability distribution for the selected move. Note that the last term, which is the proposal density ratio, cancels out for this RWMH update.

Top-level as a growth model
We only present the updates of each parameters in the GRC model, as all other derivative models are its special cases. Our primary interest lies in the estimation of the final pandemic size K and the infectious rate at early epidemic stage λ.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10. 1101 Update of final epidemic size parameter K: We update K by using a RWMH algorithm. We first propose a new K * , of which logarithmic value is generated from a truncated Poisson distribution Poi (log K) within [log C T , log N ] and then accept the proposed value K * with probability min(1, m MH ), where the Hastings ratio is Note that with a discrete uniform prior on K, the last two terms cancel out for this RWMH update.
Update of infectious rate parameter λ: We update λ by using a RWMH algorithm.
We first propose a new λ * , of which logarithmic value is generated from N log λ, τ 2 λ and then accept the proposed value λ * with probability min(1, m MH ), where the Hastings ratio is Note that the last term, which is the proposal density ratio, cancels out for this RWMH update.
Update of growth scaling parameter p: We update p by using a RWMH algorithm.
We first propose a new p * , of which logarithmic value is generated from a truncated normal distribution N log p, τ 2 p within [−∞, 0] and then accept the proposed value p * with probability min(1, m MH ), where the Hastings ratio is Note that with a uniform prior on p over its range [0, 1], the last two terms cancel out for this RWMH update.
Update of symmetry parameter α: We update α by using a RWMH algorithm. We first propose a new α * , of which logarithmic value is generated from N log α, τ 2 α and then accept the proposed value α * with probability min(1, m MH ), where the Hastings ratio is Note that the last term, which is the proposal density ratio, cancels out for this RWMH update.

Top-level as a compartmental model
Our primary interest lies in the estimation of the reproductive number R 0 = β/γ, which reflects the transmissibility of the disease. With our assumption that γ is given, we propose the following updates in each MCMC iterations.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint Generate R based on C: We assume I 1 = C 1 , i.e. all the confirmed cases are capable to pass the disease to all susceptible individuals in a closed population at the very beginning. In other words, R 1 = 0. Then we sample R 2 ∼ Poi(γI 1 ), where γ is a pre-specified tuning parameter.
Due to the compositional nature, we can compute I 2 = I 1 +Ċ 2 −R 2 , whereĊ 2 = C 2 −C 1 is the new confirmed cases and R 1 is the total removed cases from the actively infectious population at time 2. Next, we repeat this process of sampling R t ∼ Poi(γI t−1 ) and computing I t = I t−1 +Ċ t − R t , to generate the sequence R.
Update of reproduction number parameter β: We update β by using a RWMH algorithm. We first propose a new β * , of which logarithmic value is generated from a truncated normal distribution N log β, τ 2 β and then accept the proposed value β * with probability min(1, m MH ), where the Hastings ratio is Note that the last term, which is the proposal density ratio, cancels out for this RWMH update.

Posterior inference
We obtain posterior inference by post-processing the MCMC samples after burn-in. Suppose that a sequence of MCMC samples on θ, θ ∈ {φ, K, λ, p, α, β}, θ (1) , . . . , θ (U ) have been collected, where u, u = 1, . . . , U indexes the iteration after burn-in. An approximate Bayesian estimator of each parameter can be simply obtained by averaging over the samples, θ = U u=1 θ (u) /U . In additional to that, a quantile estimation or credible interval for each parameter of interest can be obtained from this sequence as well.

Forecasting
Based on the sequences of MCMC samples on K, λ, p, and α in the growth model or β in the compartmental model, we can predict the cumulative or new confirmed cases at any future time T f by Monte Carlo simulation. Particularly, from time T + 1 to T f , we sequentially generatė C (u) t ∼ NB(g(C t−1 , Θ (u) ), φ (u) ), t = T + 1, . . . , T f .
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint Then, both short and long-term forecast can be made by summarizing the (T f − T )-by-U matrix of MCMC samples. For instance, the predictive number of cumulative and new confirmed cases T +1 /U , respectively.

Results
In this section, we discuss the findings of COVID-19 data analysis. We first implemented each of the growth models listed in 19 studies also based their analyses on this resource (Dong et al., 2020;Zhou and Ji, 2020;Toda, 2020). For our MCMC algorithms, we set 100, 000 iterations with the first half as burn in and chose weakly informative priors described in Section 2. Both numerical and graphical summaries for posterior inference and short-term forecasting are presented. Our final goal is to compare the predictive performance of all models, taking ARIMA as a benchmark model.

Forecasting of daily confirmed cases in U.S.
We first present the forecasting of U.S. daily confirmed cases made by the ARIMA and our is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101 millions) by the three models, respectively. To account for the reasons behind the discrepancy in forecasts and to measure the validity of the results, there is a need for an appropriate strategy to evaluate and compare the predictive performance of the concerned models.

Model comparison through rolling-origin cross-validation
Cross-validation (CV) is a resampling procedure used to evaluate regression and classification models on a limited data sample. The procedure randomly splits all data samples into two parts: training and testing sets, where the former is used to fit a model and the latter is used to evaluate the model's prediction performance in terms of a certain error measure. The key assumption of CV is that all data points should be independent and identically distributed (i.i.d.).
Unfortunately, time-series data is serially autocorrelated i.e. the observations are dependent on the time they were recorded on. To circumvent this situation, Tashman (2000) discussed a rolling-origin CV (ROCV) technique that splits the data into training and testing sets without hampering the i.i.d. assumption. An adaptation of this method is used here to evaluate the shortterm forecasting performance among different top-level choices under the proposed Bayesian . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint framework and ARIMA. Figure 2 shows the ROCV representation for an example of time-series data (T = 17). In our analysis, the choice of initial training sample size was set to seven days so as to evaluate how well the models are able to generate forecasts during the initial phase of the pandemic, while the testing sample size was chosen to be three days to meet with our objective of comparing short-term forecasting performance. We define the first day t = 1 as the date when the 100-th case was confirmed, so it varied for different countries.
time Figure 2: A visual guide to rolling-origin cross-validation (ROCV), where the total sample size T = 17, the initial training sample size is 9, and the testing sample size is 3. The green, orange, and white circles are training, testing, and unused samples in one CV iteration.
A CV algorithm needs a predictive error metric that could quantify model performance in terms of forecasting accuracy. Root mean square error (RMSE) and mean absolute deviations (MAD) are candidates of error measures for out-of-bag predictions but are dependent on scale. As a result, large values may influence the errors to be larger. Mean absolute percentage error (MAPE) has been a widely used predictive measure due to its interpretability and its independence from scale. Although, the distribution of such percentage errors can be skewed if the data consists of values close to zero. Moreover, there is a possibility of this measure being undefined by having a zero in the denominator. To address these issues, Tashman (2000) proposed an improved percentage error metric namely, symmetric mean absolute percentage error (sMAPE). This metric was considered in our analysis as it circumvented the problem of having an undefined measure and provided better symmetry as compared to MAPE. In all, we summarize the evaluation procedure used in this paper as follows.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint 1 Store the data starting day 1 to day T ; 2 Initialize the number of initial training observations k (k = 7); 3 Set the size of the testing set ω (ω = 3); Learn the first k observations (green circles) as training data; 6 Hold out the next k + 1, . . . , k + ω observations (orange circles) as the testing data; we can see, all the models performed poorly in the early stage but as more and more data became available to be learned, the predictive performance gradually improved as the sMAPE dropped.
It can also be observed that the ARIMA and SIR models were performing a lot worse in general than the growth models in the early phase. This could be attributed to the fact that ARIMA not having the growth specific parameters, unable to detect the early growth. On the other hand, making assumptions of a fixed transmission rate γ and due to the under-reported data issue, SIR performed poorly. While, the stochastic growth curves were able to learn the trend of epidemiological data in the initial phase with the help of the growth and scaling parameters.
Although, towards the latter half of the epidemic, all the models were performing equally well.
Hence, it is hard to conclude that any one particular dominated the entire duration of the epidemic. Now, the question arises of whether we could pick one model which has the best predictive performance on an average for any particular country. To answer that we construct a Cleveland dot plot, as shown in Figure 4, that allows us to rank the model performance averaged over . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101   In general, the GRC and GLC models were consistent performers throughout all countries due to their ability to detect sub-exponential growth rates at an early stage of an epidemic. In most cases, the GRC and Richards models were the best performers in countries that did not have symmetric 'S'-shaped growth patterns and displayed randomness as well as multiple peaks.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. This is due to the inclusion of the scale parameter α that could account for any asymmetry in the data. Countries including the U.S., Peru, Saudi Arabia, Iran, Turkey, and France display multiple peaks in the daily confirmed case counts. As a result, the Richards model performs the best in the U.S., United Kingdom (U.K.)., and Peru, while the GRC model dominates in the rest of the countries having multiple peaks. Moreover, a random structure was observed in countries like Brazil, Chile, Bangladesh, and Mexico. GRC being the most complex model out of all the other growth models performed the best in these countries. On the other hand, the GLC model usually performed better in countries that had a single peak and attributed an approximate 'S'-shaped curvature. The GLC model was able to generalize better than the GRC model when the data was well structured and had less randomness. Countries including Argentina, Pakistan, . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10. 1101 Germany, Colombia, and India attributed a single peak without much randomness. As a result, the GLC model was a better performer in these countries. Whereas, in the case of South Africa, the usual growth models performed the worst due to a staggering growth rate in the initial and the middle phase of the epidemic. The SIR model performed the best out of the worst while the logistic model performed well due to its simplicity. On the other hand, the Gompertz model was the best performer in Russia, Spain, and Italy as it generalizes better than the other models.

Conclusion
In this paper, we developed a number of stochastic variants of growth and compartmental models under a unified Bayesian framework. A theoretical comparison of growth models has been discussed in greater detail in the literature (see e.g. Chowell et al., 2014Chowell et al., , 2016Chowell, 2017;Pell et al., 2018;Chowell et al., 2020;Jia et al., 2020;Roosa et al., 2020;Wu et al., 2020). However, to our best knowledge, no work systematically compares their performances between all pairs as well as against a compartmental model such as the SIR model and a time-series forecasting model such as the ARIMA. Based on our analysis, we conclude that the proposed Bayesian framework not only allows room for interpretation but also offers an exemplary predictive performance when it comes to COVID-19 daily report data. Moreover, ARIMA being a pure learning algorithm is not able to match with the forecasting accuracy of stochastic models, let alone the model parameters of ARIMA do not provide any information on epidemiological interests.
For future work, we aim to develop an ensemble model, which can aggregate the prediction of each base model and results in once final prediction for the unseen data. Note that Chowell et al. (2020) have recently introduced a GGM-GLM ensemble model and compared forecasting performances of that with the individual models for the Ebola Forecasting Challenge . It was reported that the ensemble model outperformed the others under some circumstances. We also plan to perform long-term forecasting evaluation using some epidemic features described in Tabataba et al. (2017). Chowell et al. (2019) have recently developed a sub-epidemic wave model that could detect multiple peaks in the data, which has the potential to improve forecasting performance. Thus, developing stochastic growth models via the addition of a change-point detection mechanism to account for multiple peaks is worth investigating. In this regard, we have demonstrated that an approach that combines a change-point detection model and a stochastic SIR model could significantly improve the short-term forecasting of the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint new daily confirmed cases . Other tabs offer different graphs for summarizing the model performance, long-term forecasting, and county-level spatial maps. This paper introduces a user-friendly interactive web application (https://qiwei.shinyapps.

Software
io/PredictCOVID19/) integrated with R Shiny package. Shiny is a web platform that allows users to interact with real-time data and use a myriad of visualization tools to analyze it. The web application has been developed to help the general public assess both short and long-term forecasts of COVID-19 across the U.S. at both state and metropolitan-level. The numbers of cumulative or new daily confirmed cases as well as deaths are projected by different growth models and the SIR model under the proposed Bayesian framework. Alongside the numerical summaries, users can view and interpret the trends that cover the same information. To validate . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10.1101/2020.10.11.20210971 doi: medRxiv preprint the short-term forecasting, numerical and graphical summaries of MAE and MAPE of the predictions are provided for the more advanced users. Moving on to the long-term forecasting, the models estimate the peak number of cases and deaths as well as their respective dates. Moreover, a predictive estimate for the final size and date is also offered. Finally, for the users that are keen to visualize the currently observed cases at a geographical level, the website offers county-level spatial maps.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 14, 2020. . https://doi.org/10. 1101