Assessing the Performance of COVID-19 Forecasting Models in the U.S.

Dozens of coronavirus (COVID-19) forecasting models have been created; however, little information exists on their performance. Here we examine the performance of nine commonly used COVID-19 forecasting models, as well as equal- and performance-weighted ensembles, based on their 'knowledge' (i.e., accuracy and precision) and their 'self-knowledge' (i.e., 'calibration' and 'information'). Calibration and information are measures commonly employed in structured expert judgment to assess an expert's ability to meaningfully communicate the extent and limits of their knowledge. Data on observed COVID-19 mortality in four states, selected to reflect differences in racial composition and COVID-19 case rates, over eight weeks in the summer of 2020 provided the basis for evaluating model predictions. Only two models showed little bias (geometric mean of observed/predicted within 10% of 1) and good precision (geometric standard deviation of observed/predicted < 1.6). Three models demonstrated good calibration and information. However, only one model exhibited superior performance in both dimensions. Nearly all models under-predicted COVID-19 mortality, some quite substantially. Further, model performance depends on racial composition and case rates, and forecasts in the short-term outperform forecasts in the medium-term on all criteria. The performance-weighted ensembles also outperformed the equal-weighted ensemble on all criteria. The ability of models to accurately and precisely predict mortality and the ability of the modelers to provide meaningful characterizations of the uncertainty in their estimates are potentially important to model developers and to those using model output to inform decisions.


Background
Effective non-pharmaceutical interventions (NPIs), or community mitigation strategies, are crucial in combating the spread of contagious illnesses like coronavirus disease 2019 (COVID-19). 2 Community NPIs, such as social distancing guidelines, restrictions, closures, and lockdowns, can effectively delay and diminish an epidemic peak, also known as flattening the epidemic curve. 2,3,4 However, these NPIs can also have immediate educational and economic consequences. 5,6,7 To make decisions on the implementation of community NPIs amidst the COVID-19 pandemic, the relevant stakeholders (e.g., government officials, community leaders, school administrators, etc.) may desire estimates of the number of coronavirus cases, hospitalizations, and deaths likely to occur in their region of interest within the coming weeks.
Information about, and forecasts from, dozens of COVID-19 forecasting models are currently available via the University of Massachusetts Amherst Reich Lab's COVID-19 Forecast Hub. 8 All of the participating modeling groups provide central estimates, and most also quantitatively characterize the uncertainty in their predictions, often giving an interquartile range (25% LCL to 75% UCL) and a 90% confidence interval (5% LCL to 95% UCL) for each prediction. 9 Unfortunately, little peer-reviewed information about the accuracy or precision of the estimates is widely available. Thus, stakeholders lack a scientific basis for deciding which models to trust and how much confidence to place in the forecasts they provide. Quite recently, one study has become available. 10 It compares the median absolute percent error of eight models stratified across world region, month of model estimation, and weeks of extrapolation. 10 Collectively, the eight models released in July over a twelve-week forecasting range had a median absolute percent error of about 25%, with errors tending to increase with longer forecasts and the best performing model varying by region. 10 While this information is quite useful and provides a sense of the typical bias of model predictions, other aspects of model performance may also be of interest. Users may also care about the precision of estimates and about the modeler's self-knowledge (i.e., their ability to properly characterize the uncertainty in their estimates).
Some may wonder whether better forecasts might be obtained by averaging the forecasts of two or more individual models. One such 'ensemble' model has been created by the aforementioned Reich Lab. 11 It involves an equal-weighted combination of individual model forecasts and is not performance-weighted. 11 In the expert judgment literature it has been clearly demonstrated that performance-weighted combinations of expert opinion consistently outperform equally weighted combinations. 12,13 This deserves consideration in the analysis of COVID-19 model projections.
The analysis presented below compares model forecasts with subsequent observations using several measures of model performance, reflecting 'knowledge' (bias and precision) and 'self-knowledge' (calibration and information). We also construct three ensemble models, one equal-weighted and two performance-weighted, and compare their performance with each other and with the best performing individual models. Lastly, we aim to explore whether the available models tend to provide better forecasts under certain circumstances (i.e., case rates, racial demographics, and forecast periods).

Data
Our analysis involves a comparison of model forecasts with subsequent observations of weekly deaths from COVID-19 in four states (Idaho, Louisiana, New York and Maine) over an eight-week period.
The states considered in our analysis were selected on the basis of recent case rates of COVID-19 (cases/100,000 population within the previous week) and racial composition (majority non-Hispanic Black vs. majority non-Hispanic White). Racial composition was of interest as the COVID-19 mortality rate for non-Hispanic Black Americans is 2.1 times that of non-Hispanic White Americans. 14 With these two domains in mind, our goal was to assess two states with relatively high case rates (Idaho and Louisiana); two with relatively low case rates (Maine and New York); two with a relatively high fraction of population reported as non-Hispanic Black (Louisiana and New York); and two with a relatively high fraction of population reported as non-Hispanic White (Idaho and Maine). This was done to assess how models perform forecasting for states under varying circumstances. More detail on how the case rates and racial composition for states were determined, as well as how states were selected, is available in the supplemental material (S.1.1.).
We were also interested in the models' ability to forecast COVID-19 deaths in both the near-term and the medium-term. Near-term performance was gauged using projected COVID-19 deaths in the week immediately after the forecast was made. Medium-term performance was gauged using projected COVID-19 deaths in the week ending four weeks after the forecast was made. Model performance for the four states and the two forecast periods of interest (week ending one week in the future and week ending four weeks in the future) was evaluated twice: once for forecasts made on June 13th, 2020, and a second time for forecasts made on July 11th, 2020 (no overlap in forecasts). In total, 16 comparisons of model forecasts with observed deaths were made for each model.
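The count of 16 comparisons per model follows from crossing four states, two forecast dates, and two horizons. A minimal sketch of that design is given below. The variable names are illustrative, and the rule that each target week ends exactly 7 or 28 days after the forecast date is an assumption of this sketch rather than something stated above.

```python
from datetime import date, timedelta
from itertools import product

states = ["Idaho", "Louisiana", "Maine", "New York"]
forecast_dates = [date(2020, 6, 13), date(2020, 7, 11)]
horizons_weeks = [1, 4]  # near-term and medium-term horizons

# Assumption: the target week ends 7 * horizon days after the forecast date.
comparisons = [
    (state, made_on, weeks, made_on + timedelta(weeks=weeks))
    for state, made_on, weeks in product(states, forecast_dates, horizons_weeks)
]

print(len(comparisons))  # 4 states x 2 forecast dates x 2 horizons
for state, made_on, weeks, target_end in comparisons[:4]:
    print(state, made_on, f"{weeks} wk ahead", "-> week ending", target_end)
```

Under this assumption the latest target week ends eight weeks after the first forecast date, which matches the eight-week study period.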
Of the many models providing data to the Reich Lab's data repository, only those for which all 16 forecasts were available were included in our analysis. These include: OliverWyman-Navigator (Model A) 15 22 , and Covid19Sim-Simulator (Model I) 23 .

Performance Criteria
Two aspects of model performance were evaluated: (i) the accuracy and precision of the model's central estimates; and (ii) the uncertainty of a model's estimates, as reflected by its reported confidence intervals.
First, to evaluate 'knowledge' (i.e., the performance of the models' central estimates), each observation, $O_{i,j}$, was divided by the corresponding prediction, $P_{i,j}$, where $i$ is an index indicating the model and $j$ is an index reflecting the date, state, and time interval:
$$R_{i,j} = \frac{O_{i,j}}{P_{i,j}}$$
The distribution of the resulting ratios was then examined. For each model, $i$, the geometric mean (GM) and geometric standard deviation (GSD) of the distribution of $R_{i,j}$ were computed and used as measures of the observed bias and precision of the model's estimates.
Second, to evaluate 'self-knowledge' (i.e., the modelers' ability to characterize the uncertainty of their forecasts), the performance of each model was assessed using Cooke's Classical Method (CM). 1 This method was initially designed for evaluation of the performance of formally-elicited structured expert judgment (SEJ), where an expert's ability to meaningfully characterize the uncertainty in his or her estimates is arguably as important as the predictions they provide, and has been employed in many studies. 24,25 We believe the CM is also applicable to the forecasts given by the models, as the true observations are unknown at the time of forecasting and the modeling groups may act as the 'expert' while their forecasts may serve as their 'judgment'.
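The 'knowledge' measures described above (the GM and GSD of the observed/predicted ratios) can be sketched as follows. This is an illustration under stated assumptions, not the authors' spreadsheet calculation: the function name and the example numbers are hypothetical, and the use of the sample (n - 1) standard deviation of the log ratios is an assumption.

```python
import math

def geometric_mean_and_gsd(observed, predicted):
    """Return (GM, GSD) of the ratios R_j = O_j / P_j for one model."""
    # Work on the log scale: GM = exp(mean(ln R)), GSD = exp(sd(ln R)).
    log_ratios = [math.log(o / p) for o, p in zip(observed, predicted)]
    n = len(log_ratios)
    mean_log = sum(log_ratios) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    var_log = sum((x - mean_log) ** 2 for x in log_ratios) / (n - 1)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))

# Hypothetical weekly death counts, for illustration only (not study data).
observed = [120.0, 80.0, 45.0, 200.0]
predicted = [100.0, 80.0, 50.0, 160.0]
gm, gsd = geometric_mean_and_gsd(observed, predicted)
print(f"GM = {gm:.2f}, GSD = {gsd:.2f}")
```

A GM above 1 means observations tended to exceed predictions (underprediction), and a GSD close to 1 means the ratios are tightly clustered. A prediction of exactly zero makes the ratio undefined, so it has to be replaced by a small positive value or dropped before the ratios are formed.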
medRxiv preprint (not certified by peer review), this version posted December 11, 2020. doi: https://doi.org/10.1101/2020.12.09.20246157. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Cooke's CM assesses 'calibration', $C$, using Shannon's relative information statistic, $I_s$, which compares the assessed and observed probabilities of calibration variables falling within various ranges.
Here, $s_k = O_{i,k}/m$ is the number of realizations falling within the $k$-th of $m$ quantiles given by the $i$-th model, $p_k$ is the probability given by the model (the 'stated probability') that observations will fall within that quantile, and $\chi^2_m$ is the Chi-squared statistic with $m$ degrees of freedom. If $s_k = p_k$ for all $k$, then $I_s = 0$ and $C = 1$. As the divergence between stated and observed probabilities increases, $I_s$ increases and $C$ moves toward 0.
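For orientation, the calibration score in Cooke's Classical Method is commonly written in the form below. This is a sketch of the standard formulation, not a transcription of the authors' equations. It assumes that $s_k$ is the empirical fraction of realizations falling in the $k$-th inter-quantile interval, that $N$ is the number of realizations scored for a model (a symbol not defined in the text above), and that $F_{\chi^2_\nu}$ is the cumulative distribution function of a Chi-squared variable with $\nu$ degrees of freedom. The text above states $m$ degrees of freedom, whereas many presentations use the number of intervals minus one, so $\nu$ is left generic here. Terms with $s_k = 0$ are taken to be zero.

```latex
I(s, p) = \sum_{k} s_k \ln\!\left(\frac{s_k}{p_k}\right),
\qquad
C = 1 - F_{\chi^2_{\nu}}\!\bigl(2 N \, I(s, p)\bigr)
```

With $s_k = p_k$ for all $k$, the sum vanishes, $F_{\chi^2_\nu}(0) = 0$, and $C = 1$, in agreement with the limiting behavior described above.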
Cooke's CM assesses 'information', $I$, by comparing the width of the confidence intervals given by each model with the 'intrinsic range' of each calibration variable. The intrinsic range for a variable is defined as the difference between the largest forecasted or observed value and the smallest forecasted or observed value. This range is expanded slightly by multiplying it by a user-defined expansion factor, $1 + F$, where $F$ is typically a small fraction (for our data, the default 10% was used). Using this framework, the information of each expert, $I_e$, on each variable is defined as:
$$I_e = \sum_{k} p_k \ln\left(\frac{p_k}{r_k}\right)$$
where $p_k$ are the probabilities given by the model, $e$, and $r_k$ are the probabilities from a uniform (or loguniform) probability density function over the intrinsic range. Models which concentrate their forecasts in a narrow range will have high information scores.
From these two scores, Cooke calculates performance weights as the product of calibration and information and then normalizes these so that they sum to 1 across all models.
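The scoring steps described above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not a re-implementation of EXCALIBUR. The function names, the six-bin layout (5%, 25%, 50%, 75% and 95% quantiles), the use of the number of bins minus one as the degrees of freedom, the even split of the expansion factor between the two ends of the intrinsic range, and all example numbers are assumptions of this sketch.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series in sqrt(x).
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

def relative_information(s, p):
    """Shannon relative information: sum_k s_k ln(s_k / p_k), with 0 ln 0 = 0."""
    return sum(sk * math.log(sk / pk) for sk, pk in zip(s, p) if sk > 0)

def calibration(bin_counts, stated_probs):
    """Calibration score from counts of realizations per inter-quantile bin."""
    n = sum(bin_counts)
    s = [c / n for c in bin_counts]
    stat = 2.0 * n * relative_information(s, stated_probs)
    # Assumption: degrees of freedom = number of bins - 1.
    return chi2_sf(stat, len(bin_counts) - 1)

def information(quantile_values, stated_probs, lower, upper, expansion=0.10):
    """Information of one forecast relative to a uniform background measure."""
    # Assumption: the intrinsic range is widened by expansion/2 at each end,
    # so its total width is multiplied by (1 + expansion).
    pad = expansion * (upper - lower) / 2.0
    lo, hi = lower - pad, upper + pad
    edges = [lo] + list(quantile_values) + [hi]
    r = [(b - a) / (hi - lo) for a, b in zip(edges, edges[1:])]
    return sum(pk * math.log(pk / rk) for pk, rk in zip(stated_probs, r))

def cooke_weights(calibrations, informations):
    """Normalize calibration * information so the weights sum to 1."""
    raw = [c * i for c, i in zip(calibrations, informations)]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical inputs for two models, for illustration only (not study data).
stated = [0.05, 0.20, 0.25, 0.25, 0.20, 0.05]
cal = [calibration([1, 3, 4, 4, 3, 1], stated),   # realizations spread as stated
       calibration([0, 0, 1, 2, 3, 10], stated)]  # most realizations above the 95% limit
info = [information([40, 55, 70, 90, 120], stated, 20, 200),
        information([60, 66, 70, 75, 82], stated, 20, 200)]
print(cal, info, cooke_weights(cal, info))
```

In this toy example the second model is more informative (narrower intervals) but very poorly calibrated, so nearly all of the normalized weight goes to the first model. The information function assumes strictly increasing quantile values that lie inside the expanded range.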

Ensemble Models
It is possible that better performance might be obtained by producing an ensemble based on weighted combinations of the forecasts given by the individual models. Two approaches of potential interest are: (i) equal-weighted combinations, and (ii) performance-weighted combinations. In addition to an equal-weighted model, two performance-weighted ensembles are considered, based on: (a) inverse-variance weights and (b) Cooke's weights.
For inverse-variance performance weighting, each random variable is weighted in inverse proportion to its variance (i.e., proportional to its precision). This weighting method is commonly used in meta-analysis. 26 The equal-weighted model is established by equally weighting the probability densities given by the individual model forecasts. The inverse-variance-weighted model and the Cooke's-weighted model both use performance-weighted averaged densities of the individual model forecasts.
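The weighted averaging of densities described above can be sketched as a mixture (linear pool) of the individual forecast distributions. The sketch below is illustrative only. It assumes that each model's variance is taken as the squared log of its GSD, that each forecast distribution can be approximated by a piecewise-linear CDF through its reported quantiles with hypothetical outer bounds, and that ensemble quantiles are found by bisection. None of these choices is confirmed by the text, and all names and numbers are hypothetical.

```python
import math
from bisect import bisect_right

def inverse_variance_weights(gsds):
    """Weights proportional to 1 / variance, with variance taken as (ln GSD)^2."""
    # Assumption: variance is measured on the log scale; requires GSD > 1.
    precisions = [1.0 / math.log(g) ** 2 for g in gsds]
    total = sum(precisions)
    return [p / total for p in precisions]

def piecewise_linear_cdf(values, probs):
    """CDF interpolating linearly between (value, cumulative probability) knots."""
    def cdf(x):
        if x <= values[0]:
            return probs[0]
        if x >= values[-1]:
            return probs[-1]
        i = bisect_right(values, x) - 1
        frac = (x - values[i]) / (values[i + 1] - values[i])
        return probs[i] + frac * (probs[i + 1] - probs[i])
    return cdf

def mixture_cdf(x, weights, cdfs):
    """CDF of the weighted mixture of the model forecast distributions."""
    return sum(w * cdf(x) for w, cdf in zip(weights, cdfs))

def mixture_quantile(q, weights, cdfs, lo, hi, tol=1e-6):
    """Quantile of the mixture, found by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, cdfs) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical forecasts: outer bounds plus 5/25/50/75/95% quantiles.
probs = [0.0, 0.05, 0.25, 0.50, 0.75, 0.95, 1.0]
cdfs = [piecewise_linear_cdf([20, 40, 55, 70, 90, 120, 160], probs),
        piecewise_linear_cdf([50, 60, 66, 70, 75, 82, 95], probs)]

schemes = {"equal": [0.5, 0.5],
           "inverse-variance": inverse_variance_weights([2.0, 1.6])}
for name, w in schemes.items():
    q05, q50, q95 = (mixture_quantile(q, w, cdfs, 0.0, 200.0) for q in (0.05, 0.50, 0.95))
    print(name, [round(x, 2) for x in w], round(q05, 1), round(q50, 1), round(q95, 1))
```

Cooke's weights would be plugged in the same way: only the weight vector changes, while the pooling step stays the same.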

Individual Model Performance
First, in order to visualize how accurate each model's predictions (i.e., central estimates) are, we plotted each prediction by its corresponding realization. Our results in Figure 1 illustrate substantial differences in model accuracy.
Table 1. This table suggests that there are substantial differences in performance among these models.
Focusing initially on the central estimates, we see that: (i) typically, model predictions have little bias (≤ 35%), but (ii) for models B, C and E, bias is substantial. All models, except model C, have a bias score of > 1, indicating systematic underprediction of observed COVID-19 mortality. Looking at the precision of model central estimates, we see that: (i) typically, model predictions are within a factor of ~2 of the observed values, but (ii) models G and H yield much less precise predictions (within factors of ~4 and ~3, respectively), and (iii) models A and I appear to offer the most precise predictions (1.57 and 1.47, respectively). Thus, if judged by the performance of their central estimates, models A and I would seem to be the most attractive, with bias < 10% and precision of < 1.6.
When we look instead at the performance of model predictions in the context of their uncertainty intervals, a different picture emerges. From this perspective, only three (or perhaps four) of the nine models considered perform at all well: models A, D, F, and to a lesser extent, G. The calibration scores of all the other models are quite low (<< 0.01), indicating 'overconfidence', i.e., their stated confidence intervals are far too narrow and often fail to capture the true value. The information scores vary from 0.5 to 3.2, suggesting substantial differences in the width of the stated confidence intervals. However, the models that have the highest information scores (models E, I, G and B) all have extremely low calibration scores (<< 0.01), suggesting that their self-confidence is misplaced.
i This value relies on treating model G's four-weeks ahead prediction of 0 for the week ending on July 11th in Idaho as 0.1. If instead this model G prediction is dropped from the analysis, then the GSD becomes 2.19 and the geometric mean becomes 1.00.
When calibration and information are considered simultaneously (i.e., the unnormalized Cooke's weight), models D, A and F appear to perform admirably.

Ensemble Models
Each ensemble model was also assessed for all performance measures. Performance measure results for each of the ensemble models are summarized in Table 2. The results make clear that both performance-weighted ensembles outperform the equal-weighted combination of models. There is no measure of performance for which the equal-weighted combination outperforms both of the performance-weighted ensembles.
The inverse-variance-weighted ensemble demonstrates substantially less bias (0.98) than the equal-weighted combination (1.12). The Cooke's-weighted ensemble reflects somewhat less bias (1.12) than the equal-weighted combination but does not match the performance of the inverse-variance-weighted ensemble in this regard. In terms of precision, again the inverse-variance-weighted combination performs best (1.41), with the Cooke's-weighted ensemble (1.57) reflecting no better precision than the equal-weighted combination (1.60).
The differences in performance are more noticeable when self-knowledge (calibration, information, and Cooke's weight) is of interest. The Cooke's-weighted ensemble far outperforms the other two models in both calibration (0.54) and Cooke's weight (0.43), suggesting that it better quantifies uncertainty in its predictions (i.e., it is more accurate and provides more concentrated forecast intervals).

Performance by Domain
It is also interesting to compare these models' performance across the three domains of interest: (i) race (i.e., states which are heavily non-Hispanic White vs. states with relatively large non-Hispanic Black populations), (ii) COVID-19 case rates (i.e., states with a relatively low number of weekly cases per 100,000 population vs. states with a relatively high number of weekly cases per 100,000 population), and (iii) forecast period (i.e., forecasts of mortality for the upcoming week vs. forecasts of mortality for the week ending four weeks from the date on which the forecast was made). Table 3 evaluates the performance of equal-weighted combinations of the nine models stratified by different domains (eight forecasts per domain).
ii A hypothesis rejection significance level of 0.05 was used for this ensemble.
Table 3: Equal-weighted ensemble performance stratified by different domains assessed on bias (GM), precision (GSD), calibration, information, and unnormalized Cooke's weight.

Sub-Domain Bias -GM Precision -GSD Calibration Information Cooke's Weight (Unnormalized)
High % non-Hispanic White (ID, ME)
There appear to be systematic differences in the ability of these models to forecast COVID-19 mortality depending on whether they are being used to: (i) project deaths in the near or medium-term, (ii) project deaths in states which are largely non-Hispanic White or in those which have substantial non-Hispanic Black populations, and (iii) project deaths in states with low or high COVID-19 case rates.
The largest differences are seen in calibration. Model calibration is much better in states with large non-Hispanic White populations (0.60) than in states with large non-Hispanic Black populations (0.05). It is also somewhat better when making forecasts of deaths for the next week (0.27) than the week ending four weeks in the future (0.05), and in states with high case rates (0.32) than in those with low case rates (0.14).
Model forecasts are essentially unbiased (< 10% mean error) when making projections of deaths for the next week or deaths in states where case rates are low. However, these models appear to systematically underpredict deaths for the week ending four weeks in the future (GM = 1.62) and in states with high case rates (GM = 1.42). The racial composition of the state seems to have little effect on the bias of model projections.
The precision of model estimates also differs by domain, with projections being substantially less precise in (a) states with large non-Hispanic White populations (GSD = 1.86) than in those with substantial non-Hispanic Black populations (GSD = 1.34), and (b) states with high case rates (GSD = 1.77) than in those with low case rates (GSD = 1.39). Perhaps surprisingly, there are only small differences in the precision of near-term and medium-term projections.

Discussion and Conclusions
Our results suggest that there may be substantive differences in the performance of the models now available to predict COVID-19 mortality in the United States and that model performance may differ when judged by 'knowledge' (i.e., bias and precision) and 'self-knowledge' (i.e., calibration and information). For example, when evaluated on the basis of the bias and precision of their central estimates, models A and I would seem to be the most attractive, with bias < 10% and precision of < 1.6.
It should be noted that, although the bias of several of the models is relatively small, all but one of the models appear to systematically underpredict the true rates of COVID-19 mortality.
When evaluated in terms of the modelers' ability to characterize the uncertainty in their estimates, a different picture emerges: models D, A and F look quite good, but model I does not. The fact that all but 3 of the 9 models considered have such low calibration scores indicates 'overconfidence', i.e., the relatively narrow confidence intervals given by these models are unjustified and should not be relied on by users.
We also find that model performance appears to depend on the racial composition and the COVID-19 case rate for the population of interest, and that model projections of near-term mortality are better than projections of medium-term mortality.
Due to differences in inclusion criteria, only one of the 9 assessed forecasting models, YYG, was also assessed by Friedman et al. (2020). 10 Thus, it is difficult to compare performance results for the models. However, the authors do show increasing median error and median absolute error with longer forecasting periods, which is in agreement with the results shown in this analysis. 10
Finally, our results indicate that performance-weighted ensembles outperform equal-weighted ensemble models and therefore may be of interest to decision makers. The finding that the Cooke's performance-weighted ensemble outperforms the equal-weighted ensemble is in agreement with the expert judgment literature. 12,13
Our sense is that these results are more suggestive than conclusive, because: (i) they examine only nine models; (ii) they are based on model performance in one eight-week period in the summer of 2020; (iii) they are limited to four states; (iv) they consider only forecasts of mortality and not other outcomes, such as cases or hospitalizations, that may be of interest for decision-makers; (v) our performance comparisons are descriptive and lack any formal tests of the statistical significance of observed performance differences; (vi) the performance of our ensembles has not been validated with independent data or subjected to cross-validation; and (vii) our analysis treats the models as 'black boxes' with no attempt to understand their internal structure, assumptions or data requirements.
On the other hand, our analysis has several strengths: (i) it evaluates the performance of a set of leading models which currently are being used to project COVID-19 mortality in the US; (ii) it relies on a broad set of performance criteria, which assess both knowledge (i.e., bias and precision) and self-knowledge (i.e., calibration and information); and (iii) it considers performance in four states that were selected to reflect the differences in racial composition and COVID-19 case rate in the US at the time.

Availability of Data
Model forecasting data was gathered from the COVID-19 Forecast Hub's publicly available structured data storage repository on GitHub. 9 Observed state COVID-19 mortality and case data was gathered from the Centers for Disease Control and Prevention (CDC). 27 State population and racial composition data was collected from one-year estimates from the Census Bureau's 2018 American Community Survey (ACS). 28 Table 4 in the supplemental material (S.2.1.) provides the racial composition statistics and case rate data. Table 5 in the supplemental material (S.2.2.) provides the model predictions, their uncertainty distributions, and the subsequent observations for COVID-19 mortality. Data was analyzed using Microsoft Excel and EXCALIBUR (a software package for using Cooke's Classical Method). 29