Short-term forecasts to inform the response to the COVID-19 epidemic in the UK

Background: Short-term forecasts of infectious disease can create situational awareness and inform planning for outbreak response. Here, we report on multi-model forecasts of Covid-19 in the UK that were generated at regular intervals starting at the end of March 2020, in order to monitor expected healthcare utilisation and population impacts in real time. Methods: We evaluated the performance of individual model forecasts generated between 24 March and 14 July 2020, using a variety of metrics including the weighted interval score as well as metrics that assess the calibration, sharpness, bias and absolute error of forecasts separately. We further combined the predictions from individual models to ensemble forecasts using a simple mean as well as a quantile regression average that aimed to maximise performance. We further compared model performance to a null model of no change. Results: In most cases, individual models performed better than the null model, and ensembles models were well calibrated and performed comparatively to the best individual models. The quantile regression average did not noticeably outperform the mean ensemble. Conclusions: Ensembles of multi-model forecasts can inform the policy response to the Covid-19 pandemic by assessing future resource needs and expected population impact of morbidity and mortality.


Introduction
Since the first confirmation of a case on 31 January 2020, the Covid-19 epidemic in the UK has caused a large burden of morbidity and mortality. Following a rapid increase in cases throughout February and March, triggered by repeated introduction and subsequent local transmission 1 , the UK population was advised on 16 March to avoid non-essential travel and contact with others, and to work from home if possible. This advice became enforceable law a week later, which was followed by a decline in reported cases and deaths starting in the first half of April. During the same period, hospitals prepared for a rapid increase in seriously ill patients by maximising inpatient and critical care capacity 2 .
Short-term forecasts of infectious diseases are increasingly being used to inform public health policy for a variety of diseases 3 . Models for short-term forecasts can be statistical (investigating the changing distribution of variables over time), mechanistic (explicitly incorporating plausible biological and social mechanisms of transmission), or a hybrid of the two. While the best choice of models for short-term infectious disease forecasts is an ongoing topic of research, there is some evidence that introducing mechanistic assumptions does not necessarily improve short-term predictive performance compared to statistical models that have no specific assumptions related to the disease transmission process 4 . While short-term forecasts are most prominent for seasonal influenza, more recently they have also been made for outbreaks such as Ebola, measles, Zika, and diphtheria [5][6][7][8][9] . Developing accurate and reliable short-term forecasts in real time for novel infectious agents such as SARS-CoV-2 in early 2020 is particularly challenging because of uncertainty about modes of transmission, severity profiles and other relevant parameters 7,10 .
Here, we report on short-term forecasts of the Covid-19 epidemic produced by six groups in the UK, representing a mixture of academic and government institutions and using a variety of models and methods. A system for collating and aggregating these forecasts was set up through a short-term forecasting subgroup of the Scientific Pandemic Influenza Group on Modelling (SPI-M) on 23 March, 2020. The aim of these efforts was to predict the burden on the healthcare system, as well as key indicators of the current status of the epidemic faced by the UK at the time. Aggregate forecasts were made available to the Strategic Advisory Group of Experts (SAGE) and the UK government, marking the first time that a multitude of models for short-term forecasting models were explicitly combined to inform health policy in the UK. We review forecasts between the end of March and July and assess the quality of the predictions made at different times.

Targets and validation datasets
Initially seven forecasting targets were set: 1) The number of intensive care unit (ICU) beds occupied by confirmed Covid-19 patients, 2) the total number of beds (including ICU) occupied by confirmed Covid-19 patients, 3) the total number of deaths by date of death, 4) the number of deaths in hospitals by date of report, 5) the number of new and newly admitted confirmed Covid-19 patients in hospital, 6) the number of new admissions to ICU and 7) the cumulative number of infections.
Validation data sets for England and the seven English National Health Service (NHS) regions into which it is divided were derived from NHS England situational reports, and for the devolved nations in Northern Ireland, Scotland and Wales from their distinct reports and data sets. Because of differences in the structure of these reports and the data included in them, validation data sets started being used for the short-term forecasts at different times. Over time, this was reduced to four targets: the number of ICU beds and any beds occupied by confirmed Covid-19 patients, respectively; the number of new and newly admitted Covid-19 patients in hospital; and the number of deaths by date of death. Data sources and their interpretation changed at several points and definitions were updated as time progressed. Data sources that were not available at the time forecasts were made were excluded from the evaluation.

Forecasts
Starting on 24 March 2020, modelling teams produced probabilistic forecasts for every day three weeks ahead for all targets except the cumulative number of infections, for which only a single time point at the date of report (i.e., a nowcast) was produced. Initially, teams submitted forecasts three times a week (with deadlines on Sunday, Tuesday and Thursday nights) and provided a best predicted estimate with lower and upper bounds of each forecast metric. On 31 March, this was changed to the median, 1%, 5%, 25%, 75%, 95% and 99% predictive quantiles. On 14 April it was decided to change to a schedule of submission twice a week (Sundays and Wednesdays, with an initial submission on a Thursday), and on 18 April the prediction quantiles were changed to 5% intervals from 5% to 95%. On 25 May the submission was changed to weekly on Tuesday mornings.

Models
The number of models providing forecasts changed over time, with 11 models from six institutions used from the end of March. Some of these 11 models were only used on a few occasions owing to the time required in producing forecasts at regular time intervals along with shifting priorities. Other models were changed over time as discussions over the nature of the validation data sets evolved and more information became available about the intricacies of the data used to model the epidemic. One of the models (NHSBHM) was purely statistical, one a statistical/mechanistic hybrid (EpiSoon) and all others mechanistic. In brief, the models used were: NHSBHM : A statistical Bayesian hierarchical model fitted individually to ICU and hospital admissions data to produce forecasts at the level of individual health trusts. The growth (decay) of the recorded values in each trust were assumed to follow a negative binomial distribution, whose mean was parameterised as a generalised logistic growth (decay) function. These forecasts were then aggregated to the regional and national level.
Microsimulation 11 : A spatial microsimulation model of Covid-19 transmission, fitted to hospital prevalence and incidence and death data by region of Great Britain via a sweep over a multidimensional parameter grid of the basic reproduction number R 0 , seeding and infection timing and effectiveness. A posterior distribution is calculated via Monte-Carlo sampling based on the (assumed negative binomial) likelihood of each model run.
SIRCOVID 12 : An age-structured stochastic compartmental Susceptible-Exposed-Infectious-Recovered (SEIR)-type model incorporating hospital care-pathways and transmission within care homes. A Bayesian evidence synthesis approach is applied to fit the model to multiple regional data sources, namely: daily deaths in hospital-and non-hospital settings, ICU and general bed prevalence data, Pillar 2 Polymerase Chain Reaction (PCR) testing data and serological survey data from blood-donors. Particle Markov-chain Monte Carlo (MCMC) methods are used to sample from the joint posterior distribution of model parameters, including disease-severity and time-varying transmission rates, and epidemic trajectories. The future epidemic trajectories are then simulated via posterior predictive forecasting.
Exponential growth/decline 13 : An exponential model based on assumed doubling/halving times, applied to daily hospital admission data to forecast future admissions, broken down by general hospitalisation and ICU. For each admission, the duration of hospitalisation is simulated from discretized Gamma distributions parameterised with published estimates of length of stay in ICU/non-ICU. Beds are counted for each day to derive bed occupancy in each.
EpiSoon 14 : A semi-mechanistic model that combines a time series forecasting model with an estimated trajectory of the time-varying reproduction number over time. Cases, hospitalisations and deaths are simulated forward separately using a renewal equation model. Changes in the reproduction number are estimated from probabilistic reconstruction of infection dates, separately for each geography.
Transmission 15 : An age-structured dynamic transmission model that uses Google mobility data to parameterize the impact of social distancing measures in each NHS region. The model is fitted to deaths and hospital bed occupancy in each region.
DetSEIRwithNB 16 : A deterministic SEIR-type model without age structure, but with differential rates between compartments depending on the next state individuals are progressing to (e.g. symptomatic cases recovering naturally or being admitted to hospital spend different times in their infectious state, reflecting different underlying progression processes). Infected cases can be asymptomatic or symptomatic, symptomatic cases recover or go to hospital, hospitalised cases recover, die or proceed to ICU, and ICU cases die or step down to hospital and then recover. The model fits new and newly confirmed cases in hospital, hospital and ICU beds occupied, and hospital deaths, using a negative binomial likelihood around the deterministic mean, to data for each region and nation independently. Transmission, and hence the reproduction number, are piecewise constant, with change points at the time of the lockdown (24 March), 1 month later (suggested by visual inspection of the data stream) and 6 weeks before the most recent data point. Two different variants for the fitting procedures are employed, one based on Maximum Likelihood Estimation (MLE) and one on MCMC.
Regional/age 17 : An adaptation of a Bayesian modelling framework developed for pandemic influenza 18 , this approach combines parallel deterministic SEIR transmission and disease reporting models for each NHS region, fitted to age-specific death and serological data. The parallel regions are linked through common parameters for the infectious period and the IFR. The model outputs Bayesian posterior probability distributions for parameters of interest and predictive distributions for the trajectory of the epidemic. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; three devolved nations (Wales, Scotland and Northern Ireland) solutions are matched to number of deaths (using date of death), hospital occupancy, ICU occupancy and hospital admissions; for England this is supplemented with serological data from blood donor sampling -this fitting procedure infers early growth and the strength of controls in each area as well as the appropriate scaling between symptomatic cases and observable health-care metrics.

Assessment metrics
We assessed weekly forecasts using the weighted average interval score (WIS) across all quantiles that were being gathered 21 . This WIS is a strictly proper scoring rule, that is, it is optimised for predictions that come from the data-generating model and, as a consequence, encourages forecasters to report predictions representing their true belief about the future 22 . The WIS represents a parsimonious approach to scoring forecasts when only quantiles are available: It is a weighted average of the interval score 22 over central prediction intervals bounded by quantile levels , where is an observed outcome, the forecast, the median of the predictive distribution, and and are the predictive upper and lower quantiles corresponding to the central predictive interval level , respectively. The interval score is optimised at 0, representing a point forecast which is exactly correct. It penalises wide prediction intervals as well as data that lies outside the intervals.
We further assessed the calibration, sharpness and bias of forecasts separately, in line with the premise that forecasts should "maximise sharpness subject to calibration", that is as narrow as possible while consistent with future observations 22,23 . For calibration, we assessed coverage at the central 50% and 90% prediction intervals. For sharpness metric, we used the sharpness term in the WIS definition, .
As bias metric, we estimates the proportion of predictive probability mass below/above the observation using the discrete quantiles, approximating a previously defined bias metric 23 , where we define the outermost quantiles corresponding to as and .
We lastly compared weekly forecasts by the mean absolute error (MAE) of the median forecast. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint

Ensembles
We combined all the model forecasts into a single ensemble using two stacking procedures, with the aim to compare them in order to select an optimum procedure. Firstly, we produced an ensemble with equally-weighted quantiles (EWQ) by calculating each combined quantile as the mean or median of all the individual model predictive quantiles 24,25 . Secondly, we generated an ensemble using quantile regression averaging (QRA) to calculate each combined quantile as a weighted average of the individual model quantiles for each location and metric, where weights were estimated from past data to optimise past performance with respect to the one-week ahead WIS 26,27 . We tested a range of past data from 1 to 5 weeks to include in QRA, and several variants of this procedure, including constraining the weights to be non-negative and sum up to one or not, estimating weights per quantile (with an additional constraint to avoid quantile crossing) or one weight across quantiles, estimating common weights for NHSE regions or separate weights, and having an intercept in the regression or not. Weights were calculated using the quantgen R package 28 .

Null model
We compared the performance of both the individual models and the ensembles with a null model that assumed that each target would stay at its current value indefinitely into the future, with uncertainty levels given by a discretised truncated normal distribution with lower bound 0 and a standard deviation given by past one-day ahead deviations from the value of the metric.

Results
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint The initial forecasts were made just before the peak in confirmed Covid-19 hospital occupancy in early April. The 11 models were used to provide a total of 71,887 predicted days across the four final forecast targets and different geographies, during the 13 weeks up to 14 July 2020 or 8 weeks up to 9 June 2020 (for new and newly confirmed patients in hospital), respectively. Overall, the individual models broadly followed the trajectory of the epidemic in their forecasts but struggled to correctly predict the timing and height of the peak in early April (Fig. 1). Almost all individual models performed consistently better than the null model (no change) in median predictions at a 2 week horizon, but less clearly so at a 1 week horizon, when a model null model of no change sometimes outperformed individual models (Fig. 2). More details of individual model performance are given in the Supplementary Table 1.

Figure 2: Performance against a null model of no change, shown as the proportion of 1-week and 2-week forecasts of each model that performed better than the null model with respect to the WIS (dots: proportion; lines: 95% binomial confidence intervals).
Only models that were used to make forecasts at more than 2 time points are shown. Data sources are marked in grey where they were not available at the time of the forecasts.
While there was little consistency amongst the models with respect to performance against the four targets considered and no model clearly outperforming the others, all the ensemble models considered performed as good as or better than the best individual models against each target with respect to the WIS (Fig. 2 and Table 1). The best-performing QRA . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint ensemble according to the WIS used only the latest set of historical forecasts to estimate the weights and assigned separate weights for each quantile and each geographical region, constrained to sum up to 1 and without intercept (Supplementary Table S2). The best-performing EWQ combined models by taking the median of quantiles at each quantile level (Supplementary S3). Compared to the equal-weighted ensemble, the best-performing QRA yielded some improvement (Figs. 2 and 3 and Table 1).  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint was made a week earlier (dots: median; whiskers: 90% prediction intervals) and compared to the data as subsequently observed (black lines). Data sources are marked in grey where they were not available at the time of the forecasts.
Both the ensemble methods yielded well-calibrated models for daily deaths and new and newly confirmed cases in hospital at both 1-week and 2-week time horizons, but less so for hospital and in particular ICU occupancy (Table 1, Cov 0.5 and Cov 0.9). The forecasts were positively biased, i.e. overestimated ICU beds occupied as well as new and newly confirmed cases in hospital, while they were much less biased in either direction for total beds occupied and deaths (Table 1, Bias). The ensemble methods had similar sharpness, with the QRA model slightly sharper than the EWQ in most cases (Table 1, Sharp). Overall, the EWQ models performed better than most variants of the QRA in terms of both WIS and MAE, but the best QRA models tended to outperform the EWQ ( Table  1, WIS and MAE, and Supplementary Tables 2 and 3). The QRA as a model that could learn from past performance, gave widely fluctuating weights to models over time (Fig. 4).

Figure 4: Weights given to the different models in the median prediction of the best-performing QRA
at each forecast date for England.

Discussion
A system for collating and combining short-term forecasts was set up during the early phase of the Covid-19 epidemic in the UK as a response to an urgent need for predictions of the epidemic trajectory and health system burden. This led to rapid model building and left little time for systematic testing. The results shown here highlight some of the challenges in predicting an emerging epidemic, where calibration and predictive performance are difficult to achieve and considerable uncertainties exist 10,23,29 . The forecasts reflect a large variation in models that had been designed for a variety of purposes and, at least in some cases, rapidly adapted to the task. The is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint models have been and are being developed and improved as additional information becomes available, both on the validation data sources and the properties of the epidemic. This makes it difficult to create like-for-like comparisons across different dates at which forecasts were made.
Given these difficulties, it is important that models are synthesised in a meaningful and principled manner in order to derive greater value from different methodological perspectives on the epidemic. We found that stacking the models using a quantile regression average that optimises historical forecast performance resulted in good calibration against most targets and tended to outperform the equally-weighted quantile combination. Having said this, this result was based on testing a wide range of ways to combine models in a quantile regression, and only the best-performing variants performed better than a simple equal-weighted quantile average, which in turn performed better than most individual models. It has been observed in other fields that a simple model average tends to outperform individual models, which can, to some degree, be explained theoretically 30 .
Over the three months analysed here, the models, as well as the inclusion and interpretation of the different data sources, were undergoing continuous change. At the same time, not every model was submitted every week due to the time pressures involved and shifting priorities in a rapidly evolving public health emergency. The lack of significant improvement from weighting by past forecast performance would indicate that these issues change the performance of the individual models on a week-by-week basis, potentially to a degree that reduces the expected benefits from systematically taking into account the past performance of each model. For these reasons, the model combination produced from the equally weighted quantiles can serve as a good and principled ensemble forecast, while the performance of different ensemble methodologies remains an ongoing topic of investigation.
Throughout the period investigated in this study, the epidemic in the UK steadily declined, and good performance in this period need not correlate with good performance in other regimes, such as a resurgence of cases or a steady-state behaviour. The difficulty of the models to correctly predict the turnaround of the epidemic around the peak (albeit often acknowledging their own uncertainty) indicates that there may be value in incorporating external information, e.g. from changing social contact studies or behavioural surveys 31,32 . All models received some weight in the QRA at least during some periods, indicating that there is value in the contributions of all of the models. Extensions to the simple regression used here 33 , or alternatives such as isotonic distributional regression 34 , may yield future performance gains, as may the inclusion of model types and structures that are currently not represented in the pool of models that are part of the ensemble.
Much discussion on real-time modelling to inform policy has focused on the value of the reproduction number, R . A real-time estimate of R indicates whether new infections are expected to increase or decrease and is, therefore, a valuable quantity to signal the need for control measures and their required strength. However, it does not provide a direct prediction of the estimated burden on the healthcare system or expected morbidity and mortality in the near future. By forecasting these directly, decision makers can be equipped with a more complete set of indicators for short-term planning than through considering R alone. Unlike scenario models, which are a key tool for long-term planning but are difficult to validate rigorously because of unsurmountable long-term uncertainty, modelling for short-term forecasts can be numerically evaluated and, . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; consequently, improved in real-time. Conversely, models that are optimised for short-term forecasts usually decline in predictive performance after only a few generations of transmission 23 . Together, these two types of modelling have played a key role in informing the response to the Covid-19 pandemic in the UK and elsewhere.
As SARS-CoV-2 continues to spread in populations around the world, real time modelling and short-term forecasting can be key tools for short-term resource planning and pandemic management. The short-term forecasts described here and related initiatives in the US 35 , Germany and Poland 36 , and elsewhere are reflecting efforts to provide decision makers with the information they need to make informed decisions. In the UK, similar methodologies to the ones presented here are now used to generate medium-term projections, that is extrapolations over time periods longer than three weeks of what would be expected to happen if nothing changed from the current situation. As SARS-CoV-2 continues to affect populations around the world, short-term forecasts and longer-term projections can play a crucial part in real-time monitoring of the epidemic and its expected near-term impact in morbidity, healthcare utilisation and mortality.

Code and data availability
All code and data used to generate the results in this paper are available as an R package at https://github.com/epiforecasts/covid19.forecasts.uk .

Acknowledgments
The approaches for scoring and creating ensembles have greatly benefited from regular exchange with groups conducting similar efforts elsewhere, in particular Evan Ray, Nicholas Reich (University of Massachusetts Amherst), Johannes Bracher and Tilmann Gneiting (Heidelberg Institute of Theoretical Studies). The authors would like to thank the SPI-M members and secretariat for discussion and support, especially Graham Medley, Tom Finnie and Alastair Ikin. This work has greatly benefited from parallel research undertaken by the Statistics and Dispersion Modelling team at DSTL, and was conducted within a wider effort to improve the policy response to COVID-19. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint 34. Henzi, A. and Ziegel, J.F. and Gneiting, T. Isotonic Distributional Regression. arXiv (2019). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

Supplementary
The copyright holder for this this version posted December 4, 2020. ; https://doi.org/10.1101/2020.11.11.20220962 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint