Abstract
In this paper, we compare the inference regarding the effectiveness of the various non-pharmaceutical interventions (NPIs) for COVID-19 obtained from three SIR models, all developed by the Imperial College COVID-19 Response Team. One model was applied to European countries and published in Nature1 (model 1), concluding that complete lockdown was by far the most effective measure, responsible for 80% of the reduction in Rt, and 3 million deaths were avoided in the examined countries. The Imperial College team applied a different model to the USA states2 (model 2), and in response to our original submission, the Imperial team has proposed in a referee report a third model which is a hybrid of the first two models (model 3). We demonstrate that inference is highly nonrobust to model specification. In particular, inference regarding the relative effectiveness of NPIs changes substantially with the model and decision makers who are unaware of, or ignore, model uncertainty are underestimating the risk attached to any decisions based on that model. Our primary observation is that by applying to European countries the model that the Imperial College team used for the USA states (model 2), complete lockdown has no or little effect, since it was introduced typically at a point when Rt was already very low. Moreover, using several state-of-the-art metrics for Bayesian model comparison, we demonstrate that model 2 (when applied to the European data) is better supported by the data than the model published in Nature1. In particular, serious doubt is cast on the conclusions in Flaxman et al.1, whether we examine the data up to May 5th (as in Flaxman et al.1) or beyond the point when NPIs began to be lifted. Only by objectively considering a wide variety of models in a statistically principled manner, can one begin to address the effectiveness of NPIs such as lockdown. The approach outlined in this paper provides one such path.
1 A Tale of Three Models
The different models (Flaxman et al.1 and Unwin et al.2) produced by the Imperial College COVID-19 Response Team aim to explain the evolution of Rt. We will refer to these models as model 1 (the model applied to European countries in the Nature publication1), model 2 (the model applied to the USA states2), and model 3 (the model that Flaxman et al.1 proposed in their referee report to our original submission to Matters Arising), respectively. The first two models for the Rt are fundamentally different. In model 1, the proportional variation of Rt from the initial R0 is modelled as a step function and only allowed to change in response to an intervention. Therefore, any decrease in Rt (even if this decrease is a result of the increasing proportion of the population who are infected, to changes in human behaviour, clustered contact structures and/or pre-existing immunity3) must, by the model construction, be attributed to interventions and the impact is immediate with no time lag or gradual change, when a new intervention is adopted. In model 2, the proportional variation of Rt from R0 has a different functional form and is allowed to vary with mobility indicators for various activities. These mobility indicators are proxies for changes in human behaviour, whether that change is due to one or more centrally imposed interventions or whether it is the product of individuals responding to the epidemic on their own initiative independently of centrally imposed interventions. Model 2 also does not presume step functions and is therefore capable of capturing more gradual changes over time. Model 1 is deliberately simple and designed to test the impact of interventions without the confounders of mobility data, which themselves will reflect the impact of the interventions. Model 2 leaves out interventions and uses mobility data as predictors in the evolution of Rt. The advantage of model 2 is that it gives a more flexible estimate of Rt, by allowing it to change with mobility trends. Although there is no explicit causal structure in model 2, the time sequence of the data and events makes inference around the impact of interventions possible by observing if the change in Rt precedes a given intervention or interventions or not. Here, we apply both models 1 and 2 to the European data to compare the results and inferences they obtain. See supplementary methods for details on methods and data. In response to our original submission, the Imperial team proposed yet a third model which includes a single NPI, lockdown, as well as mobility, and we refer to this as model 3. We extend Imperial’s model 3 to include all the NPIs that were used their original model 1, and discuss inferences regarding this model.
2 Results
There are two main observations when comparing the modeling results in Figure 1a, as well as the corresponding Extended Data Figures 1a-1e. First, and most notable, is that while the models give very different trajectories of Rt, both models produce visually similar fits to the observed daily death counts, although we note that the root mean square error (RMSE) for model 2 is less than or equal to the RMSE for model 1 in 9 out of the 11 countries. Additionally, and most importantly, the predictive density of model 2 provides a better fit to the data than model 1 as measured by two state-of-the-art criteria for Bayesian model comparison, the Deviance information criteria (DIC)4 and the Watanabe-Akaike information criteria (WAIC)5. Table 2 and Table 3 provide details. The second observation is that inference regarding the impact of interventions varies significantly between the two models. The inference from model 1 indicates that lockdown had the biggest impact of all the interventions in all countries. Indeed, Flaxman et al.1 states that Lockdown has an identifiable large impact on transmission (81% [75%-87%] reduction).
United Kingdom. The start time for the plots is 10 days before 10 deaths are recorded. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
In contrast, model 2 shows clearly that Rt was falling well before lockdown, which occurred after the sharp decline in Rt. With the exception of Belgium (and to a lesser degree France), this is consistently seen across the European countries (see Extended Data Figures 1a-1e) and is in line with the mobility data, see Figure 2. At the time lockdown was adopted in the UK, Rt had already decreased to 1.46 (95% CI, 0.89 to 2.20) from an initial R0 of 4.46 (95% CI, 2.64 to 7.20) according to model 2, and the largest drop in Rt occurred after the implementation of self-isolation measures and encouraging social distancing, see Table 1.
Percentage change in mobility from baseline level from February 15th to July 12th, by locations in each of the European countries examined in Flaxman et al.1, as well as an additional three countries consisting of Greece, the Netherlands and Portugal. Average mobility is computed based on the trends in retailers and recreation venues, grocery markets and pharmacy, transit stations and workplaces. Black dashed lines in each plot indicate the lockdown start and end dates.
Basic reproduction number R0 and time-varying reproduction number Rt immediately at time of introduction of NPIs given by both models using data up to May 5th for all eleven countries analysed in Flaxman et al.*. These NPIs are self-isolation (SI), social distancing (SD), school closure (SC), event ban (EB) and lockdown (LD). 95% credible intervals are given in parentheses below the corresponding point estimates. Countries where the seeding of new infections occur after the introduction of NPIs are denoted with an asterisk.
We also evaluated the effect of extending the time horizon of analysis until July 12th where restrictions had been lifted in some of these countries. At the suggestion of Flaxman et al.1 in their referee report, we allow for the effect of the NPIs to be asymmetric. That is, we allow the impact of lifting of an NPI on Rt to be different in magnitude from the impact of imposing that NPI. To make the models 1 and 2 comparable we start both models on March 4th. We also show the results for model 1 with a start date of February 13th, the date used by1, as well as a March 4th start date, in Extended Table 3. As shown by Extended Table 3, inference regarding the impact of an NPI is not highly sensitive to the start date.
We also include an additional three countries, i.e. Greece, the Netherlands and Portugal, in the analysis. Extended Data Table 1 shows the end dates of school closures, ban on public events and lockdown used in the analysis. Figure 1b presents a comparison of the results obtained from both models for the UK for the extended data, while Extended Data Figures 2a-2g show similar results for the remaining 13 countries. Extended Data Table 2 reports the RMSE while Tables 2 and 3 report measures of fit, effective sample sizes and two information criteria; Deviance information criterion (DIC)4 and Watanabe-Akaike information criterion (WAIC).
Estimates of log pointwise predictive density (lppd), effective number of parameters (WAIC refers to Watanabe-Akaike information criterion; DIC refers to Deviance information criterion) and log predictive density log p(y|) evaluated at posterior mean
.
Estimates of various information criteria; the Watanabe-Akaike information criterion, WAIC1 = −2lppd + 2pWAIC1 and WAIC2 = −2lppd + 2pWAIC2 which uses lppd as a measure of fit with pWAIC1 and pWAIC2 as the effective number of parameters to penalize the fit respectively; the Deviance information criterion which uses
, as the measure of fit, and pDIC as the penalty. Note that a lower value implies a better predictive density, and the preferred model for each criteria and time period is shown in bold.
Figure 1b, Extended Data Figures 2a–2g, and Tables 2 and 3 present a number of features. Model 2 provides better fits to the data than model 1, as evidenced by the RMSE, which is lower for 11 out of the 14 countries. Importantly, the impact of lockdown in model 1 is significantly reduced; the CI for the coefficient of lockdown is (0.73,1.33), for the period up to July 12th, while the CI for the coefficient based on data up to May 5th is (1.37, 2.05). The CI for the coefficient of lifting lockdown is (0.00,0.15), suggesting that it is significant.
The authors1, in their referee report for our submission, proposed a third model – model 3, in which they included mobility and lockdown, as the only NPI, and found that lockdown was still significant in reducing Rt. We question why they chose only to include lockdown, and not any other of the NPIs, either jointly or singly, given that their initial goal was to quantify the relative contributions of several NPIs to the reduction in Rt. In response, we construct a similar model but, for thoroughness and in the spirit of the original Nature paper, we include include all of the NPIs, as well as mobility. Figure 3 plots the posterior distribution for the regression coefficients, and shows that lockdown and event ban had similar effect sizes on the reduction of Rt. This paints a very different picture from the Nature paper which claimed that lockdown was responsible for 80% of the drop in Rt. Moreover, we question the usefulness of including NPIs as well as mobility data in the same model, because of confounding effects, but we include this analysis only in response to the referee report from Imperial. NPIs reduce Rt by reducing contact among individuals. An indirect measure of the reduction in individual contact is the mobility data, and so these data will be highly correlated with NPIs, making inference difficult.
Density estimates of the posterior distributions of regression coefficients for the covariates in model 3, for the data up to May 5th.
Finally, we should point out that Figure 1b shows that the gradual increase in Rt following the minimum value commenced well before the lifting of lockdown. This suggests that as people become less frightened by the prospect of a catastrophe, mobility (see Figure 2), and hence the time-varying reproduction number, increases.
3 Conclusion
The purpose of this paper is to demonstrate that there is substantial model uncertainty which is ignored by the Nature paper1. and that inference around the effectiveness of NPIs is highly model dependent. Failing to report this uncertainty to those who make policy decisions will ultimately undermine the public’s trust in the value of decisions based on statistical modeling. In attempting to answer the question ‘which results should be used to guide policy making in lifting restrictions?’, we must be transparent about what we do not know. Flaxman et al.1 make the statement We find that, across 11 countries, since the beginning of the epidemic, 3,100,000 [2,800,000 – 3,500,000] deaths have been averted due to interventions. However, we have shown that their model 2, which is better supported by the data than model 1, provides very different inferences concerning the effectiveness of intervention strategies for the period March 4th to May 5th (as well as up to July 12th) in the UK. Moreover when we include all the NPIs in their model 3 we find that event ban had a similar effect on the reduction of Rt as did lockdown- a very different result from the Nature paper1. Although it is tempting to congratulate ourselves on our decision to implement lockdown, citing the number of lives that were saved, we should resist this temptation, and examine other possible explanations in an objective and statistically principled manner. The approach outlined in this paper provides one such path. Failure to do this and therefore mis-attribute causation could mean we fail to find the optimal solution to this very challenging and complex problem, given that complete lockdown can also have many adverse consequences6.
We do not want to reach the opposite extreme of claiming with certainty that lockdown definitely had no impact. Other investigators using a different analytical approach have suggested also benefits from lockdown, but of much smaller magnitude (13% relative risk reduction7) that might not necessarily match complete lockdown-induced harms in a careful decision analysis. Another modeling approach has found that benefits can be reaped by simple self-imposed interventions such as washing hands, wearing masks, and some social distancing8. Observational data need to be dissected very carefully and substantial uncertainty may remain even with the best modelling9. Regardless, causal interpretations from models that are not robust should be avoided. Given the analyses that we have performed using the three models that the Imperial College team has developed, one cannot exclude that the attribution of benefit to complete lockdown is a modelling artefact.
Data Availability
All source code for the replication of our results is available from the Imperial College COVID-19 Response Team's Github repository. Daily confirmed cases and deaths data are publicly available from the European Centre of Disease Control's website.
Author Contributions
All authors contributed equally to this work. VC performed all the computations and produced all the graphics. SC wrote the initial draft. JI and MT wrote subsequent drafts. All authors discussed the results and implications and commented on the manuscript at all stages.
Conicts of Interest
None.
Code Availability
All source code for the replication of our results is available from the Imperial College COVID-19 Response Team’s Github repository: https://github.com/ImperialCollegeLondon/covid19model. Daily confirmed cases and deaths data are publicly available from the European Centre of Disease Control’s website.
Supplementary Methods
Model 2 assumes that Rt is a function of mobility and allows this impact to vary regionally by the use of regional specific random effects terms. The data used by Unwin et al.2 to estimate Rt is the Google’s COVID-19 Community Mobility Report. We adapt this methodology to the European context by modelling potential idiosyncrasies in mobility trends across countries using country specific random effects. To complete the specification of model 2 for the European data, we use Google’s COVID-19 Community Mobility Report10 which provides data measuring the percentage change in mobility compared to a baseline level for visits to a number of location categories; retailers and recreation venues, grocery markets and pharmacies, parks, transit stations, workplaces and residential places. We use the average change in mobility across all location categories, excluding residential places and parks, as a measure of the reduction in mobility, so that the reproduction number Rt,m is given by:
where Xt,m is the average mobility at time t for country m and ∊m,w(m)(t) is a weekly AR(2) process centred around 0.
The seeding of new infections in models 1 and 2 is chosen to be 10 days before the day a country has cumulatively observed 10 deaths so that mobility data are available for all the countries examined. In the UK, this corresponds to March 4th, which is later than the infection start date of February 13th used by Flaxman et al.1.
For posterior inference of model 2, we use the same prior distributions as in Unwin et al.2 except for R0, where a weakly informative prior of a normal distribution truncated below at 1 with mean 3.28 and standard deviation 2 is used. This prior is chosen so that approximately 95% of the prior density is between 1 and 711, and that R0 is above the critical value of 1 at the start of the epidemic. For model 1, we use the same priors as in Flaxman et al.1 for the analysis up to May 5th and July 12th.
The functional form of Rt,m based on model 3 suggested by Flaxman et al.1 analysed in our paper is:
where the subscript k refers to the imposition of various NPIs given in Table 4.
Correspondence of subscripts for k to each NPI.
Austria and Belgium. Plots of daily infections, daily deaths and time varying reproduction number Rt until May 5th. The start time for the plots is 10 days before 10 deaths are recorded. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Denmark and France. Plots of daily infections, daily deaths and time varying reproduction number Rt until May 5th. The start time for the plots is 10 days before 10 deaths are recorded. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Germany and Italy. Plots of daily infections, daily deaths and timevarying reproduction number Rt until May 5th. The start time for the plots is 10 days before 10 deaths are recorded. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Norway and Spain. Plots of daily infections, daily deaths and time-varying reproduction number Rt until May 5th. The start time for the plots is 10 days before 10 deaths are recorded. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Sweden and Switzerland. Plots of daily infections, daily deaths and time-varying reproduction number Rt until May 5th. The start time for the plots is 10 days before 10 deaths are recorded. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Austria and Belgium. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1–3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magnification of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Denmark and France. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1{3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magni_cation of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Germany and Greece. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1{3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magni_cation of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Italy and the Netherlands. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1{3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magnification of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Norway and Portugal. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1{3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magnification of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Spain and Sweden. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1–3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magnification of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
Switzerland. Plots of daily infections, daily deaths and time-varying reproduction number Rt in columns 1{3 until July 12th. The start time for the plots is 10 days before 10 deaths are recorded. Column 4 is a magnification of column 3 around the period of the NPIs. Observed counts of daily infections and daily deaths are shown in red, and their corresponding 50% and 95% CIs are shown in dark blue and light blue respectively.
RMSE of daily death counts for Models 1 and 2 for the time period up to May 5th and July 12th.
95% credible intervals for the coefficients of imposition and lifting lockdown, event ban and school closure in model 1 up to July 12th for 2 different infection start dates in the UK. For the start date from March 4th, the lifting of school closure is excluded from the model due to non-convergence.
Acknowledgement
We congratulate the Imperial College Response Team for sharing openly the code for their models and for the overall transparency of their work that has allowed performing these analyses. We thank Hadi Ashfar for his suggestions to improve the computational efficiency of the HMC scheme. We also thank Jack Wood for his help in the construction of Extended Data Table 1.