Bayesian Estimation of real-time Epidemic Growth Rates using Gaussian Processes: local dynamics of SARS-CoV-2 in England

Quantitative assessments of the recent state of an epidemic and short-term projections into the near future are key public health tools that have substantial policy impacts, helping to determine if existing control measures are sufficient or need to be strengthened. Key to these quantitative assessments is the ability to rapidly and robustly measure the speed with which the epidemic is growing or decaying. Frequently, epidemiological trends are addressed in terms of the (time-varying) reproductive number R. Here, we take a more parsimonious approach and calculate the exponential growth rate, r, using a Bayesian hierarchical model to fit a Gaussian process to the epidemiological data. We show how the method can be employed when only case data from positive tests are available, and the improvement gained by including the total number of tests as a measure of heterogeneous testing effort. Although the methods are generic, we apply them to SARS-CoV-2 cases and testing in England, making use of the available high-resolution spatio-temporal data to determine long-term patterns of national growth, highlight regional growth and spatial heterogeneity.

control measures (UK Health Security Agency, 2020a); hence placing great political, economic and public-health importance 23 on this single value. A robust and rapid estimation of R (or the epidemic growth rate r ), together with levels of uncertainty, 24 remains a key public-health tool. The estimation needs to be rapid, such that prompt action can be taken before the burden on 25 health services becomes too great; the estimation also needs to be robust, as the economic and social consequences of action 26 can be costly and so should only be enacted when there is considerable certainty that such measures are needed. As such there 27 is a clear need for continued development of statistical methods that can extract a meaningful signal from complex and noisy 28 epidemiological data. 29 Obtaining an accurate and timely measure of R generally requires a robust estimate of either the generation time or the 30 infectiousness profile over time (Wallinga and Lipsitch, 2007) (capturing the expected level of transmission at time t after 31 infection). Both of these necessitate detailed individual-level observations (Hart et al., 2021;Abbott et al., 2020 and may 32 therefore be context dependent, leading to a diversity of R estimates from the same population-level data . 33 Here, we adopt the more parsimonious approach of working with the growth rate r (such that the number of infections grows like 34 I (t ) ∼ ex p (r t )), in which case our threshold for a growing or declining outbreak becomes where r is greater or less than zero, 35 respectively. 36 Given the importance of real-time estimation of the growth rate, r , or the reproductive number, R , multiple statistical 37 methods have been developed (The Royal Society, 2020; Gostic et al., 2020). All methods have advantages and potential 38 problems, with an inevitable trade-off between robustness and timeliness. Most naively, the growth rate can be estimated by 39 simply measuring the rate of change of log(infection), where infections are often approximated as being proportional to reported 40 cases. This naive approach is confounded by the stochastic nature of transmission and reporting, requiring either smoothing of 41 the data or fitting the growth rate over a defined time window -longer windows and more smoothing eliminate stochastic effects, 42 but mean that real-time estimates of the growth rate and R are considerably lagged. The UK government dashboard (UK Health 43 Security Agency, 2020c) expands on these simple ideas to produce estimates of the growth rate at the national scale, calculated 44 as the relative change over seven days in the smoothed number of cases (smoothed by taking a mean over a seven-day window). 45 In recent years, EpiEstim (Cori et al., 2013) has grown in popularity as a method of estimating changing R values, due to its 46 flexibility and accuracy . EpiEstim uses a Bayesian framework to compare the reported number of cases over a 47 time window with the projection based on the infectiousness profile and historic reporting to generate an estimate of R in a given 48 . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi. org/10.1101org/10. /2022 In this paper we develop a novel method to generate a real-time estimate of the growth rate of infection in small stochastic 50 populations. Our flexible method uses a Bayesian approach to compute the posterior distribution of the growth rate at any point in 51 time and produces samples of the joint posterior distribution of the growth rate for any given interval. We use Gaussian processes 52 (GPs) to fit to the reported data, which gives us flexibility in smoothing the count of new cases according to the GP parameters. 53 We fit to two different measures: the raw number of recorded cases in a region, as defined by PCR positives in the community; or 54 the proportion of community PCR tests that are positive. The latter provides a more stable estimate when testing patterns are 55 changing rapidly. 56 We first outline the basic methodology and illustrate its use on surrogate data sets where the growth rate is known. We then 57 apply our model to data on SARS-CoV-2 cases in England, initially at a national-level by estimating the daily growth rate of 58 SARS-CoV-2 from 1st September 2020 to 6th December 2021 (as available at the time of writing). Finally, we explore the spatial

63
We describe a model framework to estimate the growth rate, r , of an epidemic based on the count of reported infections (cases). 64 If the counts are recorded through a testing programme, to adjust for changes in testing effort the model can also incorporate the 65 total number of tests performed over time. We assume that the underlying process generating the count of cases is given by 66 a one-dimensional Gaussian process (Section 2.1), and obtain the growth rate by sampling from the derivative of the process 67 (Section 2.2). 68 2.1 | Model structure 69 For a given community, let T = {1, . . . , T } be a set of time indices at which data are collected. For each t ∈ T, y t denotes 70 the number of positive test results at time t and n t denotes the total number of tests. In the context of SARS-CoV-2, data are 71 generally collated daily with the potential for missing data, which our proposed models allow for; for other infections data may 72 be collected over larger or irregularly spaced intervals. 73 2.1.1 | Positives model 74 We first propose the following Bayesian hierarchical model, labelled as the positives model, which only requires knowledge 75 of y t (the number of positive cases at time t ) and is therefore applicable in situations where n t (the number of tests at time 76 t ) is not available. The model assumes that y t follows a negative binomial distribution parameterised by its mean µ t and a 77 time-homogeneous over-dispersion parameter η. The probability mass function of y t under this parameterisation is: The parameter log(η) is assigned a normal prior N (m η , τ −1 η ). The log relative risk, log(µ t ), is decomposed into the sum of 80 a smooth term x t and a Gaussian error term ϵ t with zero mean and precision τ ϵ ∼ Γ (a ϵ , b ϵ ). The model can therefore be 81 . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi.org/10.1101/2022.01.01.21268131 doi: medRxiv preprint expressed as: where the hyperparameters underpinning the distribution of the error precision (a ϵ and b ϵ ) and the over-dispersion hyperparam-88 eters (m η and τ −1 η ) are specific to the problem and quoted in the results. To avoid identifiability issues with the terms x t , we 89 impose a sum-to-zero constraint to the error terms such that t ϵ t = 0.

90
The prior on the smooth terms x t is given by a Gaussian process f on such that x t = f (t ), where f has mean zero, 91 covariance k θ (f (s), f (s ′ )) between the value of the process f at times s and s ′ and hyperparameters θ: 93 f (s) |θ ∼ G P (0, k θ (f (s), f (s ′ ))).

95
A comprehensive summary of such regression models using Gaussian processes can be found in (Rasmussen and Williams, 2006, 96 Ch. 2). Here, we use a one-dimensional Matérn covariance family (Stein, 1999), since the resulting process f is stationary and 97 isotropic, and the smoothness can be specified through a single smoothing parameter ν. We choose ν = 3/2 which results in the 98 covariance function: reported SARS-CoV-2 cases in England have a pronounced day-of-the-week effect, with fewer cases reported on weekends (see 106 Figure 2A). The day-of-the-week effect (which is included in all the results shown below) can be interpreted as an alternative 107 error term with possible non-zero mean whose parameters are allowed to depend on the day of the week. Formally, we substitute 108 the error term ϵ t by a day-of-the-week effect w d (t ) with a Gaussian hyperprior with zero mean and precision is the day of the week on t ). We impose a sum-to-zero constraint t = w t = 0 to avoid identifiability issues with 110 the terms x t . with mean parameter µ t , over-dispersion parameter ρ and number of trials n t . We use a beta-binomial distribution to account for 116 both the bounded nature of y t (which is bounded above by n t ) and the over-dispersion.

117
. CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi.org/10.1101/2022.01.01.21268131 doi: medRxiv preprint The probability mass function of y t under this parameterisation is given by: where M = (1/ρ) − 1. Given the bounded nature of the positive tests, such that µ t ∈ (0, 1), we utilise the inverse logit transform 120 (logit −1 ), and assume that logit −1 (µ t ) is decomposed into the sum of a smooth term x t and a Gaussian error term ϵ t with 121 zero mean and precision τ ϵ ∼ Γ (a ϵ , b ϵ ). The transformed over-dispersion parameter logit −1 (ρ) is assigned a normal prior 122 N (m ρ , τ −1 ρ ). As in the positives model, the prior on x t is given by the Gaussian process described in Section 2.1.1: where ∂ t signifies the time derivative. However, w t is unknown in practice, so we instead approximate the growth rate using 132 our fitted Gaussian process. For the positives model, we approximate r t as the growth rate of the process fitting the number where h is the window size of the approximation. points. The nodes are located according to the frequency of reported counts; that is, if the data is reported daily, one node per 148 day is located, even if there is missing data. To avoid boundary effects, the mesh domain is extended by at least the length of 149 the studied period (extra nodes are added before the first observation and after the last observation) (Lindgren and Rue, 2015). 150 The code is available in GitHub/juniper-consortium/growth-rate-estim. For the rest of the paper, we use weakly informative 151 priors to the over-dispersion parameters of the models, such that log(η) ∼ N (0, 0.01 −1 ) and logit −1 (ρ) ∼ N (0, 0.5 −1 ). More is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi.org/10.1101/2022.01.01.21268131 doi: medRxiv preprint Ch. 5). Choices of τ ϵ , l 0 , σ 0 and B are case-specific and are detailed in the results. 154

155
To validate the accuracy of the models, we generate synthetic epidemiological data from a single homogeneous population of 156 size N = 1, 000, 000. We assume for the first 100 days there is an underlying growth rate of r = 0.03 per day. For the second 157 100 days, we assume that controls are enacted and the epidemic goes into decline with a rate of r = −0.02. More precisely, the 158 number of infections y (t ) on day t are sampled from a Poisson distribution with rate r 6 i =3 y (t − i ) for t > 6 and exp(r t ) for 159 t <= 6 (where r = 0.03 for t <= 100, r = −0.02 for t > 100, and y (0) = 100). 160 We compare two scenarios for the number of daily tests n (t ). As our purpose is to test the model accuracy under a known 161 growth rate, rather than discuss the effect of the test sampling, we make highly optimistic assumptions for the frequency of 162 testing. In the first scenario, a random ten percent of the population is tested daily, n (t ) = 0.1N ; in the second scenario, tests . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi.org/10.1101/2022.01.01.21268131 doi: medRxiv preprint Although the method is not inherently spatial, treating each set of temporal data as statistically independent, we can nevertheless 179 use the spatial position of each spatially sampled location to address localised effects. In this way, we introduce a heterogeneity 180 measure to assess whether exceptionally high or low growth rates within a given spatial location are a localised pattern or are 181 caused by a larger, more widespread, phenomenon. We define the heterogeneity h i of a spatial element i as: where r i is the growth rate within location i , j ∈ N bd (i ) denotes all spatial locations that neighbour element i (where for 184 simplicity we assume this means share a boundary, but could be any measure of spatial locality) and N i is the number of 185 neighbours of i . Samples of h i are taken by sampling from r i and r j . As such, h i provides a measure of local covariance, with its 186 sign reflecting whether it has higher or lower growth than its neighbours. We apply the models described above to data on daily counts of SARS-CoV-2 cases in England and in lower tier local authorities 192 (LTLAs) between 1st September 2020 and 6th December 2021, dataset provided by Public Health England (now UKHSA). 193 The data correspond to the count of people from the wider population (Pillar 2 of the UK government testing programme (UK 194 Health Security Agency, 2021)) with at least one positive PCR test, reported by specimen date and residence location (by LTLA). 195 The data also include the count of negative tests. . This agreement provides additional confidence that we are seeing a robust signal from the data. 222 Nevertheless, there were sustained periods with the two models producing dissimilar quantitative estimates, such as during 223 December 2020. Higher differences in testing correspond to higher differences in growth rate estimation (Figure 3, panel B). 224 This helps explain the discrepancies observed in December 2020, when testing practices are likely to be affected by the holiday 225 period. We applied the proportions model to the count of positive cases of SARS-CoV-2 in England for each of the 317 LTLAs. Since 229 data at a lower resolution can be noisy, setting weak priors for the hyperparameters of the Gaussian process can lead to unrealistic 230 length scales to account for the noise. To overcome that issue, we assume that the covariance function of the underlying Gaussian 231 process at a local authority level has a similar shape to the national data. Therefore, we set the baseline values σ 0 and l 0 for the 232 LTLA level to be the posterior median of σ and l obtained with the national data in Section 3.2, respectively, with precision 233 B = 10 . 234 We focus on the results from 23rd April 2021, when infections with the Delta variant were increasing in the North-West of 235 England, particularly in Bolton where our proportions model gave an estimated positivity of 3.25% (95% PI: 2.65%-3.90%) 236 ( Figure 4, panel A). Multiple neighbouring LTLAs in the North-West region had median estimates for proportion of tests being 237 positive above 2%. In other regions at that time, some urban centres had a similar high incidence of 2% or above, which included 238 Manchester and Sheffield. However, we generally measured incidence to be lower in other regions compared to the North-West. 239 For example, all LTLAs in the South-West and along the southern coast had low median incidence estimates (below 1%). 240 Though there was regional structure to the magnitude of test positivity, for growth rates we observed spatial variability in 241 areas experiencing high growth in cases and those where incidence was declining (Figure 4, panel B). Areas expressing the 242 greatest heterogeneity were regionally disconnected (Figure 4, panel C). LTLAs whose probability of positive heterogeneity 243 exceeded 0.95, thereby indicating high growth rates larger than the surrounding areas, included Erewash in the East (median 244 heterogeneity: 1.42), Sefton in the North-West (median heterogeneity: 1.00), Bedford in the east (median heterogeneity: 0.79), 245 and Bolton in the North-West (median heterogeneity: 0.50). For Bolton, our heterogeneity measure suggested that area was 246 having a localised increase (>99% probability of heterogeneity being greater than 0) rather than a regionally-driven event. 247 Through concurrently considering the growth rate and the proportion of tests with a positive result, we could discern those 248 . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. The latent structure of both models includes a Gaussian process (GP) that interpolates the epidemic curve and approximates 260 the underlying process that generates the disease incidences. We then take samples of the derivative of the GP to estimate the 261 growth rate. The models are implemented using the Laplace approximation incorporated in the INLA package in R. Both models 262 use data on positive reported infections, while the proportions model also incorporates testing counts, enabling us to account for 263 changes in test-seeking behaviour. We believe our approach has four benefits over existing methods. Firstly, it is rapid, robust and 264 computationally efficient -all of which are considerable advantages when dealing with a rapidly changing epidemic in multiple 265 spatial locations. Secondly, by focusing on growth rate rather than the reproduction number, we by-pass the complexities of 266 estimating generation-time distributions that can substantially hinder other methods early in an outbreak. Thirdly, the combined 267 . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. (left) Estimated median of posterior of incidence, with lighter shading corresponding to higher incidence estimates. (center) Estimated median of posterior of growth rate (red shading when greater than zero, blue shading when less than zero). Regions with thicker borderlines correspond to LTLAs where the probability that the growth rate is greater than 0 exceeded 95%. (right) Median of heterogeneity (red shading when greater than zero, blue shading when less than zero). Regions with thicker borderlines correspond to LTLAs where the probability that the heterogeneity is greater than 0 exceeded 95%.
use of positive reported infections and number of tests allows us to deal with the proportion of tests that are positive, a measure 268 that is relatively insensitive to changes in testing behaviour. Finally, the use of Gaussian processes means that the method is also 269 relatively robust to missing data, allowing us to provide continuous estimates even if some of the data streams are considered 270 unreliable (for instance, the high rate of false negatives reported by the Immensa Health Clinic in some regions in the UK in 271 September 2021 (Torjesen, 2021)).

272
Throughout we applied our method to reported cases of SARS-CoV-2 infection in England as confirmed by PCR testing. We 273 perform our analysis both at a national scale (Figure 2) and at a small regional scale (Figures 4 and 5). Our choice of pathogen 274 was determined by the need to quantify and explain the ongoing pandemic, feeding our findings through SPI-M-O (Scientific 275 Pandemic Influenza Group on Modelling, Operational sub-group) to policy advisers. England has seen three major waves of 276 infection, broadly associated with the wild-type, Alpha and Delta variants. The first wave which began in March 2020 led to large 277 numbers of hospital admissions and deaths, but was poorly quantified in terms of infection due to the low level of community 278 testing. The second wave began in September 2020 and peaked in late December 2020 or early January 2021 with over 60,000 279 cases reported on 29th December 2020. The third wave from June 2021 has been characterised by a prolonged period (over 5 280 months) of high cases, but with relatively low hospital admissions and deaths due to high vaccine uptake.

281
The national trends in growth rate highlight the complex pattern of growth (r > 0) or decay (r < 0) over time (Figure 2). 282 Some notable changes that correspond to mitigation activities include: a pronounced negative growth rate during November 283 2020 due to the National 4-week lockdown, although the growth rate had been lower in October 2020 than September 2020;  well over a million (Birmingham), but most contain around 140,000 inhabitants. Performing our analysis at this spatio-temporal 304 scale allows us to identify both highly localised outbreaks (as seen in the maps in (Figure 4) or wider regional trends, enabling 305 scrutiny of locations exhibiting atypical data patterns. Furthermore, introducing a heterogeneity measure enabled comparisons 306 of the growth rates between neighbouring LTLAs. The heterogeneity measure has been used during the pandemic to highlight 307 places with abnormal growth patterns, generally identifying LTLAs with significantly higher growth. The process has also be 308 extended (by considering S-gene target failure) to quantify the spread of new variants (e.g. Alpha and Delta) to pinpoint localities 309 that were increasing above mere noise (Challen et al., 2021). 310 Our analysis of LTLAs was focused around 23rd April 2021; at this time the Delta variant had begun to establish across 311 England (with about 20% of cases attributed to Delta), hospital admissions and deaths were continuing to decline, but community 312 cases had reached a nadir. Understanding the spatial patterns of growth at this time, and linking it to the prevalence of the 313 . CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi.org/10.1101/2022.01.01.21268131 doi: medRxiv preprint Delta variant, was important for assessing the invasion of the new variant. We observe a mixed mosaic of growth rates across 314 England ( Figure 4) with a few regions where the growth rate is significantly above zero. Many of these regions also appear in the 315 heterogeneity map as islands of growth amid a sea of declining cases; which suggests a rapid localised growth in these areas. 316 Focusing on LTLAs that either have high growth rates or high prevalence ( Figure 5) we identify three main grouping that may 317 require further epidemiological investigation. Firstly there are four LTLAs (South Hams, South Northamptonshire, Erewash and 318 Hyndburn) that have high positive growth rates and where we expect cases to continue to rise. Secondly, there is a group of 319 fifteen LTLAs where a high proportion of tests (between 2-4%) are positive; of these Bolton, Trafford and again Hyndburn (all in 320 the North-West of England) are of the greatest concern due to their positive growth rate. Finally, Selby in the North-East of 321 England (clearly identifiable on the incidence map of Figure 4) has an extremely high proportion of tests that are positive, and 322 while the mean growth rate is slightly below zero this is not statistically significant suggesting that cases will remain high over 323 the short-term. 324 Our approach for estimating the growth rate is a purely statistical method and therefore has limitations. First, the model is 325 non-mechanistic and does not incorporate any epidemiological assumptions. Therefore, it is not suitable for predicting future 326 changes in infections or making long term forecasts, particularly as it cannot account for the depletion of susceptible through 327 infection or vaccination. Second, we assume that the spatial regions investigated are independent and homogeneous, we do not 328 account for the movement of infection between regions (Kraemer et al., 2021) nor the spatial and social structure within a region. 329 A lack of internal structure could be important for public-health concerns; for example, an outbreak that is primarily increasing in 330 the young has very different health implications compared to one that is increasing in the elderly. There is no reason why richer 331 data structures cannot be incorporated within our methodology (for example looking at the growth rate in a set of age-groups), 332 but such an analysis requires large amounts of data and is increasing complex to interpret. Third, the data analysed in this study 333 comes from PCR testing (or individuals that have performed a lateral flow test followed by PCR). Therefore, there are limitations 334 due to specificity and sensitivity of the test and the ability of individuals to swab reliably. Associated with this, and discussed 335 above, changes to test-seeking behaviour beyond a simple increase in testing could introduce a range of biases. It is important to 336 stress that throughout we are fitting to positive tests not infections, although we believe the two are highly correlated. Finally, 337 though Gaussian processes provide a flexible tool, some prior knowledge of the patterns of the disease is required to inform the 338 subjective choice of the covariance function and its priors. If the data sources are not consistent over the time course of the study, 339 it will affect both models. Moreover, abrupt changes in the epidemic curve are harder to pick for certain covariance functions 340 (e.g. smooth covariance functions). This highlights the need for further studies around how to design more complex covariance 341 functions that allow such abrupt changes to be captured. 342 In summary, we have presented a general structure for estimating instantaneous growth rates that uses a Bayesian hierarchical 343 model to fit a Gaussian process to the epidemiological data. Applied to high-resolution spatio-temporal SARS-CoV-2 case and 344 testing data from England, we have demonstrated the ability of parsimonious models estimating instantaneous growth rate to both 345 determine long-term patterns of growth at a national-scale, and highlight growth and spatial heterogeneity at a regional-scale. 346

347
We are grateful to Nick Gent and Public Health England (now the UK Health Security Agency) for providing access to data on 348 positive and negative PCR tests by LTLA. The ethics of the use of these data for these purposes was agreed by the UK Health Code for the study is available at GitHub/juniper-consortium/growth-rate-estim.

352
. CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.
is the author/funder, who (which was not certified by peer review) The copyright holder for this preprint this version posted January 5, 2022. ; https://doi.org/10.1101/2022.01.01.21268131 doi: medRxiv preprint CC-BY 4.0 International license It is made available under a has granted medRxiv a license to display the preprint in perpetuity.