Estimating the epidemic reproduction number from temporally aggregated incidence data: a statistical modelling approach and software tool

Background The time-varying reproduction number (Rt) is an important measure of epidemic transmissibility; it can directly inform policy decisions and the optimisation of control measures. EpiEstim is a widely used software tool that uses case incidence and the serial interval (SI, time between symptoms in a case and their infector) to estimate Rt in real-time. The incidence and the SI distribution must be provided at the same temporal resolution, which limits the applicability of EpiEstim and other similar methods, e.g. for pathogens with a mean SI shorter than the frequency of incidence reporting. Methods We use an expectation-maximisation algorithm to reconstruct daily incidence from temporally aggregated data, from which Rt can then be estimated using EpiEstim. We assess the validity of our method using an extensive simulation study and apply it to COVID-19 and influenza data. The method is implemented in the opensource R package EpiEstim. Findings For all datasets, the influence of intra-weekly variability in reported data was mitigated by using aggregated weekly data. Rt estimated on weekly sliding windows using incidence reconstructed from weekly data was strongly correlated with estimates from the original daily data. The simulation study revealed that Rt was well estimated in all scenarios and regardless of the temporal aggregation of the data. In the presence of weekend effects, Rt estimates from reconstructed data were more successful at recovering the true value of Rt than those obtained from reported daily data. Interpretation Rt can be successfully recovered from aggregated data, and estimation accuracy can even be improved by smoothing out administrative noise in the reported data. Funding MRC doctoral training partnership, MRC centre for global infectious disease analysis, the NIHR HPRU in Modelling and Health Economics, and the Academy of Medical Sciences Springboard, funded by the AMS, Wellcome Trust, BEIS, the British Heart Foundation and Diabetes UK.

One of the most popular tools for real-time Rt estimation, the R package EpiEstim, relies on observing the incidence data and supplying an estimated serial interval (SI) distribution -the time between symptom onset Figure 1. Schematic of the EM algorithm approach used to reconstruct daily incidence (It) from weekly aggregated incidence data (Iw). The algorithm is initialised with a naive disaggregation of the weekly incidence 88 (assuming constant daily incidence throughout the aggregation window). The resulting daily incidence is then 89 used to estimate the reproduction number for each aggregation window, in this case for each week, Rw. Rw is 90 converted into a growth rate (see eq. 2), which is in turn used to reconstruct daily incidence data, whilst 91 ensuring that if It were to be reaggregated it would still sum to the original weekly totals. The process cycles 92 between the expectation and maximisation steps until convergence.

94
Initialisation 95 The algorithm is initialised with naively disaggregated incidence data. For weekly data, the total incidence for 96 each week is split evenly over 7 days (allowing for non-integers).

98
Expectation 99 The current reconstructed It is used to estimate the expected reproduction number for each aggregation

107
It for that week is then computed assuming exponential growth, with a multiplying constant kw ensuring that 108 when reaggregated, the reconstructed It matches the original Iw: 109 110

114
The process is repeated iteratively until convergence, at which point It can be used to estimate the full 115 posterior distribution of Rt using EpiEstim. For this final step, Rt can be estimated on any time window.

116
Expectation of R w | I t Maximisation of I t | R w , I w Initialisation of I t | I w is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

119
We chose datasets where incidence data was available daily, and then artificially aggregated them to weekly 120 counts. Rt was estimated from daily incidence that was reconstructed from weekly aggregated data using our refer to those estimates as daily Rt estimates and weekly Rt estimates respectively. cases, we obtained ninety-seven weeks of data (21 st February 2020 to 30 th December 2021) for incidence by 143 date of specimen, which is the date that a sample was taken from an individual which later tested positive.

144
For COVID-19 deaths, we used ninety-six weeks of data (2 nd March 2020 to 2 nd January 2022) for incidence by 145 date of death within twenty-eight days of a positive test. We assumed a mean SI of 6.1 days and SD of 4.2 146 days. 12

148
Simulation study

150
We considered scenarios where Rt either remained constant or varied over time, with a stepwise or gradual 151 change. For each scenario, one hundred seventy-day epidemic trajectories were simulated using a Poisson 152 branching process as implemented in the R package projections. 20 Daily datasets were aggregated weekly and 153 used to estimate Rt using the proposed method; these values were compared to Rt estimates obtained from 154 simulated daily data using the original EpiEstim R package. We explored the impact of weekend effects on Rt 155 estimates, the ability to supply alternative temporal aggregations of data e.g., three-day, ten-day, or two-156 weekly aggregations, and finally, the number of iterations required to reach convergence when reconstructing 157 daily incidence data. The full simulation study description and details can be found in the appendix.

159
Role of funding source 160 161 . CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint or writing of the report. and the daily incidence that has been reconstructed from weekly aggregated data, respectively.

171
The reconstructed incidence of influenza was much smoother than the reported incidence, which showed

177
In contrast, mean daily Rt estimates differed markedly depending on whether the reported or reconstructed 178 data were used, with an R 2 of 0.13 and much higher mean Rt and uncertainty in estimates obtained from 179 reported data ( Figure 2E-F). Higher mean Rt estimates coincided with large peaks in the reported daily 180 incidence (typically on Mondays), as daily Rt estimates were not smoothed and therefore more affected by 181 intra-weekly variability (appendix p2). The overall agreement in the classification of daily Rt estimates was 182 much lower, with only 44.4% agreement (appendix p9).

184
In this case study, the greatest differences in Rt estimates tended to correspond to time periods when the 185 reported and reconstructed incidence data were most dissimilar ( Figure 2B & appendix p3). There was no 186 apparent pattern in the estimates with regard to the outbreak phase, i.e. early, mid or late-phase, but this is 187 likely due to this dataset being a snapshot of incidence taken from within an established epidemic ( Figure 2).

189
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint Figure 2. Rt estimates from daily incidence that was either reported or reconstructed from weekly aggregated influenza data. A) The reported (grey) and reconstructed (green) daily incidence of influenza by date of plotted on the last day of the time window used for estimation (i.e., starting on day 9 (19 th December) for daily 197 estimates and day 14 (24 th December) for weekly estimates). Note: the x-axis is shared with the incidence plot 198 above. C & E) Correlation between the weekly sliding (C) and daily (E) mean Rt estimates using reconstructed 199 data (y-axis) and reported daily data (x-axis). Vertical and horizontal lines depict the 95% credible intervals and   is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint across both approaches (R 2 = 0.97, Figure 3F). Most of the discrepant Rt estimates and higher levels of uncertainty coincide with the early phase of the outbreak when incidence was lower ( Figure 3E-F). Outside of agreement for daily and weekly sliding Rt estimates respectively (appendix p9).

232
for weekly estimates). Note: the x-axis is shared with the incidence plot above and the y-axis has been limited 233 to 0.5 for clarity. C & E) Correlation between the weekly sliding (C) and daily (E) mean Rt estimates using 234 reconstructed (y-axis) and reported (x-axis) daily data, excluding the first 30 days due to low incidence. Vertical   is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint

245
The reported incidence of COVID-19 deaths was much less influenced by day-to-day variation. The reconstructed daily incidence was more similar to the observed daily data than in the previous case studies estimates (appendix p9). Discrepancies between the two mostly coincide with periods of particularly low 251 incidence of deaths ( Figure 4B & appendix p7). The overall lower incidence of COVID-19 deaths compared to   is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint estimates, defined as the width of the 95% credible intervals, using the reconstructed (y-axis) and reported confidence intervals (grey shading). Dashed lines represent the x = y line.

275
In all case-studies, incidence reconstructions converged within 10 iterations of the EM algorithm. The overall 276 process of Rt estimation from weekly aggregated data took three seconds or less to run on MacOS (2 GHz 277 Quad-Core Intel Core i5) 16GB RAM (appendix p10); the influenza scenario, with over 57,000 cases, took two 278 seconds to run, whilst the COVID-19 cases and deaths scenarios, with an overall incidence over 149,000 and 279 13 million cases respectively, took three seconds to run.

281
Simulation study

283
The method performed well across all scenarios, successfully estimating Rt from the aggregated simulated 284 data (appendix pp10-20). Convergence of the EM algorithm was quick, with negligible differences in the 285 reconstructed incidence beyond 5 iterations (appendix p22).

287
When introducing weekend effects into simulated data, Rt estimates from reconstructed incidence were more 288 successful at recovering the true value of Rt than when using reported incidence (appendix p21). The method 289 can also be successfully applied to other temporal aggregations of data, e.g. three-, ten-or fourteen-day 290 windows (appendix pp22-24).

294
Estimates of the time-varying reproduction number (Rt) have frequently been used to inform and guide 295 policymaking during outbreaks, and a commonly used approach to estimate Rt is EpiEstim, which relies on 296 daily incidence data. However, maintaining daily incidence databases requires substantial time and 297 investment in resources, which is not always feasible, particularly for less acute or routinely reported diseases.

298
Therefore, in practice, many diseases are not reported on a daily basis, including influenza and other notifiable 299 diseases in the UK and US. 2-5 As the COVID-19 pandemic persists, daily reporting is also becoming less 300 common. 21 Coarsely aggregated data can be challenging to deal with in the context of Rt estimation methods, 301 restricting their applications in certain contexts. In this study, we develop a statistical framework and tool that 302 allows Rt estimation from aggregated incidence without introducing bias. Using influenza and COVID-19 data, 303 alongside a simulation study, we demonstrate how a simple expectation-maximisation algorithm approach 304 can rapidly reconstruct daily incidence data and accurately estimate Rt.

306
In all case studies, direct comparisons between weekly sliding Rt estimates show that very similar estimates 307 can be made from the reported daily incidence and the reconstructed daily incidence from weekly aggregated 308 data. However, daily Rt estimates are more influenced by noise, such as intra-weekly variability, leading to is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint substantially when some of the variability in the reported data was smoothed by estimating Rt using weekly sliding windows (appendix pp8-9).
Despite both being affected by weekly periodicity in reporting, concordance of Rt estimates obtained from aggregation windows (appendix pp11-12). This occurs because in reconstructing daily incidence we impose that, if it were to be re-aggregated, it would match the original data. Methods that simply fit smoothing splines equal to or longer than the length of aggregation windows to reduce the impact of discontinuities on estimates Alternative approaches include modelling frameworks implemented in the Epidemia and EpiNow2 R 368 packages. 6,22,26 Daily infections are modelled as a latent process, back-calculated from observed data on cases 369 or deaths, depending on an appropriate infection to observation distribution. In addition, Epidemia integrates 370 further information, such as the infection ascertainment rate (for cases) or the infection fatality rate (for 371 deaths). 22 This facilitates a 'nowcasting' approach, allowing users to estimate Rt directly from the unobserved 372 infections, but they typically require more data (e.g. incidence of deaths and cases), more assumptions (e.g.

373
delay distributions and ascertainment rates), and are much more computationally intensive, which can be a 374 barrier to the adoption of such methods by users. 14 375 376 Here, Rt estimates are based on a single daily incidence reconstruction, meaning Rt can be estimated very 377 rapidly from aggregated data, which is particularly desirable during real-time outbreak analysis. 14 A potential 378 downside is that uncertainty in Rt estimates could be underestimated. However, the simulation study showed 379 that the 95% credible interval of estimates encompassed the correct value of Rt the majority of the time, and 380 we found no substantial indication that this approach detrimentally affected our characterisation of the 381 uncertainty.

383
Given that this method is directly derived from EpiEstim, it relies on similar assumptions and caveats. 15,27 As 384 time of infection is more difficult to observe than symptom onset, the SI is typically used as an approximation 385 of the generation time in the renewal equation, which may introduce bias. 28 The SI, the level of undetected 386 cases, and the reporting rate are assumed to remain constant, which is often not the case in practice. Factors 387 such as changes in population immunity, and the introduction of interventions, can alter the SI throughout an 388 epidemic. 29 Whilst changing case definitions, new testing practices, and increased healthcare-seeking 389 behaviour, can all affect case ascertainment. 15 Parameters chosen by users can also influence estimation 390 accuracy, for instance, the time window length for temporal smoothing and the prior for Rt. 27

392
To make the method simple to implement for current and future users of EpiEstim, this extension has been  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted December 13, 2022. ; https://doi.org/10.1101/2022.12.08.22283241 doi: medRxiv preprint reported daily data. This extension is easy to use and computationally efficient, which will enable