Use of penalized basis splines in estimation of characteristics of seasonal and sporadic infectious disease outbreaks

There is often a need to estimate the characteristics of epidemics or seasonality from infectious disease data. For instance, accurately estimating the start and end date of respiratory syncytial virus (RSV) epidemics can be used to optimize the initiation of prophylactic medication. Many widely-used methods for describing these characteristics begin with a regression model fit to a time series of disease incidence. The fitted model is then often used to calculate the quantities of interest. Calculation of these quantities from the fitted regression model typically involves combining together different components of the fitted model, and consequently only point estimates (rather than measures of uncertainty) of those quantities can be made in a straightforward way. Motivated by attempts to estimate the optimal timing of prophylaxis for RSV, we developed a general method for obtaining confidence intervals for characteristics of seasonal and sporadic infectious disease outbreaks. To do this, we use multivariate sampling of a generalized additive model with penalized basis splines. Our approach provides robust confidence intervals regardless of the complexity of the calculations of the outcome measures, and it generalizes to other systems (including outbreaks of other infectious diseases). Here we present our general approach, its application to RSV, and an R package that provides a convenient interface for conducting and validating this type of analysis in other areas.


Introduction
Analyses of infectious disease dynamics often make inferences based on timing for 2 various quantities of clinical and public health interest; for example, the start and the 3 end of seasonal outbreaks, duration of outbreaks, and/or optimal timing of prophylaxis 4 or treatment. Some aspects of timing can be estimated ad-hoc directly from disease 5 incidence data; more typically, a regression model (for example, a generalized linear 6 model (GLM) or a generalized additive model (GAM)) is fitted to a time series of 7 observed disease counts, and the calculations of the quantities of interest are made 8 based on combinations of parameters estimated from the fitted model. 9 Calculating the uncertainty of estimates from these types of GAM/GLM-based 10 methods presents a challenge under frequentist approaches. These challenges are 11 exemplified by analyses of the dynamics of respiratory syncytial virus (RSV) -a 12 respiratory infection that has annual or biennial seasonal pattern and is a major 13 infectious cause of infant morbidity and mortality globally. [1,2] In some countries, 14 high-risk infants are administered antibody prophylaxis throughout the period of high 15 RSV incidence. The therapy is expensive, and the duration of antibody protection is 16 short-lived, [3,4] so administration of prophylaxis needs to be timed with the RSV 17 season to provide optimal protection. [5] Therefore, estimating the timing of initiation 18 of the RSV season in different locations is important.

19
Frequentist model fitting procedures provide measures of uncertainty for the 20 individual model parameters. However, calculating the quantities of interest from those 21 model parameters often requires differentiation or integration with respect to time (for 22 example, to identify outbreak transition points or to determine overall disease burden), 23 which can complicate the process of deriving large sample standard errors. As a result, 24 these tools in RSV research, such as those evaluated in [6], typically ignore uncertainty 25 and focus solely on estimation. When they do consider uncertainty estimation, these 26 methods in infectious disease modeling are often either complex to use (for example, 27 block bootstrapping), computationally intensive, or both. Parameter resampling is an 28 approach that offers uncertainty estimation, balances ease of use with computational 29 efficiency, and has been previously applied in infectious disease modeling. [7,8]

30
In this study, our goal was to generalize prior work on parameter resampling to 31 develop a general framework of retrospective analysis of infectious diseases that allows 32 estimation of uncertainty intervals for time-based outbreak characteristics (such as 33 outbreak onset time) while being flexible as to the precise definition of the 34 characteristics of interest. To that end, we will show how functional data analysis using 35 GAMs with penalized splines can be used to fit smooth curves to infectious disease 36 incidence data, and how the fitted model can then be used to calculate uncertainty 37 intervals for characteristics of the epidemic, such as the start and end date of seasonal 38 epidemics.

40
We developed this method in the context of our RSV research, in which the principal 41 questions were:

42
• Is the period during which RSV prophylaxis is administered well-matched to the 43 timing of RSV seasons?

44
• Could the prophylaxis period be adjusted so that more of the prophylaxis-eligible 45 RSV cases fall within the prophylaxis period?

46
To answer these questions, we defined the following outcome measures:

47
• Season onset and offset, defined as the points in time when the cumulative 48 incidence of RSV rises above 2.5% and 97.5% of the total incidence of RSV for the 49 entire season.

50
• Preventable fraction of a prophylaxis regimen, defined as the fraction of all 51 prophylaxis-naive RSV cases occurring during the period when the prophylaxis 52 regimen provides protection from RSV.

53
Our approach is to fit a family of smooth curves through the observed disease 54 incidence time series, and to infer time-domain features of the outbreak from these 55 curves using parameter resampling. The fitting curves we use are penalized basis splines 56 (P-splines), which have several desirable qualities, including computational compactness 57 and efficiency and well-understood statistical behavior. Most notably for our 58 application, P-spline parameters estimated by GAM are asymptotically normally 59 distributed, provided that the underlying phenomenon is well-approximated by a 60 piecewise-polynomial function. [9] Additionally, we developed a simulation study to 61 validate this interval estimation method for our outcome measures.  To estimate the onset and offset for this RSV season, we begin by fitting a cyclic 65 P-spline GAM (see Supplement) to the observations, using incidence at time t, y(t), as 66 the model response variable, the log link function, and a 20-term cubic spline basis, 67 B k (t), as the only predictor such that The best fit produced by this model (using a 2nd degree smoothing penalty) is  July 14, 2020 3/9 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 17, 2020. . https://doi.org/10.1101/2020.07.14.20138180 doi: medRxiv preprint Using the asymptotic normality of the large-sample posterior distribution of β k , we 71 can use parameter estimates (and uncertainty estimates) from the P-spline GAM to 72 obtain a large sample credible interval for λ(t) by sampling β k from this distribution  Having obtained a smooth model with confidence intervals of the expected incidence, 80 we can turn to our first outcome measures of interest, time of outbreak onset (t on ) and 81 offset (t off ), defined by:  Given any λ(t) sampled via β k sampling, we numerically solve for the corresponding 83 t on and t off , thereby indirectly sampling t on and t off from their sampling distributions. 84 The same five y(t) with their corresponding t on and t off are shown in Fig 2c.

85
July 14, 2020 4/9 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 17, 2020. . https://doi.org/10.1101/2020.07.14.20138180 doi: medRxiv preprint  90 We defined our other outcome measure, the preventable fraction (F p ) of a 91 prophylaxis regimen that offers protection between t start and t end , as: We repeated the above method to estimate the preventable fraction of the 93 prophylaxis regimen recommended by the American Academy of Pediatrics as well as 94 several alternative prophylaxis regimens, which then enabled us to assess the potential 95 benefit of adjusting the prophylaxis regimen to better align with the local RSV season. 96 Simulation study 97 We began by generating N season = 60 RSV seasons. Each season was generated as

Calculate
to produce a curve roughly of the desired shape (rising from 0 to λ max , then 103 falling back to 0, between t start and t end ). 104 3. Fit a log-linked cubic P-spline to λ * (t) to obtain true instantaneous incidence 105 λ 0 (t); this ensures that the true incidence is smooth in the second derivative.

106
The number of seasons was chosen to balance power of the simulated outcome 107 estimation with computational demand. The ranges of the parameters for the simulated 108 seasons were chosen to include the observed characteristics in our actual dataset of 109 RSV-related hospitalizations in Connecticut.

110
For each season, we then generated N obs = 60 sets of observations by sampling  Finally, we used the method described above to obtaint on andt off estimates for each 114 of the N season · N obs = 3600 sets of observations, in order to verify that the true values 115 of t on and t off calculated directly from λ 0 (t)) were contained in the 95% CI ≈95% of 116 the time. We repeated the same process to validate the preventable fraction estimates. 117

118
To evaluate the results of our method, we considered three performance metrics:

119
• Coverage: the fraction of estimated 95% confidence intervals that include the true 120 value of the outcome measure.

121
July 14, 2020 5/9 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted July 17, 2020. . https://doi.org/10.1101/2020.07.14.20138180 doi: medRxiv preprint  The process of performing a simulation study is straightforward, but the measures of 125 robustness it yields must be interpreted in the context of the research at hand. In our 126 particular case, the MAE of onset/offset estimates (∼ 0.15 weeks, see Table 1) was 127 small compared to time periods of interest in our analysis (which were 1 weeks or more). 128 The MAE for preventable fraction was similarly non-consequential to our results.

129
Overall, the simulation study showed that our estimation method for onset, offset, and 130 preventable fraction met the needs of our RSV research.

132
In this study, we demonstrate how P-spline regression, combined with parameter 133 resampling, can be used to generate uncertainty intervals for quantities derived from 134 epidemic disease data. This is a general approach that can be applied to other to other 135 systems, such as outbreaks of other infectious diseases, as well as other outcome 136 measures. 137 We focus on specific outcome measures (onset/offset time and preventable fraction) 138 that are relevant for RSV. Validating our estimation of those quantities using a 139 simulation study gives methodological grounding to this approach. Direct public health 140 applications of this approach include investigation of the potential benefits of 141 region-specific RSV prophylaxis guidelines.

142
To facilitate application of P-spline estimation to other research, we wrote an R 143 package, pspline.inference [12], that encapsulate our methods in an easy-to-use 144 programming interface. It allows the user to obtain interval estimates of a user-defined 145 outcome measure (as long as it's computable from the model response variable), 146 regardless of how complex its computation; it also provides an easy way to conduct a 147 simulation study to validate estimation of the user-defined outcome measure on  151 We considered including a head-to-head comparison between our approach and any 152 number of alternative approaches to estimation of time-based outbreak characteristics. 153 However, such comparison would have required defining what it means to compare a 154 point-estimation method to an interval-estimation method and what it means to 155 compare two methods of estimating onset and offset in the absence of a standardized 156 meaning of onset and offset. We therefore decided head-to-head comparison was not of analysis will have to be validated independently. Our R package makes this validation 162 simple to perform, but the validation is inherently time-consuming; whereas our analysis 163 took minutes-to-hours to run, validation of our analysis took hours-to-days. Even with 164 that computational cost, our approach is computationally less demanding than Bayesian 165 Markov chain Monte Carlo (MCMC) approaches on datasets of comparable size, because 166 the computationally expensive step in our approach is validation, not the analysis itself. 167 The estimates produced by our method are likely to be biased (as shown in our RSV 168 application); measuring that bias (using our validation tools) and assessing its impact 169 on results is, necessarily, left up to the individual application.

170
Our method is inherently retrospective; the strengths of our approach come from 171 considering the incidence time series as a whole, and it therefore cannot, in its current 172 form, be applied to real-time analysis.

173
In summary, ours is a method of retrospective analysis of infectious disease 174 outbreaks based on P-splines; it is computationally less demanding than Bayesian 175 MCMC methods, yet, unlike the common frequentist approaches, it allows for more 176 straightforward interval estimation of time-based outbreak characteristics. Although we 177 only validated it for our analysis of RSV seasonality, it is applicable to other similar 178 systems, and by publishing it on CRAN, we hope to facilitate its use in other analyses. 179