## Abstract

The moving epidemic method (MEM) and the WHO method are widely used to determine intensity levels for seasonal influenza. The two approaches are conceptually similar, but differ in two aspects. Firstly, the MEM involves a log transformation of incidence data, while the WHO method operates on the original scale. Secondly, the MEM uses more than one observation from each past season to compute intensity thresholds, fixing the total number to include. The WHO method uses only the highest value from each season. To assess the impact of these choices on thresholds we perform simulation studies which are based on re-sampling of ILI data from France, Spain, Switzerland and the US. When no transformation is applied, a rather large proportion of season peaks are classified as high or very high intensity. This can be mitigated by a logarithmic transformation. When fixing the total number of included past observations, thresholds increase the more seasons are available. When only few are available, there is a high chance of classifying new season peaks as high or very high intensity. We therefore suggest using one observation per season and a log transformation, i.e. a hybrid of the default settings of the MEM and WHO methods.

## 1 Introduction

Following the 2009 influenza H1N1 pandemic, the need for a rapid assessment tool for influenza intensity was recognized. The *Review Committee on the Functioning of the International Health Regulations and on Pandemic Influenza (H1N1)* recommended that member states perform yearly updates and evaluations of intensity thresholds (WHO, 2011, p.118). In the subsequent *WHO Pandemic Influenza Severity Assessment* (PISA) guideline (WHO, 2017) the so-called WHO method (WHO, 2014) and the moving epidemic method (MEM; Vega et al. 2013, 2015) were recommended to this end. The latter, which is also employed by the European Centre for Disease Prevention and Control (e.g., ECDC 2017), has been adopted by numerous national public health agencies (e.g., Dickson et al. 2020, Rakocevic et al. 2019, Redondo-Bravo et al. 2020, Vos et al. 2019); see Supplement C for an overview of recent applications. The statistical properties of the MEM and WHO methods, however, have not yet been studied in detail. We here address this aspect via simulation experiments based on re-sampling of historical influenza data. Our results indicate that it may be beneficial to use a hybrid of the default settings of these two methods.

## 2 Definition of the moving epidemic and WHO methods

We describe the computation of influenza/influenza-like illness (ILI) intensity thresholds, framing the MEM and WHO methods as two special cases of the same general approach. We assume that thresholds are based on weekly data (typically incidences) from *m* past seasons and applied to the (*m* + 1)-th season. Vega et al. (2015) recommend to use 5 ≤ *m* ≤ 10 seasons for the MEM to ensure a recent data basis. Computation of thresholds then proceeds as follows:

Select the

*n*largest observations from each of the*m*past seasons to construct a set of reference observations.Apply an (invertible) transformation to all selected observations.

Assume that the

*m*×*n*transformed reference observations come from a normal distribution and compute estimates ,*s*of its mean and standard deviation.Define intensity thresholds on the transformed scale as quantiles of the normal distribution . A common choice for both methods is

the 40th percentile as the threshold between low and medium intensity;

the 90th percentile as the threshold between medium and high intensity;

the 97.5th percentile as the threshold between high and very high intensity.

Transform thresholds back to the original scale (e.g. using the exponential function if a log transformation was used).

The MEM and WHO methods are special cases of this procedure:

In the WHO method,

*n*= 1 observation per season is used and by default no transformation is applied. If peak incidences vary strongly across seasons, a log transformation is recommended (WHO, 2017).In the MEM method, the default transformation is the natural logarithm, while the number of included observations per season is set to

*n*= 30*/m*, rounded to the nearest integer. The total number of historical observations is thus kept approximately fixed.

The rationale behind the percentile-based approach is that “about 50–60% of the season peaks should be above the moderate threshold, ±10% above the high threshold and ±2.5% above the extraordinary threshold” (WHO, 2017, p.10).

The implementation of the moving epidemic method in the R package `mem` (Lozano, 2020) permits the user to choose *n, m*, and *f* (`i.type.intensity = 5` for no transformation, `i.type.intensity = 6` for the log transformation). The term *moving epidemic method* could thus also be used as an umbrella term for the general procedure described above. We here use it in a more narrow sense for the specification from Vega et al. (2015), reflected in the default settings of the `mem` package.

In previous works (WHO 2014; Vega et al. 2015), the above thresholds have been described as upper ends of one-sided confidence intervals for the arithmetic (WHO method) or geometric mean (MEM) of the reference observations. This, however, is imprecise terminology as in the computations the standard deviation *s* rather than the standard error is used (see documentation of the `mem` package and WHO 2014, p.69). We note that the use of actual confidence intervals is also possible in the `mem` package (`i.type.intensity = 1` for the geometric, `i.type.intensity = 2` for the arithmetic mean). However, thresholds for all levels will then converge to the arithmetic or geometric mean if sufficient historical data are available. As illustrated in Supplement B, this is not a desirable behaviour.

## 3 Simulation study

### 3.1 Goal

We aim to assess how thresholds depend on the employed transformation, the number of observations *n* used per season and the number *m* of seasons included. In particular, the following aspects are of interest:

The application of a logarithmic transformation is expected to lead to higher thresholds for the extreme categories and thus a lower proportion of seasons peaks classified as high or very high.

In the MEM, the reference set also contains past observations which are not actually peak observations. This may introduce a downward bias in thresholds, especially if due to a small number

*m*of available seasons*n*is large.

The latter aspect is intuitive, but also supported by formal statistical considerations detailed in Supplement A. These imply that thresholds are unbiased for *n* = 1 in the sense that, as intended, they will be exceeded by around 60%/10%/2.5% of the season peaks in the long run (if some auxiliary assumptions are fulfilled). When *n >* 1, thresholds will tend to be lower and exceeded more frequently.

### 3.2 Simulation setup

We compare four versions of the general approach described in Section 2:

No transformation,

*n*= 1. This corresponds to the WHO method.No transformation,

*n*= 30*/m*.Logarithmic transformation,

*n*= 1.Logarithmic transformation,

*n*= 30*/m*. This corresponds to the MEM approach.

In order to closely mimic the seasonal patterns of influenza, we re-sample historical surveillance data rather than generating fully synthetic data. Assume *M* seasons of historical data on a measure of influenza activity are available. We then repeat the following steps 500 times:

Sample a sequence of 15 seasons from the

*M*available seasons. This is done with equal probability for each season and*with replacement*, meaning that the same season can appear more than once. This approach is called the*seasonal block bootstrap*(Politis, 2001).For each value

*m*= 5, …, 15:– Restrict the generated sequence to the first

*m*seasons.– Apply approaches (a)–(d) to compute thresholds for medium (40th percentile), high (90th percentile) and very high intensity (97.5th percentile).

– Compute which fraction of all

*M*available historical season peaks would be classified as low, moderate, high and very high.

The range *m* = 5, …, 15 is motivated by the range of values found in real-world applications, see overview in Supplement C. All analyses were performed using the R language for statistical computing (R Core Team, 2020) and the package `mem` (Lozano, 2020).

### 3.3 Data

We use publicly available influenza surveillance data from four countries. Data on the weekly incidence of influenza-like illness in France, 1985–2019, were obtained from Réseau Sentinelles (INSERM/Sorbonne Université, https://www.sentiweb.fr, Flahault et al. 2006). Data on weekly confirmed influenza cases in Spain, 1998–2019, were extracted from graphs shown in the *Informe Anual* of the *Sistema de la vigilancia de gripe en España* (https://vgripe.isciii.es/; Sistema de Vigilancia de Gripe en España 2019). Weekly ILI counts from Switzerland, 2000–2016, collected by the Swiss Federal Office of Public Health are available in the R package `HIDDA.forecasting` (Held and Meyer, 2019). Weekly weighted ILI (wILI) data at the US national level from CDC FluView (Charbonneau and James, 2019), 1998–2017, were obtained via the CDC FluSight influenza forecasting platform (https://github.com/FluSightNetwork/cdc-flusight-ensemble/). These wILI values correspond to the fraction of general practitioner visits due to influenza-like symptoms. All four time series are displayed in Figure 1, with the pandemic 2009/2010 season removed. We also show boxplots of the first through sixth largest observation per season. Not surprisingly, values on average get smaller for increasing ranks. Interestingly, they also get less dispersed, meaning that variability among e.g. the sixth largest observations per season is smaller than among the peak values.

### 3.4 Results

Results from our simulation study are shown in Figures 2 (France, Spain) and 3 (Switzerland and US). In the first and third column of each figure we also show an analytical approximation of the expected thresholds, which agrees very well with the simulation results; see Supplement A for details. In all four countries, using a log transformation leads to increased thresholds, in particular for the high and very high levels. Especially in France and Spain, thresholds obtained without the log transformation are too low, as can be seen from the large proportion of season peaks classified as very high (around 10% when using *n* = 1). As expected, when letting the number of observations used per season depend on the number of available seasons via *n* = 30*/m*, average thresholds increase in *m*. As a particularly striking example, the average threshold for high intensity in France (when using a log transformation) increases from 834 for *m* = 5, *n* = 6 to 1011 for *m* = 10, *n* = 3 and 1080 for *m* = 15, *n* = 2. For *n* = 1, in which case thresholds can be interpreted as unbiased (Supplement A), the average is 1147. Including historical observations which are not from actual peak weeks thus leads to a considerable lowering of alarm thresholds and will increase the number of alerts for high and very high influenza activity. This is not surprising given the pronounced differences between the distribution of season peak values and e.g. fifth largest observation per season, see right column of Figure 1. In the US and France, on average one in three season peaks is classified as high and one in seven as very high intensity when applying the MEM with log transformation and *m* = 5, *n* = 6. This is rather far from the intended exceedance probabilities of 10% and 2.5%.

When always using *n* = 1, the average thresholds and shares of the different categories are more well-behaved also for small *m*. This holds especially when applying a log transformation, even though certain mismatches with the nominal exceedance probabilities remain. Also, there is considerable variability in the estimated thresholds (shaded areas in Figures 2 and 3). These difficulties, however, are inherent in the problem of estimating a 90% or 97.5% quantile based on 5–10 observations.

## 4 Discussion

We provided a statistical assessment of implementation choices in a widely used framework for the computation of influenza intensity thresholds. We found that applying a log transformation leads to better behaved thresholds and closer to nominal exceedance rates. Concerning the question of how many observations per historical season should be included to compute thresholds, we found that the common choice *n* = 30*/m* results in too low thresholds when few historical seasons are available. We therefore recommend adopting *n* = 1 irrespective of the number of available historical seasons. This is possible in the `R` package `mem` by setting `i.n.max = 1` when the function `memmodel` is called.

We think that a simple and interpretable tool with a well-documented and open source software implementation like `mem` is a valuable tool in practice. The use of a standard approach at the European level will improve comparability of results, facilitating the evaluation and refinement of the tool. With this work we hope to contribute a statistical perspective on this topic, complementing public health practitioners’ experience from applied analyses.

## Data Availability

Materials to reproduce the presented results are available at https://github.com/jbracher/mem.

## Data and code

Materials to reproduce the presented results are available at https://github.com/jbracher/mem.

## Ethics statement

No ethics approval was necessary as this study uses exclusively publicly available data.

## Supplementary Materials

This supplementary material is hosted by Eurosurveillance as supporting information alongside the article *An empirical assessment of influenza intensity thresholds obtained from the moving epidemic and WHO methods*, on behalf of the authors, who remain responsible for the accuracy and appropriateness of the content. The same standards for ethics, copyright, attributions and permissions as for the article apply. Supplements are not edited by Eurosurveillance and the journal is not responsible for the maintenance of any links or email addresses provided therein.

### A Formal statistical analysis

#### A.1 Re-stating the definitions

We start by re-stating the definitions of the MEM and WHO method in a slighty more formal way and introducing relevant notation. We assume again that thresholds are based on data from *m* past seasons and applied to the (*m* + 1)-th season. Thresholds for an intensity measure *X* are then obtained as follows.

Within each historical season

*j*= 1, …,*m*order all observations in decreasing order, denoting the*i*-th largest observation from season*j*by .Select the

*n*largest observations from each of the*m*past seasons to construct a reference set .Apply an (invertible) transformation to all members of the reference set 𝒳 to obtain a reference set 𝒴 of transformed historical observations.

Assume that the transformed values in 𝒴 come from a normal distribution and compute estimates ,

*s*of its mean and standard deviation.Define intensity thresholds on the transformed scale as quantiles of the normal distribution , i.e. compute where

*z*_{α}is the*α*quantile of the standard normal distribution. A common choice for both methods isthe 40th percentile as the threshold between low and medium intensity;

the 90th percentile as the threshold between medium and high intensity;

the 97.5th percentile as the threshold between high and very high intensity.

Obtain thresholds on the original scale by applying the inverse transformation, i.e. setting

*q*_{0.4}=*f*^{−1}(*q*_{Y,0.4}),*q*_{0.9}=*f*^{−1}(*q*_{Y,0.9}),*q*_{0.975}=*f*^{−1}(*q*_{Y,0.975}).

The MEM and WHO methods are special cases of this procedure with the following specifications:

In the WHO method,

*n*= 1 observation per season is used irrespective of the number*m*of historical seasons. The standard procedure is to apply no transformation, i.e. set*f*to the identity function.In the MEM method, the default choice for

*f*is the natural logarithm, while the number of observations per season is set to*n*= 30*/m*. (The exact specification used in the R package,`mem`is*n*= max(1, ⌊ 30*/m*⌋), where ⌊ ⌋ denotes rounding to the nearest integer.) The total number of historical observations is thus kept approximately fixed.

#### A.2 Results

We assume that the different historical seasons are independent realizations of the same random process, which is certainly not strictly true, but should hold in good approximation. Denote by **Y**^{(j)} the random vector of the *n* largest transformed incidences from season *j* in decreasing order, i.e. with . The mean and covariance matrix of **Y**^{(j)} are denoted by
respectively, where to make notation more intuitive we also write . It can then be shown that the expectations of and *S*^{2} when using *m* seasons and *n* observations from each season are given by
respectively. The derivation is provided in the next subsection. For reasons detailed there, if the transformation function *f* is the identity or the natural logarithm,
usually holds in good approximation in our applied setting. It can be shown that for *n* = 1, we have
usually in good approximation. This is a desirable property. Loosely speaking, if in addition the come from a normal distribution, the nominal exceedance probabilities (60%/10%/2.5%) for the different thresholds would in the long run be achieved. If a different value *n >* 1 is chosen, this will generally no longer be the case. Equations (3)–(5) tell us by how much *q*_{α} can be expected to differ from the intended value *f* ^{−1}(*µ*_{1} + *z*_{α}*σ*_{1}). By the definition of the *µ*_{i} (with *µ*_{i} the expectation of the *i*-th largest observation in a given season), decreases in *n*. In most (possibly all) settings this also translates to the *q*_{α}. As a consequence, when choosing *n >* 1, one must expect to classify a larger number of season peaks as high or very high intensity. When choosing *n* = 30*/m* as suggested for the MEM, thresholds will then tend to increase the more years of data are used to compute thresholds.

#### A.3 Derivation

We start by addressing the expectations of empirical mean and variance *S*^{2} of the reference observations, where

It is straightforward to see that

For the variance *S*^{2}, we first note that it can be re-written as

We consider the two terms *a* and *b* separately, starting by

Then we note that

Plugging these results back into equation (8) we obtain

It is straightforward to see that for *n* = 1 the expressions (3) and (9) simplify to
as given in equation (4).

There is no general way of computing the expectation 𝔼 (*S*) from the respective standard deviation, but unless the true distribution of *S*^{2} has strong excess curtosis,
is a reasonable approximation. We can then plug equations (3) and (10) into the formulae for the thresholds *q*_{Y,α} on the transformed scale and obtain
where *z*_{α} is the *α* quantile of the standard normal distribution (with *α* ∈ {0.4, 0.9, 0.975}).

If *f* was set to the natural logarithm, the question remains how to obtain statements concerning thresholds *q*_{α} on the original scale. Approximation via a second-order Taylor expansion yields

Empirically, after transformation to the log scale, the variance of the reference observations is low in our applied setting. The resulting variances of *q*_{Y,α} are then quite small and do not play an important role in equation (11). We can thus use the even simpler approximation

As can be seen from Figures 2 and 3 from the main manuscript, this approximation works very well in praxis. To compute the values indicated by the small black crosses, we just plugged the empirical mean vectors and covariance matrices of the **y**^{(j)} in the respective data sets into equations (3)–(5).

### B Intensity thresholds based on confidence intervals

Previous works have referred to the thresholds discussed in our manuscript as upper ends of one-sided confidence intervals associated with the arithmetic (WHO) or geometric mean (MEM) of the reference observations (WHO 2014; Vega et al. 2015). This, however, is an imprecise use of terminology as the thresholds are computed using the standard deviation *s*. Confidence intervals, on the other hand, would be computed using the standard error . A more precise term would thus be “prediction interval”, which implies that the interval refers to one future realization rather than a theoretical mean.

These terminology questions aside, we note that the `mem` R package also offers threshold computation based on actual confidence intervals (`i.type.intensity = 1` for the geometric, `i.type.intensity = 2` for the arithmetic mean). On the transformed scale these are computed as

These thresholds show somewhat peculiar behaviour. We illustrate this by repeating the simulation study the main manuscript for France with the respective settings. The result is shown in Figure S1.

Here, two different mechanisms are at work. If one chooses *n* = 1 per season irrespective of the number *m* of available seasons, the number of included observations *mn* increases in *m*. With growing *m* this leads to confidence intervals which get narrower and a funnel-like pattern in the different thresholds (bottom row of Figure S1). If one chooses *n* = 30*/m*, the number *mn* will remain constant (or approximately, as some rounding is necessary). The funnel-like shape is thus not observed and an increasing pattern like in our main analyses emerges instead. In both cases, there is an excessive number of seasons classified as very high intensity. This reflects the fact that the extreme thresholds necessarily get smaller when using the standard error rather than standard deviation in equation (13). We thus conclude that these variants of the moving epidemic method are not advisable in practice.

### C An overview of recent applications

To get a better understanding of the different settings in which the MEM and WHO methods are applied in practice we performed a literature search of articles published in English and citing the papers Vega et al. (2015), WHO (2014) and WHO (2017) until December 2020 (identified via *CrossRef* and *Google Scholar*). The results are summarized in Table S1.

As can be seen from the large number of entries from the years 2019 and 2020, the MEM has quickly become a standard approach in the determination of intensity thresholds for influenza and other respiratory diseases. Indeed, the contributions come from numerous countries and in many cases have been co-authored by representatives of national or regional public health agencies. In most analyses, the suggested threshold levels at the 40th, 90th and 97.5th percentile as described in Section A.1 are used. Variability with respect to the number *m* of historical seasons included is considerable, with a range from 3 to 16 seasons. Consequently, the number *n* of observations included per season ranges from two to ten (none of the above papers indicated a modification of the default setting *n* = 30*/m* of the moving epidemic method). We only found three published applications of the WHO method, one of them providing a comparison to the thresholds from the MEM method (Rguig et al., 2020).

## Acknowledgements

We would like to thank Laura Werlen and Daniel Wolffram for helpful feedback. Johannes Bracher was supported by the Helmholtz Foundation via the SIMCARD Information and Data Science Pilot Project.