## Abstract

When new pathogens emerge, numerous questions arise about their future spread, some of which can be addressed with probabilistic forecasts. The many uncertainties about the epidemiology of emerging pathogens can make it difficult to choose among model structures and assumptions, however. To assess the potential for uncertainties about emerging pathogens to affect forecasts of their spread, we evaluated the performance of a suite of 16 forecasting models in the context of the 2015-2016 Zika epidemic in Colombia. Each model featured a different combination of assumptions about the role of human mobility in driving transmission, spatiotemporal variation in transmission potential, and the number of times the virus was introduced. All models used the same core transmission model and the same iterative data assimilation algorithm to generate forecasts. By assessing forecast performance through time using logarithmic scoring with ensemble weighting, we found that which model assumptions had the most ensemble weight changed through time. In particular, spatially coupled models had higher ensemble weights in the early and late phases of the epidemic, whereas non-spatial models had higher ensemble weights at the peak of the epidemic. We compared forecast performance of the equally-weighted ensemble model to each individual model and identified a trade-off whereby certain individual models outperformed the ensemble model early in the epidemic but the ensemble model outperformed all individual models on average. On balance, our results suggest that suites of models that span uncertainty across alternative assumptions are necessary to obtain robust forecasts in the context of emerging infectious diseases.

## 1 Introduction

Pathogen emergence, or the phenomenon of novel or established pathogens invading a new host population, has been occurring more frequently in recent decades [1]. In the last 40 years, more than 150 pathogens of humans have been identified as emerging or re-emerging [2, 3]. In these situations, host populations are largely susceptible, which can result in dynamics ranging from self-limiting outbreaks, as with Lassa virus [4], to sustained pandemics, as with HIV [5], depending on the pathogen’s traits and the context in which it emerges. When emergence does occur, mathematical models can be helpful for anticipating the future course of the pathogen’s spread [6, 7, 8].

A necessary part of using models to forecast emerging pathogens is making decisions about how to handle the many uncertainties associated with these unfamiliar microbes [8]. Given the biological and ecological diversity of emerging pathogens, there is often considerable uncertainty about various aspects of their natural histories, such as their potential for superspreading [9], the role of human mobility in their spatial spread [10, 11], drivers of spatiotemporal variation in their transmission [6, 12], and even their modes of transmission [13]. In the case of MERS-CoV, for example, it took years to determine that the primary transmission route was spillover from camels rather than sustained human-to-human transmission [14]. A lack of definitive understanding about such basic aspects of natural history represents a major challenge for forecasting emerging pathogens.

Inevitably, different forecasters make diverse choices about how to address unknown aspects of an emerging pathogen’s natural history, as they do for numerous model features. This diversity of approaches has itself been viewed as part of the solution to the problem of model uncertainty, based on the idea that the biases of different models might counteract one another to produce a reliable forecast when viewed from the perspective of an ensemble of models [15]. This idea has support in multi-model efforts to forecast seasonal transmission of endemic pathogens, such as influenza and dengue viruses [16, 17, 18, 19, 20], with ensemble forecasts routinely outperforming individual models. These successes with endemic pathogens have motivated multi-model approaches in response to several emerging pathogens, including forecasting challenges for chikungunya [21] and Ebola [22], vaccine trial site selection for Zika [23], and a multi-model decision-making framework for COVID-19 [15, 24].

Although there has been increased attention to multi-model forecasting of emerging pathogens in the last few years, these initiatives have involved significant effort to coordinate forecasts among multiple modeling groups [25, 26]. Coordination across multiple groups has clear potential to add value beyond what any single modeling group can offer alone. At the same time, using multiple models to hedge against uncertainties about a pathogen’s natural history could potentially improve forecasts from a single modeling group, too [16, 18]. This could, in turn, improve ensemble forecasts based on contributions from multiple modeling groups. An ensemble-based approach by one modeling group that contributes to forecasts of seasonal influenza in the United States demonstrates the success that a single modeling group can achieve with an ensemble-based approach [27], and that such an ensemble can contribute value to an ensemble of forecasts from multiple modeling groups [18]. Similar approaches have not been widely adopted for forecasting emerging pathogens by a single modeling group (although see [28]), despite the heightened uncertainty inherent to emerging pathogens.

Here, we evaluate the potential for an ensemble of models that span uncertainties in pathogen natural history, but share a common core structure, to forecast the dynamics of an emerging pathogen. We do so in the context of the 2015-2016 Zika epidemic in Colombia, which was well-characterized epidemiologically (Fig. 1) [29] and involved potentially consequential uncertainties about: i) the role of human mobility in facilitating spread across the country [30], ii) the relationship between environmental conditions and transmission of this mosquito-borne virus [6, 12], and iii) the number of times the virus was introduced into the country [31]. In our retrospective analysis, we used data assimilation to update 16 distinct models throughout the epidemic period and assessed forecast performance of all models relative to an equally-weighted ensemble model. This allowed us to quantify the contribution of variants of each of the three aforementioned uncertainties to model performance during different phases of the epidemic. In doing so, we sought to not only assess the performance of the ensemble model relative to individual models, but also to learn about features of individual models that may be associated with improved forecast accuracy over the course of an epidemic.

## 2 Results

### 2.1 General forecast performance

Before any data assimilation had occurred, our 16 models (See Table 1) initially forecasted very low incidence across most departments over the 60-week period of our analysis (Figs. 2 top row, S12). Even so, short-term forecasts over a four-week horizon were consistent with the still-low observed incidence at that time (Figs. S3 purple, S18). By the time twelve weeks of data had been assimilated into the models, forecasts over the 60-week period of our analysis were considerably higher than the initial forecasts and better aligned with the observed trajectory of the epidemic (Figs. 2 second row, S13). Over those first twelve weeks, model parameters changed modestly (Fig. S6) and correlations among parameters began to emerge (Figs. S7, S8, S9, S10). We observed a more substantial change in the proportion of individual stochastic realizations (i.e., particles) resulting in an epidemic, with those particles resulting in no epidemic being filtered out almost entirely by week 12 (Fig. S1). Because each particle retained its stochastic realization of past incidence across successive data assimilation periods, stochastic realizations of past incidence were inherited by particles much like parameter values. By week 24, many of the models correctly recognized that they were at or near the epidemic’s peak and forecasted a downward trajectory for the remainder of the 60-week period of our analysis (Figs. 2 third row, S26). The particle filtering algorithm replaced nearly half of the original particles by that point (Fig. S2), with the new particles consisting of stochastic realizations of past incidence selected through data assimilation and updated every four weeks with forward simulations based on either original or new parameter combinations. As the end of the 60-week period of our analysis was approached, parameter correlations continued to strengthen (Figs. S7, S8, S9, S10), our estimate of the reporting probability increased (Fig. S6), and only around 20% of the original particles remained (Fig. S1).

### 2.2 Model-specific forecast performance

To quantify the forecast performance of individual models over time, we used logarithmic scoring (hereafter, log scoring) to compare forecasts of cumulative incidence four weeks into the future to observed values at departmental and national levels. We assessed log scores once the first case was reported nationally for spatially coupled models (i.e., models with explicit human mobility), and once the first case was reported in each department for non-spatial models (i.e., models with no explicit human mobility). Log scores were generally high for spatially coupled models early in the epidemic, given that observed cases and forecasts were both low at that time (Fig. S18, a-c). By week 12, as cases were reported in more departments, the accuracy of forecasts from non-spatial models improved (Fig. S18 d onward). Forecast performance around the peak of the epidemic differed considerably across models and departments, with forecasts from non-spatial models being somewhat lower than observed incidence and forecasts from spatially coupled models being somewhat higher (Fig. S14, Fig. S18 f-j). Around the peak of the epidemic, forecasts from spatially coupled models generally had higher log scores in departments with lower incidence (e.g., Nariño). Later in the epidemic (weeks 40-56), some models continued to forecast higher incidence than observed in some departments, despite having passed the peak incidence of reported cases (Fig. S16). In particular, models that used the dynamic instead of the static formulation of the reproduction number (i.e., the temporal relationship between *R* and environmental drivers is dynamic instead of static) were more susceptible to this behavior (note lower log scores in “Rt” versus “R” models in Fig. S18 k-o), given that their forecasts were sensitive to seasonal changes in temperature and mosquito occurrence.

Next, we used these log scores in an expectation-maximization (EM) optimization algorithm [32] to identify an optimal weighting of retrospective model-specific forecasts into an ensemble forecast (Fig. S24–S28) in each forecasting period (Fig. S17). To understand how model assumptions affected the inclusion of different models into the optimally weighted ensemble for each forecasting period, we summed and then normalized models’ ensemble weights across each class of assumption (Fig. 3). Over the course of the epidemic, changes in weighting for the assumptions about human mobility and spatiotemporal variation in transmission, but not about the number of virus introductions into the country, closely followed patterns in the trajectory of the national epidemic. Spatially coupled models had most or all of the weight in the early and late stages of the epidemic, while non-spatial models had most of the weight around the peak of the epidemic (Fig. 3 b). Although the non-spatial models somewhat under-predicted incidence in the middle stages of the epidemic, this was often to a lesser extent than the spatially coupled models’ over-predictions of incidence (Fig. S3). As a result, the EM algorithm achieved a balance between the over- and under-predictions of these different models.

The maximum ensemble weight in any forecasting period was 0.802, held by one model with a static *R*, two ZIKV introductions into the country, and CDR-informed human mobility 12 weeks after the first reported Zika case (Fig. S17). Combined, the two models with static *R* and CDR-informed human mobility data had the most instances of a non-zero ensemble weight (Fig. S17), occurring in 13 of 15 assimilation periods, with an average weight of 0.18. Around the peak of the epidemic, non-spatial models had the highest ensemble weight, reflecting the accuracy of short-term forecasts in some departments (e.g., Magdalena and Vaupés) and their overall accuracy in nationally-aggregated forecasts (Fig. S11). Near the end of the epidemic, the ensemble weight for models with a static *R* (Fig. 3 c) increased as their forecasts more closely matched the downturn of incidence later in the epidemic relative to models with dynamic *R* (Fig. S20). This was likely the result of mosquito occurrence probability and temperature becoming more favorable for transmission in many departments later in the epidemic (Fig. S21–S22), causing the dynamic *R* models to forecast a late resurgence in Zika incidence.

### 2.3 Target-oriented forecast performance

Short-term changes in incidence are an important target of infectious disease forecasting, but there are other targets of potentially greater significance to public health decision making. To explore these, we evaluated the ability of the 16 models—and an evenly weighted ensemble—to forecast three targets at the department level: peak incidence, week of peak incidence, and onset week, which we defined as the week by which ten cases were first reported. We evaluated models based on log scores of these targets. Summing log scores across departments to allow for comparisons across different forecasting periods (Fig. 4), we found that, on average, the ensemble model outperformed every individual model for all three forecasting targets (indicated by the ensemble model’s location on the y-axis). Early in the epidemic, spatially coupled models with a static *R* performed only slightly better (up to 1%) than the equally-weighted ensemble (Fig. 4). For the remainder of the epidemic, the equally-weighted ensemble model outperformed all individual models (Fig. 4). By summing log scores across forecasting periods to allow for comparisons across departments (Fig. 5), we found that some individual models outperformed the ensemble model in forecasting the peak incidence and the week of peak incidence. In departments on the Caribbean Coast that experienced intermediate epidemic sizes (e.g., Antioquia, Sucre, Atlántico), spatially coupled models with a static *R* outperformed the ensemble model at forecasting the peak week by about 10% (Fig. 5 A). At those same locations, the equally-weighted ensemble performed better than or similar to those same models at forecasting peak incidence and onset week (Fig. 5 b-c). Over forecasting periods and departments, the non-spatial models consistently had lower average forecast scores than the spatially coupled models (indicated by their location on the y-axis in Figs. 4-5). This trend appeared because initial forecasts from non-spatial models were not updated until the first case appeared in each department, while initial forecasts from spatially coupled models were updated when the first case appeared in the country.

## 3 Discussion

Results from the general forecast performance analysis demonstrated that once we began assimilating data into models, forecasts rapidly became more accurate. Models were initialized with a wide range of parameter values [33], with many initial parameter combinations producing unrealistic forecast trajectories, but after only four assimilation periods (12 weeks), nearly 100% of those parameters that produced zero infections were dropped. Similar to other retrospective forecast analyses [16, 34], as more data were assimilated into the models over time, the model fits and forecasts generally became more closely aligned with temporal trends in the data. This was because the particle filter allowed model parameters to continually adapt to noisy data [35]. There were still some exceptions where the particle filter could not fully compensate for shortcomings of the transmission model, such as the drastic underestimates of incidence in departments with sub-optimal conditions for transmission (e.g., static *R* model in Risaralda in Fig. S20). At the same time, the broader suite of models buffered against shortcomings of any single transmission model.

In the model-specific forecast performance analysis, we identified clear temporal trends related to when models with a static *R* versus a dynamic *R* should be included in an optimally-weighted ensemble. In contrast, there were no clear temporal trends in weighting regarding the assumption about the number of times the virus was introduced into the country, potentially reflecting that, even with multiple introductions, most transmission may have been linked to a single introduction [31]. Models with a dynamic *R* had higher weights in the ensemble at the peak of the epidemic, while models with a static *R* had higher weights at the beginning and end of the epidemic. This was likely due to temporal shifts in temperature and mosquito occurrence probabilities dominating forecasts of transmission potential for the models with a dynamic *R*. For example, in the latter parts of the epidemic when reported cases were declining, mosquito conditions and temperature became more suitable for transmission in many departments. This caused models with a dynamic *R* to forecast a resurgence in ZIKV transmission in those departments, while models with a static *R* forecasted a downturn in incidence that was more similar to the observed dynamics. This finding that susceptible depletion may have been more influential than temporal variation in environmental conditions for the Zika epidemic is consistent with recent findings for SARS-CoV-2 [36].

Through the model-specific forecast performance analysis, we also found that spatially coupled models had higher ensemble weights in the early and late stages of the epidemic, while non-spatial models had higher weights around the peak of the epidemic. The importance of including spatially coupled models in the optimally-weighted ensemble early in the epidemic supports the general notion that human mobility may be particularly predictive of pathogen spread early in an epidemic [7, 30, 37, 38]. In part, temporal shifts in weighting around the peak of the epidemic were due to more accurate nationally-aggregated forecasts from the non-spatial models. This result was consistent with a previous modeling analysis of the invasion of chikungunya virus in Colombia, which showed that models fitted independently to sub-national time series recreated national-level patterns well when aggregated [39]. A shift in ensemble weights toward non-spatial models around the peak of the epidemic was also due to less accurate department-level forecasts from the spatially coupled models. At that point in the epidemic, prevalence was at its highest, which means that we would expect local epidemics to be more endogenously driven and less sensitive to pathogen introductions across departments.

In the target-oriented forecast performance analysis, we found that the equally-weighted ensemble generally outperformed individual models, with a few key exceptions. In the months leading up to the peak of the epidemic, spatially coupled models with a static *R* had slightly, but consistently, higher forecast scores with respect to peak week and onset week. Like the model-specific analysis results, this result illustrates the importance of human mobility in facilitating the spread of an emerging pathogen across a landscape [30]. Individual models outperforming the equally-weighted ensemble model in the early phase of the epidemic is not wholly surprising given that non-spatial models were represented equally in that ensemble throughout the epidemic. Non-spatial models may be realistic when locations have self-sustaining epidemics, but they are not appropriate for capturing early-phase growth and its dependence on importations [40]. Another instance when individual models had higher forecast scores than the equally-weighted ensemble was with respect to peak week for spatially coupled models with a static *R* in departments along the Caribbean Coast. Compared to dynamic *R* models, the static *R* models more accurately forecasted peak week in these departments (e.g., Magdalena, Cesar, Sucre), as they did not forecast a late-stage resurgence in transmission. The equal weighting of the dynamic *R* models in the ensemble therefore led to overall lower peak week forecast scores for the ensemble relative to static *R* models. Still, our results indicating that an equally-weighted ensemble mostly outperformed individual models adds to the growing literature highlighting the importance of ensemble methods in epidemiological forecasting [16, 17, 27, 41, 42].

We considered both optimally- and equally-weighted ensembles and found that the equally-weighted ensemble generally provided better forecasts of the observed data. With the optimally-weighted ensemble, which we updated at each data assimilation period, we found that model weights changed over the course of the epidemic. Although this is intuitive given the changing nature of an emerging epidemic through time [8], it may be problematic in practice. It is almost as if the ensemble weights require their own forecast. On the one hand, promising new advances in ensemble modeling [27, 41]—such as adaptive stacking for seasonal influenza forecasting [43]—are being used to address this issue of identifying optimal, adaptive weights without training to historical data. On the other hand, in an emerging pathogen context, establishing optimal model weights by way of model fitting and forecast generation is often reliant on incidence data that is highly variable, given the delayed nature of data reporting [44]. In this context, our results demonstrate that it is preferable to use an equally-weighted ensemble to buffer against uncertainty in optimal ensemble weights. As is also being demonstrated in forecasts of COVID-19, equally weighted ensembles can provide accurate forecasts [26, 45] and may be a better reflection of the considerable structural uncertainty inherent to models of emerging pathogen transmission [24].

A few limitations of our study should be noted. First, while an equally-weighted ensemble approach allowed us to consider contributions of several alternative model assumptions, there was high uncertainty associated with these forecasts. Although the median values for the ensemble forecast were often closely aligned with reported cases, uncertainty around those estimates spanned orders of magnitude at times. Potential end-users of these types of forecasts could consider high levels of uncertainty to be problematic and not useful for decision-making [46]. In the future, ensemble approaches aimed at increasing precision and reducing uncertainty [47, 27] could be used in conjunction with equally-weighted ensembles. Second, a related limitation of our analysis was that we considered alternative models across only three assumptions. With ZIKV transmission, there are additional structural uncertainties that could be considered, such as the role of sexual transmission [48]. In real-time applications of our or other Zika forecasting models, it could be worthwhile to explore these types of ZIKV-specific structural uncertainties. Third, in this analysis we did not explicitly consider delays in reporting that likely would have occurred had these forecasts been generated in real time [49]. In that context, temporally aggregating data to a wider interval (e.g., at 2-week intervals rather than 1-week intervals) could potentially help mitigate the effects of reporting delays to some extent. Fourth, we conducted this analysis at the departmental level instead of this municipality level, which could obfuscate meaningful differences across regions of a single department [29]. In future work, it would be useful to test and assess our forecasting algorithm and outputs at different spatial scales [39].

As the world is reminded of on a daily basis with COVID-19, pathogen emergence is an ongoing phenomenon that will continue to pose threats in the future [50]. A better understanding of an emerging pathogen’s natural history could help to reduce pathogen-specific structural uncertainties, but these insights may not always occur in time to inform model development for real-time forecasting [8]. Our results highlight important trade-offs between individual and ensemble models in this context. Specifically, we demonstrated that an equally-weighted ensemble forecast was almost always more accurate than individual models. Instances in which individual models were better than the ensemble, or greatly improved the ensemble, also provided insight. For example, incorporating human mobility into models improved forecasts in the early and late phases of an epidemic, which underscores the importance of making aggregated mobility data available early in an epidemic [51]. The range of outcomes resulting from alternative modeling assumptions in model-specific forecasts demonstrates why it will continue to be important to address structural uncertainties in forecasting models in the future.

## 4 Materials and methods

### 4.1 Data

We used passive mandatory surveillance data for reported cases of Zika, from the National Surveillance System (Sivigila) at the first administrative level (31 mainland departments) in Colombia. To span the beginning, peak, and tail of the epidemic in Colombia, we focused on the 60-week period between August 9, 2015 and October 1, 2016. We used the version of these data collated by Siraj et al. [29], as well as modeled values of weekly average temperature and estimates of department-level population from that data set. For some models, we worked with monthly estimates of mosquito occurrence probability (i.e., dynamic *R* models) obtained from Bogoch et al. [52], and for others we worked with time-averaged estimates (i.e., static *R* models) from Kraemer et al. [53].

For models that relied on cell phone data to describe human mobility across departments, we first derived daily mobility matrices at the municipality level in Colombia from February 2015 to August 2015. These were obtained from anonymized and aggregated call detail records (CDRs). As these data did not align with the time frame of the epidemic, we computed a representative mobility matrix by summing all available CDRs within the municipalities of each department and normalizing them to sum to one relative to the sum of CDRs originating from that department. In five departments (Amazonas, Cudinamarca, Guainía, Vaupés, Vichada), the proportion of CDRs linking callers within the same department was below 60%. Given that this implied an unrealistically low proportion of time spent within an individual’s department of residence, we interpreted those values as idiosyncrasies of the data and not representative of human mobility [54]. Thus, for those five departments, we replaced the proportion of within-department CDRs with the mean proportion of within-department CDRs from all other departments. We then re-normalized the number of CDRs originating from each department in our mobility matrix to sum to one.

### 4.2 Summary of models

We considered a suite of 16 models that spanned all combinations of four assumptions about human mobility across Colombia’s 31 mainland departments, two assumptions about the relationship between environmental conditions and the reproduction number (*R*), and two assumptions about how many times Zika virus was introduced to Colombia (Table 1). All models were stochastic and discrete in time [55], accounting for overlapping pathogen generations across up to five weeks of the generation interval distribution of ZIKV [56]. Twelve of 16 models further allowed for spatial connectivity across departments [55]. There were two steps in the transmission process: movement of infected people among departments and local transmission within departments.

Among departments, we simulated the movement of infected individuals using a spatial connectivity matrix (**H**), the *d*^{th} column of which corresponds to the proportion of time spent by residents of department *d* in all departments . Using this matrix, we redistributed infections in department *d* in week *t* (*I*_{d,t}) across as a multinomial random variable,
where the first and second arguments represent the number of trials and the probabilities of the outcomes, respectively. We assumed that mobility patterns were the same regardless of infection status.

Within each department, we defined a variable representing the effective number of infections that could have generated new infections in week as
where is the probability that the generation interval is *j* weeks [57]. The relationship between and the expected number of new local infections in week *t* (*I*_{d,t}) follows
where *β*_{d,t} is the transmission coefficient, *N*_{d} is the total population, and *S*_{d,t} is the total susceptible population prior to local transmission in week *t*. We accounted for the role of stochasticity in transmission by using the stochastic analogue of Eqn. 3, such that
where the first and second arguments are the mean and dispersion parameters, respectively [55].

To allow for comparison of the model’s simulations of infections (*I*_{d,t}) with empirical data on reported cases (*y*_{d,t}), we applied a reporting probability (*ρ*) to simulated infections to obtain simulated cases (*C*_{d,t}), such that *C*_{d,t} ∼ binomial(*I*_{d,t}, *ρ*). Using this, we defined the contribution to the overall log-likelihood of the model and its parameters from a given department *d* and week *t* as
where *ϕ* is a dispersion parameter that we estimated to account for variability in case reporting beyond that captured by *ρ*. Shifting *y*_{d,t} and *C*_{d,t} by one in eq. (5) was intended to safeguard against *𝓁* _{d,t} being undefined in situations where *C*_{d,t} = 0.

#### 4.2.1 Assumptions about human mobility

We allowed for spatial coupling across departments in 12 of 16 models. In these models, we informed **H** in three alternative ways: i) with mobility data extracted from mobile phone CDRs, ii) with a gravity model, or iii) with a radiation model (Fig. 1d-f). For the gravity model, we used parameters previously fitted to CDRs from Spain and validated in West Africa [11]. For the radiation model, we calculated human mobility fluxes according to the standard formulation of this model [58], which depends only on the geographic distribution of population. In four of 16 models, we assumed that departments were spatially uncoupled (Table 1), such that each department was modeled individually with its own set of parameters. In those models, each department’s epidemic was seeded independently with its own set of imported infections. Further details about the spatially uncoupled models can be found in the Supplemental Text.

#### 4.2.2 Assumptions about environmental drivers of transmission

We parameterized the transmission coefficient, *β*_{d,t}, based on a description of the reproduction number, *R*_{d,t}, appropriate to environmental drivers for department *d* and time *t*. We considered two alternative formulations of *R*_{d,t} that were informed by data that were available prior to the first reported case of Zika in Colombia. Both formulations were defined such that
where *k* is a scalar that we estimated over the course of the epidemic to account for the unknown magnitude of ZIKV transmission in Colombia. In addition to the summary below, further details about these formulations of *R*_{d,t} are provided in the Supplementary Methods.

The formulation of *β*_{d,t} that we refer to as “dynamic” is defined at each time *t* in response to average temperature at that time (*T*_{d,t}) and mosquito occurrence probability at that time (*OP*_{d,t}). This relationship can be expressed generically as
where *c, ψ, α*, and *v* are parameters governing the relationship among *T*_{d,t}, *OP*_{d,t}, and . We informed the component of related to mosquito density with monthly estimates of *OP*_{d,t}, which derive from geostatistical modeling of *Aedes aegypti* occurrence records globally [52]. Other components of , which include several temperature-dependent transmission parameters, were informed by laboratory estimates [12]. Given that this formulation of was not validated against field data prior to the Zika epidemic in Colombia, we estimated values of *c, ψ, α*, and *v* over the course of the epidemic.

The formulation of *β*_{d,t} that we refer to as “static” is defined as a time-averaged value that is constant across all times *t*. Temporal variation in *T*_{d,t} is still accounted for, but its time-varying effect on *R*_{d,t} is averaged out over all times to result in a temporally constant *R*_{d}. Mosquito occurrence probability is also incorporated through a temporally constant value (*OP*_{d}) [53]. The relationship among these variables can be expressed generically as
where *PPP*_{d} is purchasing power parity in department *d* [59]. This input is an economic index that was intended to serve as a proxy for spatial variation in conditions that could affect exposure to mosquito biting, such as housing quality or air conditioning use [6]. Given that this formulation of was informed by data from outbreaks of Zika and chikungunya prior to the Zika epidemic in Colombia, we did not estimate its underlying parameters over the course of the epidemic in Colombia.

#### 4.2.3 Assumptions about introduction events

Although many ZIKV infections were likely imported into Colombia throughout the epidemic, we assumed that ZIKV introductions into either one or two departments drove the establishment of ZIKV in Colombia [31]. Under the two different scenarios, there was either one introduction event into one department or there were two independent introduction events into two randomly drawn departments. For each particle, the initial number of imported infections was seeded randomly between one and five in a single week, the timing of which was estimated as a parameter. Following the initial introduction(s), we assumed that ZIKV transmission was driven by a combination of movement of infected people among departments and local transmission within departments, as specified by each model.

### 4.3 Data assimilation and forecasting

Beginning with the time of the first reported case in Colombia, we assimilated new data, updated parameter estimates, and generated forecasts every four weeks, consistent with the four-week frequency used by Johansson et al. in an evaluation of dengue forecasts [16]. We specified 20,000 initial particle values , indexed by *n*, by drawing independent samples from prior distributions of each parameter [60] (see Supplemental Methods). At each assimilation period, we normalized log-likelihoods summed across departments over the preceding four weeks to generate particle weights,
Proportional to these weights, we sampled 18,000 particles from time *t* with replacement and used them as particles at the next data assimilation step four weeks later. In doing so, information including the initial prior assumptions and the likelihoods at each four-week period was assimilated into the model sequentially over time. Given that particle filtering algorithms are susceptible to particle drift—or the convergence of particles onto very few states through iterative re-sampling [33]—we also generated 2,000 new particles at each data assimilation step. To do so, we drew random samples of model parameters from a multivariate normal distribution with parameter means and covariances fitted to the resampled 18,000 particles. Whereas the 18,000 resampled particles already included simulated values of *I*_{d,t} and *C*_{d,t} through time *t*, the 2,000 new particles did not and so we informed initial conditions of with draws from for those particles at the time they were *d,t d,ι*:*t* created. Together, the 18,000 resampled particles and the 2,000 new particles constituted the set of particles used as input for the next data assimilation step , where boldface indicates a set of particles. We also used this new set of particles as the basis for forecasts made at time *t*, which simply involved simulating forward a single realization of the model for each particle.

### 4.4 Evaluating forecast performance

At each of the 15 time points at which we performed data assimilation through the 60-week forecasting period, we created an ensemble forecast that evenly weighted contributions from each of the 16 models [45]. To populate this ensemble, we specified 20,000 total samples, with 1,250 samples from each model. We assessed the real-time performance of individual and ensemble forecasts using log scores, which are forecast scoring rules that assess both the precision and accuracy of forecasts [61]. For a specific forecasting target, *z*, and model, *m*, the log score is defined as log*f*_{m}(*z*^{*}|**x**), where *f*_{m}(*z*|**x**) is the predicted density conditioned on the data, **x**, and *z*^{*} is the empirical value of the target *Z* [16].

We assessed real-time forecast performance using log scores for departmental and national incidence over each four-week assimilation period. Following [17], we used an expectation-maximization algorithm to generate ensemble weights for each model in each assimilation period. For each model, we computed 32 log scores (i.e., one for each department and one nationally). To compute the ensemble weight for a given model feature, such as the static *R* assumption, we summed the weights of all models with that feature.

We assessed retrospective forecast performance using log scores for three forecasting targets: timing of peak week, incidence at peak week, and onset week, which we defined as the week by which ten cumulative cases occurred. These choices were motivated by forecasting assessments for influenza and dengue [16, 17, 18, 62] and deemed applicable to public health objectives for forecasting an emerging pathogen such as Zika. For peak week and onset week, we used modified log scores [18], such that predictions within two weeks of the correct week were considered accurate. We evaluated a total of 7,680 log scores, reflecting three targets for each of 16 models in each of 31 departments and at each of 15 time points at which data assimilation occurred.

As log scores only yield a relative measure of model performance, we used forecasting scores [18] as a way to retrospectively compare forecast performance. Whereas log scores are preferable for comparing performance across models on the same data, forecasting scores are preferable for comparing forecast performance across data composed of different locations and times. A forecasting score is defined simply as the exponential of the average log score, where the latter reflects an average over one or more indices, such as models, time points, targets, or locations.

## Data availability

The mobile phone data set used in this study is proprietary and subject to strict privacy regulations. The access to this data set was granted after reaching a non-disclosure agreement with the proprietor, who anonymized and aggregated the original data before giving access to the authors. The mobile phone is available on request after negotiation of a non-disclosure agreement with the company. The contact person is Enrique Frías-Martínez (efm{at}tid.es). Epidemiological, meteorological, and demographic data are available from Siraj et al. [29] and additionally available on https://github.com/roidtman/eid_ensemble_forecasting.

## Author contributions

RJO, EO, MUGK, CMB, MAJ, CAM, RCR, IR-B, MG-H, and TAP conceptualized the study; RJO, EO, MUGK, CAC-O, EC-R, AM-C, PC, LER, VC, PA, GE, JHH, SCH, ASS, EF-M, and MG-H provided and / or processed data; RJO, EO, MUGK, CA-O, EC-R, SM-C, PC, LER, VC, PA, EF-M, MG-H, and TAP participated in biweekly meetings to provide feedback on research; RJO, EO, MUGK, MG-H, and TAP developed the model and wrote the first draft of the manuscript; RJO, EO, MUGK, JHH, and SCH analyzed data; EO, MG-H, and TAP supervised the research; all authors reviewed the manuscript.

## Data Availability

The mobile phone data set used in this study is proprietary and subject to strict privacy regulations. The access to this data set was granted after reaching a non-disclosure agreement with the proprietor, who anonymized and aggregated the original data before giving access to the authors. The mobile phone is available on request after negotiation of a non-disclosure agreement with the company. The contact person is Enrique Frias-Martinez (efm{at}tid.es). Epidemiological, meteorological, and demographic data are available from Siraj et al.2018 and additionally available on https://github.com/roidtman/eid_ensemble_forecasting.

## Supplemental methods

### Priors on parameters common to all models

In each model that we considered, we iteratively estimated the reporting probability (*ρ*), dispersion parameter of the negative binomial distribution (*ϕ*), *R* multiplier (*k*), and the timing of the first infection (*ι*). When possible, we leveraged previous estimates of parameters to inform prior distributions for model parameters. In some instances, we used dengue-specific parameter estimates as priors for Zika-specific parameters [1].

For the reporting probability (*ρ*), we assumed a mean of 0.2 and a variance of 0.05. Although this mean and variance are not directly informed by an empirical study of Zika reporting, they are in line with what we would expect for dengue [2, 3] and Zika [4]. We assumed that *ρ* was a beta random variable and, using the method of moments, we specified a prior distribution for *ρ* such that
which resulted in mean and variance consistent with our prior assumptions.

The dispersion parameter of the negative binomial distribution accounts for variability in case reporting be-yond that captured by *ρ*. Lower values of the dispersion parameter indicate overdispersion, such that variability in cases cannot be explained by a single rate of case incidence, as would be generated by a Poisson distribution with rate *ρI*_{d,t}. Given the likelihood of variation in the reporting probability over the course of the epidemic [5] and across departments [6], we specified a uniform prior for *ϕ*,
which resulted in a level of overdispersion in reporting equal to at least a geometric distribution (*ϕ* = 1) but potentially greater (*ϕ* < 1).

To relate the transmission coefficient (*β*_{d,t}) to environmentally driven descriptions of the reproduction number (*R*_{d,t}), we used a multiplier (*k*) that applied to both the static and dynamic models of *R*. We specified a gamma prior distribution,
the parameters of which were chosen by moment matching to result in a mean of three and a variance of two.

To seed the Zika epidemic in Colombia, we assumed that undetected transmission could have been occurring at any time throughout the first 34 weeks of 2015. For reference, the first case was not reported until August 9, 2015 (week 35). Thus, we specified a uniform prior for the date of the initial introduction (*ι*_{1}) between weeks 1 and 34 of 2015. We assumed that the location of the first introduction (*l*_{1}) could have been any of the departments in Colombia with equal prior probability.

### Assumptions about human mobility

#### Spatially coupled models

For the spatially coupled models, we assumed that transmission was coupled across departments by human mobility. In these models, we informed the spatial connectivity matrix in three ways: i) aggregated mobility patterns extracted from mobile phone call detail records (CDRs), ii) a gravity model, or iii) a radiation model. In the gravity and radiation models, *T*_{i,j} is defined as the total number of trips from department *i* to department *j*. This takes the form in the gravity model and in the radiation model, where *N*_{i} and *N*_{j} are population sizes at the origin *i* and destination is the distance between *i* and *j, s*_{i,j} is the total population within radius *d*_{i,j} from *i*, and *T*_{i} is the total number of individuals who make a trip. The parameters *c, α, β*, and *γ* were fitted to CDRs from Spain and validated in West Africa [7]. All three mobility models were row-normalized to correspond to the proportion of time spent by residents of department *i* in department *j*.

#### Spatially uncoupled models

For the spatially uncoupled models, we assumed that each department’s epidemic occurred independently of all other departments. We used the same prior distributions as described above for *ρ, ϕ, k* for each department. Under this assumption, we did not include a parameter for the location of the initial introduction into Colombia, as we instead were concerned with the initial introduction into that department. It was still necessary to specify the timing of that introduction and, for models that considered it, the timing of a second introduction. Following the rationale that undetected transmission could have occurred for up to 34 weeks prior to the first reported case in a given department, we assumed an even prior for the timing of the introduction(s) into a given department.

### Assumptions about environmental drivers of transmission

#### Dynamic model

The environmentally driven component of *β*_{d,t} for the dynamic model, , was defined as the product of two functions: one that depends on *T*_{d,t} and one that depends on *OP*_{d,t}.

The function of *OP*_{d,t} was defined as *c*(−log(1 − *OP*_{d,t})), which converts occurrence probability into an expectation of mosquito abundance [8]. The constant *c* scales this expectation to account for uncertainty in its magnitude, which we estimated and assumed to have a gamma-distributed prior with a shape parameter of 16 and a rate parameter of 0.07. This choice of prior resulted in average being equal to one when evaluated at the mean values of the prior for *c*.

The function of *T*_{d,t} was based on a temperature-dependent description of the basic reproduction number by Mordecai et al. [9]. Specifically, we used the version of that model based on Briere functions for parameters that were not otherwise accounted for in estimates of mosquito occurrence probability [10]. Those parameters include the mosquito biting rate (*a*), mosquito-to-human transmission probability (*b*), human-to-mosquito transmission probability (*c*), average adult mosquito lifespan (*lf*), and parasite development rate (*PDR*). Those functions combine to form the temperature-dependent component of ,
We did not include other parameters from Mordecai et al. [9] related to mosquitoes, such as immature development rate and egg-to-adult survival rate, as the effects of those parameters were accounted for in estimates of *OP*_{d,t} from Kraemer et al. [10].

To reduce the number of parameters that we needed to estimate, we worked with a simplified description of the temperature curves produced by eq. (13) (Fig. S5). To do so, we took random draws from the posterior distribution of parameters from Mordecai et al. [9], computed functions of temperature according to eq. (13), and fitted parameters of a skew normal distribution to those curves by least squares. The skew normal distribution has three parameters—location (*ψ*), scale (*α*), shape (*υ*)—that together control the mean, variance, and skew of the distribution, which is sufficient to approximate the uncertainty in posterior predictions of the temperature curves described by eq. (13). We then took the mean and covariance across those estimates of *ψ, α*, and *υ* to describe their variation with a multivariate normal distribution, which was the prior distribution we used for those parameters at the first step of our particle filter.

#### Static model

The environmentally driven component of *β*_{d} for the dynamic model, , was defined as the product of three functions: one that depends on one that depends on *OP*_{d}, and one that depends on *PPP*_{d}. We used values of at the 5 km x 5 km grid cell level as calculated by Perkins et al. [8]. To aggregate them to the department level, we took a population-weighted mean of across 5 km x 5 km grid cells within each department. Although we made no other modifications to the calculation of we summarize the methodology used by Perkins et al. [8] in the interest of comparability with the dynamic model.

The function of *OP*_{d} was defined as —log(1 − *OP*_{d})), which was the same procedure used to convert occurrence probability into an expectation of mosquito abundance as used in the dynamic model. For the static model, however, we followed Perkins et al. [8] and relied on a description of occurrence probability that was not defined on a time-varying basis [10].

Rather than estimate a scaling parameter like *c* in the dynamic model, we relied on a scaling parameter defined as a function of *PPP*_{d} by Perkins et al. [8]. This function took the form of a monotonically decreasing, cubic spline function estimated with a shape-constrained additive model. The data that informed this estimate of both the relationship with *PPP*_{d} and the magnitude of were outbreak sizes of 12 chikungunya outbreaks and one Zika outbreak [8] that occurred prior to the ZIKV invasion of the Americas.

The function of was based on a temperature-dependent description of the basic reproduction number that includes temperature-dependent descriptions of mosquito mortality (*µ*) [11] and the extrinsic incubation period (*n*) [12]. Those functions contribute to an expression for monthly contributions to *R*_{d,t},
along with constants that represent the mosquito-to-human transmission probability (*b*), human-to-mosquito transmission probability (*c*), mosquito biting rate (*a*), and rate of recovery from human infection (*r*). For each department *d*, we took the average of the six largest values of eq. (14) as the contribution of to .

### Assumptions about introduction events

#### One-introduction models

Each of the one-introduction models assumed infections were seeded at one point in time (*ι*_{1}) in one location (*l*_{1}), as specified in the general model parameters section of the Supplementary Methods.

#### Two-introduction models

Each of the two-introduction models made similar assumptions about *ι*_{1} and *l*_{1} for the first introduction. For these models, we additionally specified the timing (*ι*_{2}) and location (*l*_{2}) of the second introduction events, for which we assumed even priors. In reality, there were likely more than only one or two introductions throughout the epidemic, but genomic epidemiology suggests the majority of local transmission resulted from only one or two introductions [13].

## Supplemental tables

## Supplemental figures

## Acknowledgements

The authors would like to thank Clara Palau Montava for help with managing the early stages of this project. The authors would additionally like to thank the UNICEF-Colombia Representative, Aida Oliver Arostegui, INS Director, Martha Lucia Ospina Martinez, and the past and present Ministers of the Ministry of Health, Juan Pablo Uribe Restrepo and Fernado Ruiz Gomez.

RJO acknowledges support from an Eck Institute for Global Health Fellowship, GLOBES grant, Arthur J. Schmitt Fellowship, and the UNICEF Office of Innovation. MUGK is supported by The Branco Weiss Fellowship - Society in Science, administered by the ETH Zurich and acknowledges funding from the Oxford Martin School and the European Union Horizon 2020 project MOOD (#874850). SCH is supported by the Wellcome Trust (220414/Z/20/Z).

## References

- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵