Abstract
Year-to-year emergence of West Nile virus has been sporadic and notoriously hard to predict. In Europe, 2018 saw a dramatic increase in the number of cases and locations affected. In this work, we demonstrate a novel method for predicting outbreaks and understanding what drives them. This method creates a simple model for each region that directly explains how each variable affects risk. Behind the scenes, each local explanation model is produced by a state-of-the-art AI engine. This engine unpacks and restructures output from an XGBoost machine learning ensemble. XGBoost, well-known for its predictive accuracy, has always been considered a “black box” system. Not any more. With only minimal data curation and no “tuning”, our model predicted where the 2018 outbreak would occur with an AUC of 97%. This model was trained using data from 2010-2016 that reflected many domains of knowledge. Climate, sociodemographic, economic, and biodiversity data were all included. Our model furthermore explained the specific drivers of the 2018 outbreak for each affected region. These effect predictions were found to be consistent with the research literature in terms of priority, direction, magnitude, and size of effect. Aggregation and statistical analysis of local effects revealed strong cross-scale interactions. From this, we concluded that the 2018 outbreak was driven by large-scale climatic anomalies enhancing the local effect of mosquito vectors. We also identified substantial areas across Europe at risk for sudden outbreak, similar to that experienced in 2018. Taken as a whole, these findings highlight the role of climate in the emergence and transmission of West Nile virus. Furthermore, they demonstrate the crucial role that the emerging “eXplainable AI” (XAI) paradigm will have in predicting and controlling disease.
Highlights
This study shows that the extraordinary 2018 West Nile virus outbreak in Europe was likely due to cross-scale effects between large climatic systems and local mosquito vector populations
We found that large areas in Europe are similarly vulnerable to large and sudden outbreaks
These findings were powered by a novel AI-driven engine for deriving locally precise models; this explanatory engine was supported by a high-performance XGBoost model (97% AUC).
AI-driven local models allow for high-power statistical analyses, including: hypothesis testing,, standardized effect size calculation, multivariate clustering, and tertiary inferential modeling
Background
Predicting West Nile virus in context
West Nile virus (WNV) is a versatile pathogen that is amplified in the environment via spread among local animal populations, most notably birds. Mosquitoes serve as the primary vector and facilitate transmission between reservoir hosts and on to humans. Many other potential vectors, amplification hosts, and zoonotic transmission routes have been implicated1,2. Infected humans also directly contribute to disease spread, via several confirmed routes3. In Europe, 2010 seemed to signal a new epidemiological phase: a new more virulent form of the previously mild lineage 2 variant emerged, supplanting the original in terms of morbidty and mortality, and geospatial spread4. And this expansive trend has continued into 2018 and beyond5. 2018 saw a 7.2 fold increase in reported cases and a markedly expanded geographic range (see Figure 1). 2019 then saw the first autochthonous human cases in Germany leading to concerns of sudden and accelerating spread6. Even mild manifestations of infection are liable to be misdiagnosed or missed7, resulting in wide-scale underreporting8. Suggestions of potential late onset functional and/or cognitive deficits among the seemingly healthy9, have furthermore fueled concerns. Recent seroprevalence studies suggest human infection rates between 1-3%, and as high as 6%, in in parts Europe10–12. And with seropositivity as high as 90% in endemic sub-Saharan Africa13, the potential for expansion is very real.
The spread of WNV among humans depends on a wide range of determinants, including climatic features, environment, and sociodemographic factors14. However, efforts to quantify these relationships are often confounded by complex interactions and time-varying associations15. To resolve such issues, some have suggested higher-dimensional models effected at local scales16–18.
In this paper, we demonstrate one such solution. Our solution is powered by a novel AI-based engine, the SHAP (SHaply Additive Explanation) framework19. SHAP uses the outputs of a popular classification tree ensemble, XGBoost21, to generate local explanatory models for each individual case20. These models are statistically robust. They furthermore possess many favorable properties that allow for robust, hypothesis-driven summarization and analysis. Our aim was therefore to evaluate this solution in the context of geospatial modeling and prediction of infectious disease. To this end, the extraordinary WNV outbreak of 2018 in Europe was selected as an ideal test case.
Results
The 2018 WNV outbreak season was extraordinary in many ways. The number of regions reporting cases as well as the proportion of regions affected per country was substantially higher than in previous years (Figure 1a). Beyond that, many regions affected in 2018 had a history of WNV outbreak, but emergence has been overall sporadic (Figure 1b). Overall, observed WNV range was substantially higher in 2018 compared to control (134 regions vs 44.9 mean regions during the 2010-2016 training period; see Figure 1c), with 25.4% previously naive to WNV.
Out-of-sample predictive results were found to be exceptional
Our XGBoost model delivered an AUC of 0.97. This is particularly notable given the substantial proportion of previously naive regions affected in 2018 and overall sporadic nature of occurrence during the training period (2010-2016). Refer to Figure 1. Sensitivity of the threshold-optimized model was .89 and balanced accuracy was .92, which is noteworthy given the overall sparsity of positive outbreak regions.
Observed feature effects dependent on geospatial scale
The nature of ensemble models precludes direct assessment of feature effects. XGBoost is no different. Feature importance can however be assessed indirectly. Refer to Methods and Figure 3. “Gain” is the amount by which global predictive power decreases upon removal of a given feature from the model. The feature describing year-to-year correlation, “Recent History of WNV [past year]”, was found to contribute the most by this metric (6.3%). This was followed closely by “maximum temperature of the warmest month” (5.6%). No other features achieved similar levels of imputed global importance. Substantial differences were observed between gain and two other importance criteria, “frequency” and “coverage”. For example, our largest contributor to predictive output, year-to-year autocorrelation, was found to impact only a small minority of ensemble outcomes (low “frequency”). And the features found to be most consistently relevant case to case (high “coverage”) were only moderately impactful in terms of gain. Considered simultaneously, few features rank highly in terms of all three criteria (Figure 3, first quadrant) – maximum temperature of the warmest month, vapor pressure in the second quarter and maximum temperature in the third quarter – all climatic. Features associated with hosts, vectors, and spatial covariates were found to be relevant only with respect to a limited set of regions (low coverage and frequency; third quadrant of Figure 3). Sociodemographic, environmental, and economic features perform marginally better in terms of coverage, indicating effect spanning multiple regions but far from globally consistent.
The localized nature of the SHAP output allows for individual feature effects to be independently assessed for each case (Figure 4a). This also allows for feature-wise model decomposition and assessment of aggregate effects specific to each feature class (Figure 4b). Here, we confirm our previous observations regarding the aggregate effects of each feature class. Effects associated with climate are dominant and of about equal combined weight to all remaining features. Mapping furthermore confirms the differential and additives effect of each feature class with respect to geospatial outbreak risk contribution (Figure S1). The mean magnitude of effect can furthermore be calculated for each individual feature, which is broadly equivalent to gain in the original model. We refer to this as the “SHAP score”. Based on this metric, the previously dominant autocorrelative feature was now removed to the fifteenth rank. Feature rankings otherwise remained broadly similar to those previously observed in the original XGBoost model (Figure S2), therefore confirming that the SHAP reanalysis effectively preserves the global importance of each feature, while ensuring relative consistency of effect variability. At the local level, substantial bimodality of effect was noted for the top climatic features, which is consistent with the literature. Methods to mitigate the potential impact of such scale-dependent effects across regions are explored in subsequent sections.
Case-wise stratification and hypothesis testing suggests mechanism driving 2018 outbreak
The statistical properties of the surrogate data model allow for segmentation and hypothesis testing, along with any other inferential analyses, including multivariate methods (see Figure S2). Here, this feature is exploited to conduct simple means testing (Figure 6) to determine which feature effects were stronger in the 2018 validation year compared to control (2010-2016). The feature strongly associated with increased outbreak risk in 2018 was vapor pressure (Q2), and it showed the largest relative increase by far in 2018. This coincided with a marked increase of the most commonly cited vectors, Culex pipiens and Culex modestus. In addition, the standardized effect of global climatic indices was also found to be substantially higher in 2018. This includes most notably, the North Atlantic Oscillation (NAO), variability of which has been associated with a broad range of infectious diseases in Europe. The protective effect of economic features, most particularly per person average income, was also found to be substantially attenuated in 2018 – considerably greater even than the effect size of climate.
Multivariate analysis suggests broad potential for expansion of WNV
A deterministic clustering procedure was applied the to stratify regions according to risk-based homogeneities (see Methods). We refer to these clusters as “epidemiologic risk zones” (ERZ). Seven deterministically independent ERZ were identified, four major and three idiosyncratic (Figure S7). Outbreak risk is strongly contraindicated in ERZ1 and ERZ2 due mostly to climate. Whereas ERZ5 corresponds with the region in which WNV is presently endemic. The present analysis therefore concludes by focusing on the ERZ most vulnerable to climate-driven changes in WNV outbreak risk, ERZ4. Findings suggest a region that is homogeneous with respect to increased outbreak risk driven by climate, environment, and host presence. Outbreak is presently contraindicated only due to insufficient presence of competent vectors (see Discussion), an assessment with potential to suddenly change given the underlying risk structure demonstrated in this analysis.
Prospective early warning model is similarly predictive and insightful
The explanatory model presented thus far has been used to identify associations and patterns indicative of increased WNV outbreak risk. However, it is not suitable for same-year prediction of outbreak risk, i.e. early warning. A derivative model was therefore constructed that included only those features measurable prior to the start of the WNV outbreak season. Despite the limited feature space, model performance remained statistically identical to the explanatory model (AUC of 96%, sensitivity of 87% and balanced accuracy of 91%; see Figure 2a). Geospatial effect distribution (Figure 7a) indicated several regions beyond those typically indicated for WNV (see Figure 1b) and those indicated by the exploratory model (see Figure 2b). This is consistent with the slightly lowered sensitivity but may also reflect unreported cases. Feature effect ranking, magnitude, and direction remained broadly consistent with the explanatory model. This was also true when considering aggregate effects in only the regions where outbreak was positively indicated (Figure 7b). Here, distribution of Anopheles plumbeus was once again found to be a associated with increased WNV outbreak risk.
Discussion
Several hypotheses have been proposed to explain the extraordinary expansion and severity of the 2018 WNV outbreak in Europe. One author concluded that aberrant temperature and drought preceded by anomalous precipitation in Q2 (i.e., a “wet spring”) were likely to be the leading cause8. Others have drawn similar conclusions, but the precise seasonal timeline varies region-to-region41. The present work lends support to such hypotheses. Indeed, even basic descriptives confirms the anomalous climate hypothesis (Figure S3a). For example, we found that minimum and maximum temperature were significantly higher in 2018, for both Q2 and Q3 and vapor pressure reflected this same trend. We also found accumulated precipitation to be significantly lower in 2018 for both quarters. Distribution of the most commonly cited mosquito vector for WNV, culex pipiens, was also found to be higher in 2018. And indeed, this corresponded with a transmission season that was longer and more flatly distributed (Figure S3b). The present work however expands upon this, by precisely confirming geospatial variability in determinant timing, while also robustly quantifying the overarching effect of large-scale climatic drivers and downstream causal dynamics.
Overcoming the modifiable area unit problem
The “modifiable area unit problem” (MAUP) refers to the geospatial manifestation of a very basic statistical reality – observed means change in relation to the underlying distribution. And it has been noted now for well over 40 years as major confounder in geospatial analysis25. The present work, in addition to demonstrating a novel method to mitigate this effect, also demonstrates the degree to which the observed magnitude, ranking, and direction of effect changes as a function of geospatial scale and context. This was accomplished by exploiting a recently introduced method for reanalyzing a classification tree model to statistically homogenized effect matrix for use as a surrogate data model. As noted previously, the effect matrix is dimensionally identical to the original data but homogenized with respect to units, scale, and interpretation. Diagnostic analyses furthermore confirmed the surrogate data model to be statistically comparable with the original data (Figure S4a) but more robust with respect to tertiary multivariate analyses (Figure S4b), particularly for smaller feature set sizes. The ability to robustly standardize for and compare the effect of scale-dependent variabilty is a direct product of these features. Such analyses are shown, not only to confirm associations previously only suggested in the literature, but also to allow for the discovery of novel associations at both global and locale-specific scales. This feature allowed for expert identification and quantification of the overarching determinative effect of large-scale climatic variability (Figure 6b). This method furthermore allowed for the identification of several under-reported or plausible locale-specific host and vector populations (Figure S5). Previous work similarly employed classification tree models to predict potential reservoir hosts for WNV26, but ours does so based on a direct assessment of contribution to disease outbreak risk. Taken together, this confirms the work of Cohen J.M. et al (2016)27 and others, who suggest that failure to address the MAUP may result in the effect of biodiversity and environment being downplayed in favor of climate.
Assessing the potential impact of climate change on outbreak risk
The geospatial distribution of outbreak risk was shown to vary as a function of the feature class being observed. Our results confirm the primacy of climate as a mechanistic determinant of WNV outbreak risk. However, we also identified an equally influential constellation of features with a markedly different geospatial profile. The effect of non-climatic drivers, taken as a whole, extend northward well past the presently recognized limits of the zone commonly considered to be at risk for WNV. The risk-zone associated with non-climatic drivers extends well into the Baltics and parts of Scandinavia. Given the context of anthropogenic climate change, this is a finding that should not be discounted. These regions have historically been home to the similarly arthropod-borne illness malaria28. Furthermore, these latitudes are host to several species of trans-Saharan migrants from whom WNV has already been isolated29. Germany experienced its first autochthonous WNV cases in the year immediately following the first detection of WNV in local bird populations30. Indeed, according to our model for 2018, the regions in Germany affected by WNV in 2019 share a similar risk profile with the at-risk Baltic and Scandinavian regions. These results therefore suggest the potential for sudden changes in the geospatial range of WNV based on the preexistence of underlying non-climatic determinants. This work furthermore identified patterns of causation suggestive of the pivotal role of increased vectorial capacity due to regular climate variability. Most particularly, we found that a vector noted by the ECDC as being transmission capable but of low competence31 may have contributed substantially to the increased spread of WNV noted in 2018 (Figure 5, S5). In addition, the present study also identified a trans-Saharan migrant whose range has shown notable increase over recent years (Table S1b), likely due to anthropogenic change. These findings suggest that the true threat of changing climate may potentially lie in the “activation” of: otherwise dormant components of the transmission chain. Consequently, as seen in 2018, large-scale expansion of outbreak risk may occur far more suddenly than suggested by present climate change impact models32–35. Indeed, our clustering analysis revealed a sizeable area (ERZ4, see Figure 7) vulnerable to this very outcome. About one-tenth of the regions within this zone were affected by WNV in 2018. A diagnostic assessment revealed which factors were most strongly associated with this distinction, at this scale (see Figure S6). Overall, as shown in Figure 7c, the underlying determinant structure suggests non-climatic factors presently contraindicate sustained transmission within this zone. Whether this might change due to shifts in vector populations due to anthropogenic changes or importation events is not immediately apparent. However, the similarity between ERZ4 and the historically endemic ERZ5 with respect to climatic risk suggest an urgent need to establish surveillance to prevent the establishment of more competent vector populations.
Limitations
The scope of the present paper explicitly excludes elaboration as to the mathematical properties and interpretative consequences of the underlying estimation method used to produce these results. Lundberg, et al19 provide extensive methodological discussion on the present application; and the underlying theory, both with respect to local feature estimation and interpretation, has been established in literature and practice for well over 60 years36,37. Nevertheless, given the novelty and wide-ranging implications of the present application with respect to geospatial disease modeling, efforts to quantify the scope of interpretative application are considered to be a necessary next step.
Conclusions
Recent years have seen substantial increases in the number of locales reporting West Nile virus (WNV) in Europe. The emergence of new lineages or strains has been another factor contributing to the increased number and severity of cases. Improved surveillance resolutions and predictive capabilities are therefore urgently needed. In this study we identify and describe the drivers of WNV outbreak risk in Europe using a novel method for generating fine-scale, local models. State of the art predictive performance and resolution are demonstrated using a novel methods in eXplainable AI. Novel analytical applications are demonstrated, including methods for: standardizing for scale-dependent feature effect variability, explanatory hypothesis testing, risk-based geospatial clustering, and even tertiary feature effect modeling. Results align well with the literature and several under-reported drivers are highlighted. Our results suggest anomalous variability in large scale climatic systems may have likely enhanced local climate suitability in 2018, partly driving increased spread and intensity of transmission in Europe.
The methods presented herein allow for an unprecedented degree of experimental efficiency. Given the parallelism among drivers of many infectious diseases, we expect broader scale adoption of such methods to lead to rapid advancements in capabilities to predict and control a broad array of outbreaks – which is crucial in light of changing climate and emerging infectious disease.
Data Availability
All data is available upon request.
Author contributions
The authors worked collaboratively and contributed equally to the article.
Competing interests
The authors declare no competing interests.
Legends
Acknowledgments
A.A.G. received funding from the European Center for Disease Control and Prevention (contract no ECDC.9504). J.R. received funding from the Swedish Research Council Formas (grant nos. 2018– 01754 and 2017–01300). We would like to thank the European Center for Disease Control for kindly providing funding, case and vector data, and support. And special thanks to the team at Exploratory.io who provided technical guidance and support.