Updating estimates of Plasmodium knowlesi malaria risk in response to changing land use patterns across Southeast Asia

Background Plasmodium knowlesi is a zoonotic parasite that causes malaria in humans. The pathogen has a natural host reservoir in certain macaque species and is transmitted to humans via mosquitoes of the Anopheles Leucosphyrus Group. The risk of human P. knowlesi infection varies across Southeast Asia and is dependent upon environmental factors. Understanding this geographic variation in risk is important both for enabling appropriate diagnosis and treatment of the disease and for improving the planning and evaluation of malaria elimination. However, the data available on P. knowlesi occurrence are biased towards regions with greater surveillance and sampling effort. Predicting the spatial variation in risk of P. knowlesi malaria requires methods that can both incorporate environmental risk factors and account for spatial bias in detection. Methods & Results We extend and apply an environmental niche modelling framework as implemented by a previous mapping study of P. knowlesi transmission risk which included data up to 2015. We reviewed the literature from October 2015 through to March 2020 and identified 264 new records of P. knowlesi, with a total of 524 occurrences included in the current study following consolidation with the 2015 study. The modelling framework used in the 2015 study was extended, with changes including the addition of new covariates to capture the effect of deforestation and urbanisation on P. knowlesi transmission. Discussion Our map of P. knowlesi relative transmission suitability estimates that the risk posed by the pathogen is highest in Malaysia and Indonesia, with localised areas of high risk also predicted in the Greater Mekong Subregion, The Philippines and Northeast India. These results highlight areas of priority for P. knowlesi surveillance and prospective sampling to address the challenge the disease poses to malaria elimination planning.


Introduction
Articles identified through searching Web of Science for "knowlesi" or "monkey malaria"

145
The infection risk model incorporated 20 environmental covariates (Table 1) . Here, we instead used the healthcare accessibility surface through the calculation of both a tree coverage and a tree loss metric. We defined tree loss to be the proportion 168 of forest area lost within each 5

MODIS/IGBP landcover
Open shrublands Proportion of land with given land classification [43]. Annual Woody savannas " " " Annual Savannas " " " Annual Grasslands " " " Annual Permanent wetlands " " " Annual Croplands " " " Annual Cropland/natural vegetation mosaic " " " Annual Urban and built up " " " Annual is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint tween a region's environment and the occurrence of P. knowlesi transmission. Regression trees produce an 176 approximation of some latent function (e.g. the probability of a P. knowlesi infection occurring) by recursively 177 splitting across potential predictor variables (e.g. environmental covariates). The points at which these splits 178 occur and the value assigned across each split region are selected such that the error between the regres-179 sion tree and the observations is minimised [44]. Boosted regression trees extend upon the regression tree 180 framework by producing a large number of trees and combining them in an ensemble (a process known as 181 boosting) such that they better approximate the latent function [45]. Boosted regression trees are able to fit 182 complex nonlinear responses including high-dimensional interactions between explanatory variables due to 183 their hierarchical tree structure and have been shown to exhibit high predictive accuracy [46]. Finally, boot-184 strapping of the boosted regression tree process can be performed, allowing for uncertainty in the output to 185 be estimated [47].

186
When applied to presence-absence data (such as from a systematic survey), niche models generally use a 187 binomial likelihood to represent the probability of a species being present at a given location. Where most 188 of the data available for modelling are presence-only, as is the case for P. knowlesi malaria, it is common 189 practice in niche modelling to supplement occurrence records with "background" points to represent areas 190 where the species or disease has not been reported [46]. A variety of approaches have been employed to select 191 background points, including sampling to ensure that their spatial distribution emulates the sampling bias in  Most P. knowlesi occurrences to date are recorded in Malaysia, Brunei and Singapore, with all three of these 194 countries having eliminated the human malaria species (e.g., P. vivax and P. falciparum), such that that P. 195 knowlesi is routinely considered a potential cause of malaria cases. Outside of these countries, surveillance for  [35], as we expected this survey to have similar sampling bias to that of macaque P. knowlesi infection records.

208
This approach is not biased by the under-ascertainment of P. knowlesi infections that arise due to asymp-209 tomatic/submicroscopic or spontaneously resolving disease [9, 10, 11], given such effects would be expected 210 to be uniform geographically. year the sample was recorded. We produced a covariate for host species, indicating if the sample was collected 223 from a human, a mosquito or a macaque. We repeated this process to produce 500 bootstrapped datasets. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint perparameters for model fitting were unchanged from the defaults provided by seegSDM version 0.1-9 (initial 226 trees = 10, learning rate/shrinkage = 0.005, tree complexity = 4, maximum trees = 10,000). We produced pre-227 dictions across each of the 500 bootstrapped models, with summary statistics including mean, variance, and 228 interquartile range calculated for each 5 × 5 km grid cell across Southeast Asia ( Figure S1). As in the 2015 229 model, we restricted predictions to areas within the range of macaque and mosquito species known to be re-  . 246 We calculated covariate relative influence scores for each bootstrapped model, representing the number of 247 times a variable is selected for regression tree splitting, weighted by the squared improvement to the model as 248 a result of each split and averaged over all trees [52]. We summarised these scores across the models as means 249 and 95% confidence intervals, with mean values also being used to rank the relative covariate importance.  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint    is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint used in model fitting and evaluation was 524. Of these, 396 were within the training region of Malaysia, Brunei 277 and Singapore, with the remaining 128 located elsewhere in Southeast Asia (Table S1).  Figure S2A).

301
Such predictions may be suggestive of a lack of surveillance in these regions, or that an environment is con-302 ducive to transmission but currently lacking widespread occurrence of a necessary vector or host species (e.g.  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint The new covariates of tree coverage and forest loss were found to be highly influential; out of 21 covariates (20 308 environmental covariates and the species covariate), the median rank for tree coverage was 5 (95% confidence 309 interval: 2-11), and for forest loss was 10 (95% confidence interval: 6-15). Plots of the accumulated local 310 effects (ALE) describing the influence of each continuous covariate across the covariate's range are presented 311 in Figure S6.

313
In this study, we utilised an environmental niche modelling approach to predict the relative suitability for P.  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint  is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint Figure S4: Histograms (A) and quantile-quantile plot (B) comparing the distributions of mean predicted transmission suitability for the 2015 and 2020 models of P. knowlesi transmission risk. Histograms are presented on a relative x-axis (ranging from minimal to maximal predicted mean risk), with quartiles of predicted risk displayed as dashed vertical lines. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint Figure S6: Accumulated local effect (ALE) plots for each covariate, indicating the mean effect of changing a covariate's value upon the prediction (on logistic scale) across the range of that covariate. The ALE values are calculated for each bootstrap, with the median value, 50% and 95% confidence intervals presented as lines, darker shaded regions and lighter shaded regions respectively.

22
. CC-BY 4.0 International license It is made available under a perpetuity.
is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted August 8, 2023. ; https://doi.org/10.1101/2023.08.04.23293633 doi: medRxiv preprint