Assessing the impact of data aggregation in model predictions of HAT transmission and control activities

Since the turn of the century, the global community has made great progress towards the elimination of gambiense human African trypanosomiasis (HAT). Elimination programs, primarily relying on screening and treatment campaigns, have also created a rich database of HAT epidemiology. Mathematical models calibrated with these data can help to fill remaining gaps in our understanding of HAT transmission dynamics, including key operational research questions such as whether integrating vector control with current intervention strategies is needed to achieve HAT elimination. Here we explore, via an ensemble of models and simulation studies, which aspects of the available data and level of data aggregation, such as separation by disease stage, would be most useful for better understanding transmission dynamics and improving model reliability in making future predictions of control and elimination strategies.

Introduction

[...] the former Equateur province of DRC between 2000 and 2012 [5]. However, the screen, diagnose and treat strategy has been unable to effectively control transmission to this level in all endemic foci (e.g. some health zones of Kwilu province, DRC), probably due to insufficient levels of coverage, imperfect diagnostics, or people at high risk of transmission not participating in screening activities.

Where epidemiological and/or control campaign data of infectious diseases are available, data-driven models have proved to be a valuable tool for quantitatively assessing epidemiological assumptions about disease transmission dynamics or evaluating the effectiveness of intervention measures [6-8]. For HAT, data arising from several interventions implemented in recent years have enabled modelling and quantitative analyses of the potential advantages of novel interventions in endemic regions such as Kwilu and former Equateur province in DRC [9-11], Mandoul in Southern Chad [12], and Boffa in Guinea [13]. Nonetheless, many epidemiological aspects of HAT remain unclear, and additional data are needed to fill these knowledge gaps. For example, neither the role of certain subpopulation groups in maintaining transmission in endemic areas (such as those not covered by screening programmes or at unusually high risk due to behavioral or geographical characteristics) nor the potential existence of reservoir animal hosts or asymptomatic human carriers is fully understood [14].

With the 2030 EOT goal on the horizon, it is crucial to determine which efforts in which locations could maximise the potential benefits of any intervention against HAT. Modelling could provide the HAT community with a better understanding of the important factors affecting observed changes in intensity of disease reporting and explain some of the variations in effectiveness of HAT control and surveillance activities across different settings.

In this study we analyse a longitudinal human epidemiological data set of HAT from former Bandundu province in the DRC to outline how the type of data and its level of aggregation may affect projections of HAT transmission models. Four independent HAT models, fitted to different data aggregation sets, are used to investigate how the level of data aggregation impacts the projections of HAT incidence and the likelihood of achieving the EOT goal for current and intensified intervention strategies. Although the 2030 goal is defined as EOT for the continent, and therefore meeting EOT within Bandundu is not directly equivalent, failure to meet the goal in this high-endemicity region would imply failure to meet the global EOT target. The implications of data resolution for the estimated effectiveness of strategies are analysed in order to suggest potential improvements in data collection and availability that could contribute to robust assessment of control programme effectiveness and reliable estimates of HAT elimination.

Materials and methods

Data description and assumptions

Former Bandundu province in the DRC has the world's highest HAT burden despite a significant coordinated effort between national and international HAT control programmes [5]. This province covers an area of 296,500 km² (12.6% of DRC) and [...]

In this study we used publicly available provincial level human case data from Bandundu province [5] to calibrate models of HAT transmission. The data contain the annual number of positive cases for each stage of the disease detected through active screening and passive detection (the primary HAT control interventions implemented in this area), and the total screened population across the province for the years 2000-2012. Although the geographical scale of these province-level data is large, they were chosen because, to the authors' knowledge, they are the only data available (either publicly or on request) that provide details on the stage of reported cases for many consecutive years.
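As a minimal sketch of what aggregation over disease stage does to data of this shape, the snippet below collapses staged counts into unstaged totals and computes the proportion of stage 1 cases for one year. The record layout, the field names and every count are hypothetical placeholders chosen for illustration; they are not values from the Bandundu data set [5], and the sketch assumes that the detection route (active or passive) is retained after aggregation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearRecord:
    """Hypothetical layout for one year of province-level case data."""
    year: int
    screened: int        # people tested in active screening
    active_stage1: int
    active_stage2: int
    passive_stage1: int
    passive_stage2: int


def to_unstaged(rec):
    """Aggregate over disease stage, keeping the detection route."""
    return {
        "year": rec.year,
        "screened": rec.screened,
        "active_cases": rec.active_stage1 + rec.active_stage2,
        "passive_cases": rec.passive_stage1 + rec.passive_stage2,
    }


def stage1_proportion(rec):
    """Proportion of stage 1 cases among all reported cases in a year."""
    stage1 = rec.active_stage1 + rec.passive_stage1
    total = stage1 + rec.active_stage2 + rec.passive_stage2
    return stage1 / total if total else float("nan")


# Illustrative numbers only (NOT taken from the data set).
example = YearRecord(2000, 100_000, 300, 200, 100, 400)
unstaged = to_unstaged(example)
```

Once the counts are summed, the stage 1 proportion can no longer be recovered from `unstaged`, which is the information loss that the different data configurations are meant to probe.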

Estimates of the population of Bandundu were taken from publicly available census data [15] for 2000-2012 and a 3% annual growth rate was assumed for projections.
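The growth assumption can be illustrated with a short sketch. It assumes the 3% rate is compounded annually from the last year with census data, which is one plausible reading of the text rather than a confirmed detail of the models; the function name and the base population used below are hypothetical.

```python
def project_population(base_population, base_year, target_year, annual_growth=0.03):
    """Project a population forward, assuming annual compounding.

    `annual_growth` defaults to the 3% rate stated in the text; the
    compounding form itself is an assumption of this sketch.
    """
    if target_year < base_year:
        raise ValueError("target_year must not precede base_year")
    years_ahead = target_year - base_year
    return base_population * (1.0 + annual_growth) ** years_ahead


# Hypothetical base population, for illustration only.
projected_2030 = project_population(1_000_000, base_year=2012, target_year=2030)
```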

Although target populations are usually estimated prior to each active screening round, these data were not publicly available and the target varies from year to year depending on the health zones screened. To determine a consistent estimate over 13 years, each model assumed a constant proportion of the population at risk over the entire period, either fixed or estimated during model calibration (see details in S2 Text).

medRxiv preprint, doi: https://doi.org/10.1101/19005991; this version posted September 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

[...] All of them were based on models previously used in either simulation or data-driven studies [9,10,16-18] and include modifications, independently implemented by each group, to improve calibration to the data analysed here. Differences in structural assumptions (e.g. disease progression, heterogeneity in risk to infection) and parameterisation reflect the variety of complexities and biological uncertainties typically found in epidemiological models. Furthermore, a range of different fitting methodologies were employed, which also have implications for the results. An overview of key aspects of model structure, interventions and fitting procedure is given in Table 1 and more details of each of the models can be found in S2 Text.

The reported number of cases detected through active and passive screening and the number of people tested were used to calibrate the models emulating the effects of a typical medical control strategy. The data do not contain information on the timing and duration of active screening, so each modelling group independently managed these aspects (see Table 1).

The models were calibrated to three different configurations of the data to reflect the diversity of data resolution usually available, allowing the analysis of the impact of data detail on both uncertainty and reliability of model projections. The three configurations were labelled: "unstaged data", "staged data" and "subset staged data". "Unstaged [...]

[...] [19,20], although estimates of the improvement on the associated detection rate have not yet been quantified.

• Vector control. This intervention focuses on increasing the mortality and reducing the density of tsetse flies by, for example, deploying insecticidal baits (e.g., insecticidal targets, insecticide-treated cattle) to attract and kill tsetse. In particular, tiny targets [21] offer great promise for the large-scale and cost-effective control of the riverine tsetse species which transmit gambiense HAT [12,21-23].

[...] underreporting. We also assumed that the treatment rate of detected cases remained the same so that increased detection led to a corresponding increase in the treatment rate.

The calibrated models were used to simulate the "future" effects of these three strategies (Table 2) [...]

The increasing trend in the proportion of stage 1 cases out of total reported cases across years (Fig 2) indicates improved screening in Bandundu; this is observed in both active and passive case data (S1 Fig and S2 Fig). Model fits not informed with staging ratios produced the worst estimates of this proportion and the highest uncertainties (Fig 2), [...]

In all but one case (Model I fitted to unstaged data), the models found that it was extremely unlikely that elimination would occur by 2030 using the baseline strategy. All fits for Models W and Y predicted elimination using vector control tools in addition to [...]

[...] in Models I and W contrasted with the results of Models S and Y, where no significant changes were found (S1 Table). The higher disparity among models in predicting elimination [...]

[...] impact our optimism about a particular strategy. A key example is that model calibrations using staged data for Bandundu province strongly suggest that passive detection rates have improved over time, whilst this is unobservable in the unstaged data.

The data that countries use to determine their elimination policies for HAT are usually limited and come mainly from screening activities. Our results emphasize the need for incorporating staging information in data sets. With current screening protocols, minimal additional effort in data recording is required to systematically include staging, which would help to reduce uncertainties in assessing progress towards elimination goals.

In the future, staging information may no longer be collected if new diagnostic tools and treatments are stage-independent. For example, the new drug, fexinidazole [24], is an all-in-one oral treatment for both stages recently approved by the European Medicines Agency. However, until such tools become part of regularly implemented policy, we emphasise the utility of making routinely collected staging data available.

Data delays
There are routinely delays between case detection in the field and the availability of the data for modeling purposes. The extreme example of a six-year delay between data collection and availability considered in this study, though unlikely given improvements in data availability, was chosen to demonstrate how the absence of up-to-date data impacts model predictions. One or two missing years would still provide less accurate results than up-to-date data, especially due to the lack of information on recent active screenings. Nevertheless, we expect that predictions generated with fewer missing years would be more similar to predictions using the full data set than those generated with the six missing years investigated in this study.
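A data delay of this kind can be emulated by withholding the most recent observations before calibration. The sketch below assumes that a delay of d years simply removes the last d annual records from a yearly series; the helper name and the placeholder values are hypothetical and do not reproduce the procedure or data used by any of the four models.

```python
def truncate_for_delay(series, delay_years):
    """Drop the most recent `delay_years` years from a {year: value} series."""
    if delay_years < 0:
        raise ValueError("delay_years must be non-negative")
    if not series:
        return {}
    cutoff = max(series) - delay_years  # last year still available
    return {year: series[year] for year in sorted(series) if year <= cutoff}


# Placeholder values; only the years mirror the 2000-2012 data period.
full_series = {year: 0 for year in range(2000, 2013)}
delayed_series = truncate_for_delay(full_series, delay_years=6)
```

Under this assumption, a six-year delay applied to 2000-2012 data leaves the years 2000-2006 for fitting, while shorter delays can be explored by changing `delay_years`.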

As we approach elimination, including recent data sets is necessary to better assess the actual trends, as our results have suggested. Use of most recent data sets can be sufficient to reproduce current epidemiological trends and the absence of these data sets could affect model projections, especially for short timelines. Improvements in the time between data collection and availability could enable modelling to provide more up-to-date guidance and monitor for early-warning signs of obstacles on the road to [...]

[...] estimated for this period [5]. Model W uses overdispersion parameters to capture the variation in data between different years, so fitting to finer resolution data would likely explain the source of this variation, and reduce the very large credible intervals from the current results.

The peaks observed in the data could arise due to differences in HAT prevalence in the geographical areas in which the active screening occurs between years, due to differences in the quality or coverage of the screening campaigns between years, or reflect true inter-annual variation in HAT epidemiology. Only detailed case data at a finer spatial scale could help models to explore alternative assumptions, capture spatial heterogeneity to better identify geographic reservoirs and improve predictions in global HAT status. Model calibrations at a health zone or finer spatial scale are needed to [...]

Our results agree with previous modelling work indicating that potential strategies that integrate vector control with medical interventions could accelerate progress towards elimination, particularly in high endemicity or persistent hotspots [10,11,13,17,18].

This is consistent with reductions in HAT transmission reported after implementation of cost-effective vector control methods in highly endemic locations in Guinea [22] and [...]

[...] costs, but it is non-trivial to assess the costs of these interventions without simulating cessation strategies and using a cost model.

Cost-effectiveness analyses using dynamic modelling frameworks require assessment of health outcomes (such as years of life lost, and disability adjusted life years due to disease) against a budget or willingness-to-pay threshold, which can lead to the selection of strategies that are not the least expensive because of their relative gain in health benefits [27]. This health-economic work is beyond the scope of the present study, which primarily seeks to address the impact of data aggregation on model fitting and projections. Assessment of cost-effectiveness is clearly an interesting and important objective for future analyses which aim to provide specific, regional recommendations for strategy selection. Such work would ideally provide more local strategy guidance (at a scale smaller than the province considered here) so that only regions that require complementary interventions include them, rather than assuming blanket coverage of additional strategies across large areas.
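To illustrate how a willingness-to-pay threshold can favour a strategy that is not the least expensive, the sketch below uses the standard net monetary benefit formulation (threshold multiplied by health gain, minus cost). No such analysis was performed in this study: the strategy labels are borrowed from the text only as names, and all costs, health gains and thresholds are hypothetical numbers invented for the example.

```python
def net_monetary_benefit(dalys_averted, cost, willingness_to_pay):
    """Standard net monetary benefit: threshold * health gain - cost."""
    return willingness_to_pay * dalys_averted - cost


def preferred_strategy(strategies, willingness_to_pay):
    """Return the strategy name with the highest net monetary benefit.

    `strategies` maps a name to (dalys_averted, cost), both measured
    relative to the same comparator.
    """
    return max(
        strategies,
        key=lambda name: net_monetary_benefit(*strategies[name], willingness_to_pay),
    )


# Hypothetical values for illustration only (not results of this study).
strategies = {
    "baseline": (0.0, 0.0),                      # comparator
    "with_vector_control": (1000.0, 400_000.0),  # more costly, more health gain
}

choice_high_threshold = preferred_strategy(strategies, willingness_to_pay=500.0)
choice_low_threshold = preferred_strategy(strategies, willingness_to_pay=300.0)
```

With these invented numbers the costlier strategy is preferred at the higher threshold and the comparator at the lower one, so the ranking depends on the threshold as well as on cost.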

Between 2011 and 2013, a study was performed to analyse the effects of coordinated vector control (using tiny targets) and mass screening in an area of over 300 km² in the endemic focus of Boffa in Guinea [22]. This study recorded highly detailed pre-intervention geo-referenced data of households and inhabitants (familial clustering via a unique code; name, sex and age of family members); annual screening data; and [...]

We investigated the role of the type and level of aggregation of epidemiological data on recommended control strategy by analysing publicly available HAT case data using four different mathematical models. Our results show that the lack of detailed [...]

[Table fragment; column headers and earlier rows are missing]
Presence of alternative sources of blood meals (e.g. pigs) | x | Better understand feeding behaviour of tsetse flies to investigate potential roles of animal reservoirs
Family clustering | x | Spatial modeling to better identify foci
The list is not exhaustive. Abbreviations: AS: active screening; PD: passive detection.

[...] intervals and either over or underestimate effectiveness of interventions. Across all models and configurations of data sets, the present study suggests that adding vector control to current active and passive screening is likely to be the best strategy to reduce transmission quickly in this region (former Bandundu province, DRC). For the other strategies (including current active and passive detection, and enhanced passive detection with active screening), the probability of achieving elimination and the prediction of the time to elimination vary among models and depend on the data configuration used for calibration.

Our study suggests that improved availability of epidemiological data, particularly [...]

[...] subsequent reduction in HAT transmission. Given the highly focal nature of HAT, we expect that models fitted to recent staged data at smaller spatial scales (e.g. health zone level) will provide valuable information for local planning, monitoring and adapting HAT interventions to reduce transmission and achieve elimination.

Supporting information

S1 Text. Remarks on former Bandundu province case report data.