Dengue Hospitalizations in Brazil: Forecasting with Climatic and Physicians’ Digital Search Data under Real-World Reporting Delays
====================================================================================================================================

* Dayanna Quintanilha Palmer
* Marcela Motta
* Eduardo Cardoso de Moura
* Danielly Xavier
* Guilherme Schittine
* Angélica Caseri
* Ronaldo Altenburg Gismondi

## Abstract

Timely forecasting of dengue hospitalizations is essential for public health preparedness but is frequently limited by delays in official reporting systems. Climatic conditions strongly influence dengue transmission, yet hospitalization data often become available weeks after patient admission, reducing their value for early response. Digital information generated during clinical practice may provide a more timely signal of emerging disease activity.

This study evaluates whether integrating climate data with real-time records of physicians’ searches for dengue-related information improves short-term forecasts of dengue hospitalizations in Brazil under both ideal and realistic reporting conditions. Weekly hospitalization counts, weather indicators, and anonymized physician search data from a widely used clinical decision-support platform were combined to generate forecasts across multiple geographic regions. Model performance was compared under two scenarios: one assuming immediate availability of hospitalization data and another incorporating typical reporting delays.

When hospitalization data were timely, climate-based models achieved the highest predictive accuracy. Under realistic reporting delays, however, models incorporating physicians’ search behavior consistently outperformed approaches relying solely on climate information or hospitalization history. In several regions, increases in physician search activity preceded or coincided with rises in hospital admissions, indicating early clinical engagement with dengue cases.

These findings indicate that physician search behavior constitutes a valuable real-time indicator of dengue activity. Integrating digital clinical behavior with climate data enhances forecasting performance under real-world reporting constraints and may strengthen early-warning systems and public health decision-making for dengue and other climate-sensitive diseases.

**Author Summary** Dengue is a major public health challenge in Brazil, where large outbreaks place sudden pressure on health services. Although climate conditions influence dengue transmission, public health responses often rely on hospitalization data that become available only weeks or months after patients are admitted, limiting the ability to act early. In this study, we explored whether real-time digital information generated by physicians could help overcome this delay.

We combined weather data with anonymized records of physicians’ searches for dengue-related information within a widely used clinical decision-support platform in Brazil. We then tested whether these digital signals could improve short-term forecasts of dengue hospitalizations across different regions of the country, especially when official hospital data were delayed.

We found that climate patterns were strong predictors when hospitalization data were timely. However, under realistic reporting delays, models that incorporated physicians’ search behavior produced more accurate forecasts. These findings show that digital clinical behavior can provide early insight into rising disease activity and support more timely public health responses.

Key words
*   Dengue
*   Climate indicators
*   Reporting delay
*   LSTM forecasting
*   Digital epidemiology

## 1. Introduction

Dengue virus (DENV) is an arbovirus with four distinct serotypes (DENV-1 to DENV-4), primarily transmitted by *Aedes* mosquitoes, especially *Aedes aegypti*. While primary infection may be asymptomatic or present with mild symptoms, severe cases can progress to dengue hemorrhagic fever and dengue shock syndrome, characterized by coagulopathy and increased vascular permeability, which can ultimately be fatal if not promptly recognized and managed (1,2). In 2024, Brazil recorded approximately 6.42 million probable dengue cases, representing a 327% increase compared to 2023, when 1.51 million probable cases were reported (3)

Several factors may contribute to the rise in dengue cases, with climate change emerging as a key driver of mosquito-borne disease dynamics. Temperatures exceeding regional threshold levels accelerate the life cycle of dengue vectors (*Aedes aegypti*, and *Aedes albopictus*), increase dengue virus proliferation and mosquito bite frequency, and shorten the extrinsic incubation period (4,5). Higher humidity and precipitation have been linked to increased dengue transmission, as they create favorable conditions for virus replication and spread (4,6). Furthermore, socio-economic factors play an important role in shaping dengue transmission patterns, contributing to a complex system that influences the probability of future outbreaks (7,8).

Accurately mapping geographic distribution and predicting outbreaks is essential to guide resource allocation and enable timely public health interventions (6,9). Several predictive models have been proposed, but most focus exclusively on climate variables and often neglect the effect of time lags in transmission dynamics (8). In addition to these analytical lags, forecasting efforts are also constrained by reporting delays in official hospitalization and case-notification systems, which reduce the timeliness and operational usefulness of surveillance data (10). Moreover, digital behavioral data from physicians—such as clinical search activity within medical decision-support platforms—have not yet been explored in the context of dengue modeling in Brazil, despite their potential to provide early indicators of clinical demand and contribute to real time decision making.

Integrating meteorological and non-climatic factors may substantially improve predictive accuracy (11,12). In this context, digital health data provides a novel opportunity for syndromic surveillance (13,14). Patterns of access to medical applications can reflect real-time shifts in medical concerns and population health trends. In Brazil, the Afya Whitebook® application, a widely used digital health tool among physicians, captures search behaviors, including queries related to dengue. In a previous study, these digital traces showed significant correlation with hospitalizations for dengue, supporting their use as a proxy of clinical demand (15).

Building upon this rationale, the present study proposes a predictive framework to forecast weekly dengue hospitalizations in Brazil by integrating climatic variables, modeled with lagged predictors to capture delayed environmental effects, and digital health indicators derived from physicians’ search behavior within the Afya Whitebook® platform. To address hospitalization reporting delays, we developed two modeling scenarios: an ideal-data model, assuming real-time data availability, and a real-world model, which incorporates the typical delay observed in hospitalization record updates. By comparing these model configurations, we aim to evaluate how incorporating realistic data latency and temporal dependencies can enhance the accuracy and operational relevance of short-term dengue forecasting.

## 2. Methods

### 2.1 Study design

We developed a forecasting framework based on Long Short-Term Memory (LSTM) neural networks to predict dengue-related hospitalizations in Brazil at the Immediate Geographic Region (IGR) level. **Figure 1** provides an overview of the analytical workflow, which includes: 

*   (i) integration of climate, physicians’ clinical search, and hospitalization datasets from public service;

*   (ii) preprocessing steps such as weekly aggregation, imputation of missing values, and consolidation at the IGR level;

*   (iii) derivation of additional climate indicators;

*   (iv) variable screening using LSTM-based relevance evaluation; and

*   (v) model training and interpretability using SHAP (SHapley Additive Explanations) values.

![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2026/01/15/2026.01.12.26343977/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/F1)

Figure 1. 
Workflow of the analytical pipeline, including preprocessing of weather station data, weekly aggregation, imputation of missing values, microregion-level aggregation, derivation of climate variables, LSTM-based variable selection, and SHAP analyses for both ideal-data and real-world scenarios.

Given the temporal nature of the data and the need to capture both short and long-term dependencies in dengue transmission, we employed LSTM networks, a type of recurrent neural network specifically designed to handle sequential data with temporal correlations(16). LSTM models have been successfully applied in infectious disease forecasting, including dengue(11), demonstrating strong performance in predicting case trends based on climatic and other exogenous variables.

The modeling strategy was designed to generate 8-week-ahead forecasts, a horizon chosen to reflect both epidemiological relevance and operational usefulness for public health planning. To capture the intrinsic temporal structure of dengue transmission, we implemented a rolling sliding-window approach, in which recent historical sequences were used to predict subsequent hospitalization counts. This setup enabled the model to learn meaningful temporal patterns and produce actionable short- to medium-term forecasts.

### 2.2 Study site

According to the Brazilian Institute of Geography and Statistics (IBGE), an IGR corresponds to a group of neighboring municipalities organized around one or more urban centers that function as local hubs for goods, services, and labor. These regions reflect short-distance functional relationships, particularly commuting flows and the daily circulation of people and economic activities.

The current territorial division, comprises 510 IGRs, that represent an intermediate geographic scale between municipalities and states, ensuring spatial coherence for socioeconomic, environmental, and health analyses (17).

### 2.3 Data sources

#### 2.3.1 Hospitalizations Data

The SIH/SUS database, maintained by the Department of Informatics of the Unified Health System (DATASUS), provides anonymized, publicly available information on hospital admissions throughout Brazil (18). Data are updated monthly, with a typical release delay of one to two months after hospital care. In accordance with Portaria SAES n° 1.110/2021 of the Brazilian Ministry of Health, the final consolidation of hospitalization data within the SIH/SUS may occur up to four months after the reference month, due to the administrative cycles of data submission, verification, and correction of hospitalization authorizations (10).

#### 2.3.2 Clinical Search Data

Afya Whitebook® is a widely used clinical decision-support platform among healthcare professionals in Brazil. The platform provides structured, evidence-based content such as disease overviews, diagnostic algorithms, therapeutic guidelines, drug dosing information, and clinical calculators. Optimized for mobile and desktop use, it supports real-time clinical decision-making across a variety of care settings. Although its primary audience is licensed physicians, the user base also includes residents, medical students, and other allied health professionals. For this study, only search data generated by verified licensed physicians with active professional credentials were included. In Brazil, there were 597,428 practicing physicians as of January 2024, according to the recent edition of Demografia Médica(19). In comparison, the Afya Whitebook® platform registered approximately 150,000 monthly active physician users by the end of the same year (20).

The Afya Whitebook® research database contains metadata on user search behavior, collected in real time and accessible retrospectively for research purposes via a structured query interface. All data are anonymized to ensure confidentiality and compliance with ethical standards, preventing the identification of individual healthcare professionals.

#### 2.3.3 Climate data

Daily observations of precipitation, maximum, mean, and minimum temperature, and mean relative humidity were obtained from the Brazilian National Institute of Meteorology (Instituto Nacional de Meteorologia – INMET) automatic weather stations distributed throughout Brazil(21).

### 2.4 Data pre-processing

Climate data underwent several cleaning and harmonization steps. Dates were converted to datetime format, variable names standardized, and station metadata (name, geographic coordinates, operational period) appended to each record. Only observations between 2021 and 2024 were retained.

Stations with more than 30 consecutive days of missing data were excluded, following the revised criterion established after model calibration. Among IGRs containing multiple meteorological stations, only the station with the highest percentage of completeness was retained.

Daily measures were aggregated into weekly series, with averages computed for temperature and humidity, and weekly sums for precipitation. Missing weekly values were imputed using a centered moving average, calculated with two weeks forward and two weeks backward (minimum of one available neighbor), ensuring temporal continuity of the time series.

### 2.5 Study variables

The primary outcome of interest was the weekly number of hospitalizations for dengue fever extracted from the SIH/SUS database. Predictor variables encompassed both meteorological and digital health domains. Meteorological predictors included lagged indicators of precipitation, temperature, and humidity (1–4 weeks), alongside a structured derivation process that resulted in a final set of 42 climate-related variables, including weekly change flags, interaction terms, extreme climate indicators, seasonal dummies, temperature categories, and precipitation occurrence and patterns. In addition, clinical search data were incorporated through weekly counts of physician searches for dengue in Afya Whitebook®, restricted to verified licensed physicians. Together, these variables were selected to capture the combined influence of climatic fluctuations and real-time medical information-seeking behavior on dengue-related hospitalizations. All derived variables are summarized in Table 1, and their detailed definitions and computation procedures are provided in the Supplementary Material (Table S1).

View this table:
[Table 1.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/T1)

Table 1. Summary of the categories and types of variables included in the models.

### 2.6 Feature selection

Given the presence of 42 climatic input variables derivations, we sought to reduce model dimensionality and avoid multicollinearity by selecting only one variable from each climatic category—humidity, temperature, precipitation, and seasonality. To achieve this, we employed a variable-by-variable evaluation approach, in which the model was trained using the hospitalization variable alongside a single climatic predictor at a time. For each category, the climatic variable yielding the lowest RMSE was retained. As a result, each IGR retained four climatic input variables, one from each category.

Feature importance was further assessed using SHAP values, which enabled interpretation of the most influential predictors per IGR and quantification of the relative contribution of climatic versus digital health variables.

### 2.7 Data analysis

Hospitalization, digital query, and climatic datasets were merged at the IGR–week level to form a unified analytic panel. Exploratory analyses included visualization of temporal trends and descriptive statistics across regions. Forecasting was performed using a LSTM neural network trained on weekly climatic and digital health predictors. To account for temporal dependence, the dataset was split chronologically into training and testing sets, with the training period spanning 2021–2023 and the testing period covering 2024.

All data preprocessing and modeling were conducted in a Python-based analytical environment (22). Hyperparameters were optimized using a grid search evaluating different architectural configurations, including the number of hidden layers (2–4) and hidden units (20 or 40). The final model employed an input window of 24 weeks (seq_len = 24), a prediction horizon of 8 weeks (output_size = 8), a learning rate of 0.01, and a MSE loss function. Each configuration was trained in triplicate to account for stochastic variability during optimization. Model performance was assessed using the RMSE, the MSE, the Mean Absolute Error (MAE), and the coefficient of determination (R²).

These metrics were computed as follows: ![Formula][1]</img>  To compute the coefficient of determination, we used the Residual Sum of Squares (RSS) and the Total Sum of Squares (TSS): ![Formula][2]</img>  These quantities quantify both the magnitude of prediction errors and the proportion of variance in weekly hospitalizations explained by the forecasting model.

To evaluate the relative contribution of different data domains, five model structures were compared: 

*   **Hospitalization-only model (LSTM – Hospitalization):** Trained exclusively on past hospitalization counts to capture the intrinsic temporal dynamics of dengue without external predictors.

*   **Clinical-search model (LSTM – Hospitalization + Clinical Search):** Trained using hospitalization time series combined with Afya Whitebook® physician search data.

*   **Climate model (LSTM – Hospitalization + Climate):** Trained using hospitalization data and climatic predictors, selecting the best-performing variable from each domain (temperature, humidity, precipitation).

*   **Integrated model (LSTM – Hospitalization + Clinical Search + Climate):** Combines hospitalization data, climatic indicators, and real-time physician search behavior, restricted to the most informative variable per domain.

*   **SARIMAX model (SARIMAX – Hospitalization):** A classical benchmark trained only on hospitalization counts, used as a baseline for comparative performance assessment (23).

In addition, two complementary versions were implemented: 

*   **Ideal-data model:** predictors and hospitalization data were temporally aligned.

*   **Real-world model:** hospitalization data were shifted by up to eight weeks (approximately two months).

The two-month delay window was empirically defined based on the average reporting lag observed in the hospitalization dataset, ensuring that the model reflected the actual latency pattern present in official records. Performance differences between models were statistically compared using paired t-tests across triplicate RMSE results.

A walk-forward expanding-window validation approach was adopted during training to mimic real-time forecasting and assess model robustness across temporal segments. Predicted versus observed hospitalization counts were visually inspected to evaluate both numerical accuracy and temporal alignment.

#### 2.7.1 LSTM model Architeture

LSTM networks are a special type of recurrent neural network (RNN) designed to overcome the vanishing and exploding gradient problem in long sequences (16). An LSTM unit contains a memory cell with a constant error carousel (CEC) that preserves information across time steps, while multiplicative gates regulate information flow. Gers, Schmidhuber, and Cummins (2000) extended the model with the forget gate, allowing adaptive resetting of memory contents ((24).

In our study, the LSTM architecture was adapted to model the temporal dynamics of dengue hospitalizations in Brazil. The input, forget, and output gates controlled the flow of information across time, while the memory cell preserved long-range dependencies, allowing the model to integrate both climatic fluctuations and digital health signals. Following this principle, we structured the implementation to use weekly lagged climate indicators, extreme event markers, and physician search behavior from Afya Whitebook® as inputs. Through a sliding-window training strategy, the LSTM was able to capture short-term variations and longer seasonal patterns, providing forecasts at the microregion level that aligned with the theoretical design of the architecture.

### 2.8 Ethics approval

The research complies with the ethical principles established by Resolution CNS 466/12 of the Brazilian National Health Council. All data used were anonymized and derived from public or internal databases, with no individual-level identifiers or direct participant interaction.

The research protocol was reviewed and approved by the Research Ethics Committee of the Centro Universitário Presidente Tancredo de Almeida Neves (UNIPTAN), under CAAE 81123624.4.0000.9667 and Opinion number 6.944.607. The Committee granted a waiver of informed consent given the exclusive use of anonymized secondary data.

## 3. Results

### 3.1 Characterization of included Immediate Geographic Regions and data coverage

The initial dataset encompassed 510 IGRs with dengue hospitalizations recorded in the SIH/SUS system between 2021 and 2024. After integrating hospitalization, climate, and Afya Whitebook® digital search data, sequential filtering steps were applied to ensure data completeness and model stability.

First, INMET weather stations with more than 30 consecutive missing days were excluded, and duplicated stations within the same IGR were resolved by retaining those with the highest proportion of complete observations. Clinical search data were available for 46 IGRs; however, only those with consistent weekly dengue-related searches and valid hospitalization data were retained. Additional exclusions occurred during preliminary modeling, in which IGRs whose base LSTM models produced non-positive R² values, reflecting insufficient temporal structure for forecasting, were removed.

After all filtering stages, a final set of 27 epidemiologically meaningful IGRs remained: Alegre, Belo Horizonte, Campina Grande, Campos Dos Goytacazes, Catalão, Cruz Alta, Distrito Federal, Frederico Westphalen, Ijui, Juiz De Fora, Linhares, Marília, Maringá, Oliveira, Passo Fundo, Passos, Pirapora, Porto Alegre, Ribeirão Preto, Rio De Janeiro, Salvador, Santa Cruz Do Sul, Santa Maria, São Miguel Do Oeste, São Paulo, Uberaba, Uberlândia. These regions displayed sufficient data density across all domains, enabling robust multi-step forecasting.

### 3.2 Spatiotemporal Trends in Hospitalization and Clinical Search Data (2021–2024)

Hospitalizations increased across most IGRs between 2021 and 2024 (Table 2). Smaller-population regions such as Frederico Westphalen, Oliveira, São Miguel do Oeste, and Uberlândia exhibited high hospitalization rates, frequently surpassing 200 to 600 per 100,000 inhabitants in peak year, while large metropolitan centers, including São Paulo, and Rio de Janeiro, reported low hospitalization rates, even though they presented the highest absolute numbers.

View this table:
[Table 2.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/T2)

Table 2. Hospitalizations and Rates per 100000 inhabitants in 27 Selected Brazilian Immediate Geographic Regions, 2021–2024

Despite regional variability, most IGRs exhibited a progressive increase in hospitalizations over the study period, with the most pronounced growth occurring between 2023 and 2024. Santa Cruz do Sul displayed a non-linear pattern, with an early peak in 2021, a subsequent decline, and a renewed increase in 2024. Other regions, such as Passo Fundo and Santa Maria, began with very low hospitalization rates in 2021–2022 but rose to moderate levels by 2023–2024, reflecting a delayed but evident escalation.

As shown in **Table 3**, most IGRs displayed higher dengue-related search activity in the later years of the study period compared with 2021. Although the trajectory was not strictly linear, many regions showed a general upward tendency over time, with notable increases in 2022 and again in 2024. This temporal pattern is consistent with the national hospitalization profile, which also registered higher counts in these two years.

View this table:
[Table 3.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/T3)

Table 3. Dengue-related search activity by Immediate Geographic Region (2021–2024)

### 3.3 Correlation Analysis Between Hospitalizations, Climate Variables, and Digital Clinical Search Activity

The resulting correlation matrix revealed a clear contrast in the strength and consistency of associations across variable categories, as illustrated in **Figure 2**. Afya Whitebook® access demonstrated the strongest and most frequent positive correlation with dengue hospitalizations across IGRs.

![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2026/01/15/2026.01.12.26343977/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/F2)

Figure 2. 
Correlation matrix showing pairwise correlations between dengue hospitalizations, Afya Whitebook® search activity, and climate-related predictors across the 27 Immediate Geographic Regions. Afya Whitebook® access demonstrates the strongest association with hospitalizations.

In contrast, climatic predictors displayed predominantly weak and heterogeneous correlations with dengue hospitalizations across IGRs. Humidity and temperature indicators showed low to moderate associations, with no consistent pattern across regions. Precipitation-related variables exhibited minimal and highly variable correlations, without any clear or recurrent signal in most regions.

Time-series curves of Afya Whitebook® searches and dengue hospitalizations further supported these findings, as illustrated in **Figure 3**. In nearly all IGRs, peaks in dengue-related search activity preceded or coincided with increases in hospitalizations.

![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2026/01/15/2026.01.12.26343977/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/F3)

Figure 3. 
Weekly dengue hospitalizations (orange line) and Afya Whitebook® dengue-related search activity (blue line) across Immediate Geographic Regions from 2021 to 2024. In this figure, six of the 27 IGRs were selected as representative examples for visualization. Peaks in clinical search activity typically precede or coincide with hospitalization increase, indicating early clinical engagement with dengue and supporting its use as a real-time digital indicator.

### 3.4 Selected Variables Across Immediate Geographic Regions (IGRs)

During the feature-selection procedure, we ensured that each IGR retained at least one predictor from the four predefined climatic categories—humidity, temperature, precipitation, and seasonality—reflecting their established relevance in dengue transmission dynamics. Despite this harmonized structure, the LSTM model selected markedly different climatic signatures across regions. Some IGRs were primarily influenced by short-term humidity fluctuations or temperature lags, whereas others showed stronger contributions from precipitation extremes, or accumulated rainfall.

The full set of predictors selected for each IGR, grouped by climatic category, is provided in **Supplementary Table S2**.

### 3.5 Predictive Performance of Climate, Digital, and Integrated Models

Under the Ideal-data model (Figure 4), the Climate model consistently achieved the lowest prediction errors across nearly all IGRs, displaying the smallest RMSE medians and the narrowest interquartile ranges. The Integrated model performed as the second-best approach, although its accuracy gains over climate alone were modest. In contrast, the Clinical-search model showed higher error levels and greater variability. Detailed RMSE values (mean ± standard error) for all IGRs, as well as the corresponding pairwise statistical comparisons between model configurations, are reported in **Supplementary Tables S3 and S5**.

![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2026/01/15/2026.01.12.26343977/F4.medium.gif)

[Figure 4.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/F4)

Figure 4. 
Distribution of RMSE values across the 27 Immediate Geographic Regions (IGRs) for the four modeling scenarios without hospitalization data delay: (i) LSTM- Hospitalization only, (ii) Clinical Search model (LSTM-Hospitalization + Clinical Search), (iii) Climate model (LSTM-Hospitalization + Climate), and (iv) Integrated model (LSTM- Hospitalization + Clinical Search + Climate). Each box represents the dispersion of RMSE scores across triplicate model runs for each IGR, allowing comparison of model accuracy under different predictor sets.

Under the Real-data model (Figure 5), where hospitalization data include reporting delays, the Integrated model consistently achieved the lowest prediction errors across most IGRs. Both the Climate and Clinical-search models displayed higher RMSE values, although the Clinical-search model showed relatively better performance than in the Ideal-data scenario. Climatic predictors remained informative but showed reduced accuracy relative to the no-delay condition. Consistent with this increased heterogeneity, a greater number of regions exhibited statistically significant pairwise differences between models in the Real-data scenario, as detailed in **Supplementary Tables S4 and S6**.

![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2026/01/15/2026.01.12.26343977/F5.medium.gif)

[Figure 5.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/F5)

Figure 5. 
Distribution of RMSE values across the 27 Immediate Geographic Regions (IGRs) for the four modeling scenarios when incorporating realistic hospitalization data delays. The scenarios mirror those used in the no-delay analysis, enabling comparison of prediction performance under real-world reporting constraints

### 3.6 SHAP

When comparing SHAP patterns across scenarios (Figure 6; Supplementary Figures 7 and 8), a consistent shift emerged: the predictive influence of hospitalization counts decreased markedly in the Real-world model, reflecting the loss of temporal alignment introduced by reporting delays. As hospitalization became less informative, the relative importance of the clinical-search indicator increased, appearing more prominently across multiple IGRs. Although climatic predictors remained the most influential domain overall, their predominance varied substantially across regions—some IGRs were driven by humidity-related changes, others by temperature categories or precipitation lags—highlighting strong geographic heterogeneity.

![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2026/01/15/2026.01.12.26343977/F6.medium.gif)

[Figure 6.](http://medrxiv.org/content/early/2026/01/15/2026.01.12.26343977/F6)

Figure 6. 
Across both models, the contribution of hospitalization counts dropped sharply under reporting delays. In contrast, the relative importance of the clinical-search indicator increased slightly in the Real-world scenario, reflecting its greater utility when hospitalization data becomes less informative.

## 4. Discussion

This study yielded three main findings. First, dengue-related searches on the Afya Whitebook® platform showed strong temporal alignment with hospitalizations, with search peaks often preceding or coinciding with increases in admissions. Second, climatic predictors varied substantially across IGRs, reflecting regional heterogeneity in environmental drivers. Third, under realistic reporting delays, the integrated model combining climate variables and digital search activity outperformed all other approaches, indicating that physician search behavior provides timely epidemiological information that compensates for the loss of predictive signal caused by delayed hospitalization data.

Between 2021 and 2024, clinical searches and hospitalizations increased across regions, with strong temporal alignment between the two. This indicates that physicians rapidly adjust their information-seeking behavior in response to changes in clinical demand, consistent with evidence that digital engagement can serve as an early indicator of infectious disease activity (13,14). Prior studies have shown that public digital indicators—such as Twitter activity—can forecast dengue trends by several weeks (25). By using physician search data rather than public-facing digital behavior, our study likely provides a more precise and clinically relevant digital signal, directly connected to medical decision-making and more suitable for operational forecasting.

Climatic variables contributed mainly through lagged effects, aligning with findings from Chen & Moraga (2025) that forecasting accuracy improves when models explicitly incorporate temporal delays in environmental exposures (11) In our study, contemporaneous associations between climate and hospitalizations were modest, but the LSTM consistently selected lagged temperature, humidity, and precipitation indicators as relevant inputs, and SHAP analyses confirmed their predictive influence. These results reinforce that climate is a key determinant of dengue dynamics, but its effects become detectable only when temporal structure and non-linear relationships are modeled explicitly.

Model comparisons further supported these patterns. In the ideal scenario without reporting delays, climate models achieved the lowest errors, consistent with the recognized environmental dependence of dengue transmission (4,5). When reporting delays were incorporated, predictive performance shifted: the integrated model became superior in most regions, and digital search data gained importance by providing real-time information that compensated for the diminished value of delayed hospitalization counts.

SHAP analyses clarified the interplay between predictors across scenarios. In the absence of reporting delays, climatic variables and recent hospitalization history dominated model contributions, whereas under delayed-reporting conditions, the influence of hospitalization history decreased sharply and digital search activity rose in relative importance. Together, these findings demonstrate that medium-range climatic signals and real-time behavioral indicators provide complementary information, enhancing the accuracy and operational value of forecasting models. This has practical implications for environmental and health surveillance: in settings where reporting delays are structural, incorporating digital search behavior into forecasting frameworks can improve situational awareness, support earlier public health responses, and strengthen regional early-warning systems. Similar benefits have been observed in other hybrid environmental–digital approaches used for dengue nowcasting in Brazil (26), reinforcing the potential of integrating clinical digital signals into surveillance tools.

Different limitations should be acknowledged when interpreting the generalizability of our model. The ecological design restricts inference to population-level associations and precludes individual causal interpretation. SIH/SUS hospitalization data may contain underreporting and reporting delays, potentially distorting temporal dengue patterns. Uneven INMET meteorological station coverage led to exclusions of regions with substantial data gaps, limiting nationwide applicability. Afya Whitebook® search patterns, while high-volume, represent only a subset of clinicians and may be biased toward younger, urban, or digitally engaged users. The absence of entomological or virological indicators (e.g., vector density or circulating serotypes) may also constrain the model’s sensitivity to short-term transmission shifts. Finally, LSTM models require large training datasets and remain partially opaque despite SHAP-based interpretability.

To mitigate these potential sources of bias, we implemented several measures. First, we restricted Afya Whitebook® data to verified physician accounts, reducing the risk of non-clinical or misclassified users and ensuring that digital indicators reflected genuine clinical activity. Second, we applied rigorous data-cleaning steps to the SIH/SUS time series and interpreted results only at the IGR level, where aggregation minimizes random fluctuations and reporting noise. Finally, we limited our analysis to IGRs with adequate climatic coverage, improving the reliability of environmental predictors.

This study demonstrates that clinician search behavior within a point-of care medical decision tool platform can serve as a timely and reliable digital indicator of dengue activity in Brazil. By integrating climatic conditions with real-time physician search data into our models, we consistently improve the forecasting of dengue hospitalizations, particularly when accounting for reporting delays. This integrated approach provides a more rapid and clinically grounded signal to support outbreak detection and public health decision-making. The next steps are to assess this framework’s generalizability across diverse settings and then evaluate its suitability for operational early-warning use, paving the way for its real-world application.

## 5. Supporting information

S1 Table. Climatic and clinical variables used in the forecasting models.

Description of all climatic, seasonal, and physician clinical-search variables, including units, derivation, and lagged indicators.

S2 Table. Predictors selected across Immediate Geographic Regions.

Climatic and digital predictors selected by the machine learning model across all Immediate Geographic Regions.

S1 Fig. Correlation between dengue hospitalizations, climate variables, and physician search activity.

Pairwise correlation analysis across regions.

S2 Fig. Time-series patterns of dengue hospitalizations and physician search activity.

Weekly trends showing the temporal alignment between clinical searches and hospitalizations.

S3 Table. Model performance under ideal-data conditions.

Root mean squared error values across regions for all predictive model configurations.

S4 Table. Statistical comparison of models under ideal-data conditions. Pairwise statistical tests comparing predictive performance.

S5 Table. Model performance under real-world reporting conditions.

Root mean squared error values across regions accounting for reporting delays.

S6 Table. Statistical comparison of models under real-world reporting conditions. Pairwise statistical tests comparing predictive performance under reporting delays.

S7 Fig. Feature importance under ideal-data conditions.

Relative contribution of predictors across regions based on model interpretation analyses.

S8 Fig. Feature importance under real-world reporting conditions. Changes in predictor importance when reporting delays are incorporated.

## Funding sources

Dayanna Q. Palmer reports financial and administrative support from Afya® and financial support from Google Research through the 2025 Google PhD Fellowship. Eduardo C. Moura, Marcela Motta, Danielly Xavier, Guilherme Schittine and Ronaldo Gismondi report financial support from Afya®. Angélica Caseri declares no financial support relevant to this work.

## Declaration of competing interest

Dayanna de Oliveira Quintanilha Palmer, Eduardo C. Moura, Marcela Motta, Danielly Xavier, Ronaldo Gismondi, and Guilherme Schittine are employees or collaborators of Afya®, the company that owns the Afya Whitebook® platform evaluated in this study. Afya® had no role in the study design, data collection, data management, analysis, interpretation of results, manuscript preparation, or the decision to submit this article for publication. Dayanna de Oliveira Quintanilha Palmer also receives support from the 2025 Google PhD Fellowship in Health (Google Research), which likewise had no involvement in any stage of the research. Angélica Caseri declares no competing interests. The authors declare no additional competing interests.

## Data availability

Hospitalization data from the Brazilian Hospital Information System (SIH/SUS) and meteorological data from the National Institute of Meteorology (INMET) are publicly available through the official portals listed in the references. Clinical-search data from the Afya Whitebook® platform contain proprietary and sensitive usage information and cannot be publicly released. These data can be made available upon reasonable request to the corresponding author, subject to approval by Afya®.

## Code availability

All code used for data processing, model development, and analysis is available in the project’s Jupyter notebooks. These materials can be provided upon reasonable request to the corresponding author.

## Acknowledgements

We thank Beatriz Landiosi Teixeira for creating the graphical abstract and supporting the visual development of the study’s illustrative materials.

*   Received January 12, 2026.
*   Revision received January 12, 2026.
*   Accepted January 15, 2026.


*   © 2026, Posted by openRxiv

This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## 6. References

1.  1.Khan MB, Yang ZS, Lin CY, Hsu MC, Urbina AN, Assavalapsakul W, et al. Dengue overview: An updated systemic review. J Infect Public Health. 2023 Oct;16(10):1625–42.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37595484&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

2.  2.Messina JP, Brady OJ, Scott TW, Zou C, Pigott DM, Duda KA, et al. Global spread of dengue virus types: mapping the 70 year history. Trends Microbiol. 2014 Mar;22(3):138–46.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.tim.2013.12.011&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24468533&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000332905000006&link_type=ISI) 

3.  3.Ministério da Saúde B. TABNET/SINAN – Notificações de Dengue no Brasil (2001–2025) [Internet]. 2025 [cited 2025 Oct 19]. Available from: [http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sinannet/cnv/denguebbr.def](http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sinannet/cnv/denguebbr.def)
    
    
4.  4.Colón-González FJ, Sewe MO, Tompkins AM, Sjödin H, Casallas A, Rocklöv J, et al. Projecting the risk of mosquito-borne diseases in a warmer and more populated world: a multi-model, multi-scenario intercomparison modelling study. Lancet Planet Health. 2021 Jul;5(7):e404–14.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S2542-5196(21)00132-7&link_type=DOI) 

5.  5.Damtew YT, Tong M, Varghese BM, Anikeeva O, Hansen A, Dear K, et al. Effects of high temperatures and heatwaves on dengue fever: a systematic review and meta-analysis. EBioMedicine. 2023 May;91:104582.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=37088034&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

6.  6.Messina JP, Brady OJ, Golding N, Kraemer MUG, Wint GRW, Ray SE, et al. The current and future global distribution and population at risk of dengue. Nat Microbiol. 2019 Jun 10;4(9):1508–15.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31182801&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

7.  7.Teurlai M, Menkès CE, Cavarero V, Degallier N, Descloux E, Grangeon JP, et al. Socio-economic and Climate Factors Associated with Dengue Fever Spatial Heterogeneity: A Worked Example in New Caledonia. PLoS Negl Trop Dis. 2015 Dec 1;9(12):e0004211.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pntd.0004211&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26624008&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

8.  8.Leung XY, Islam RM, Adhami M, Ilic D, McDonald L, Palawaththa S, et al. A systematic review of dengue outbreak prediction models: Current scenario and future directions. PLoS Negl Trop Dis. 2023 Feb 13;17(2):e0010631.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pntd.0010631&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36780568&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

9.  9.Bhatt S, Gething PW, Brady OJ, Messina JP, Farlow AW, Moyes CL, et al. The global distribution and burden of dengue. Nature. 2013 Apr 25;496(7446):504–7.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature12060&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23563266&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000317984400041&link_type=ISI) 

10. 10.Ministério da Saúde. Portaria SAES n° 1.110, de 18 de novembro de 2021. 2021.
    
    
11. 11.Chen X, Moraga P. Forecasting dengue across Brazil with LSTM neural networks and SHAP-driven lagged climate and spatial effects. BMC Public Health. 2025 Mar 12;25(1):973.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12889-025-22106-7&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=40075398&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

12. 12.Roster K, Connaughton C, Rodrigues FA. Machine-Learning–Based Forecasting of Dengue Fever in Brazilian Cities Using Epidemiologic and Meteorological Variables. Am J Epidemiol. 2022 Sep 28;191(10):1803–12.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/aje/kwac090&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=35584963&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

13. 13.Brownstein JS, Freifeld CC, Madoff LC. Digital Disease Detection — Harnessing the Web for Public Health Surveillance. New England Journal of Medicine. 2009 May 21;360(21):2153–7.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMp0900702&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19423867&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000266206000001&link_type=ISI) 

14. 14.Salathé M. Digital epidemiology: what is it, and where is it going? Life Sci Soc Policy. 2018 Dec 1;14(1).
    
    
15. 15.Quintanilha D, Moura E, Xavier D. Hospitalisations in Brazil: an ecological time series analysis of the impact of medical decision support data as an exogenous variable. BMC Public Health. 2025 Aug 18;25(1):2827.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=40826054&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

16. 16.Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997 Nov 1;9(8):1735–80.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1162/neco.1997.9.8.1735&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=9377276&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1997YA04500007&link_type=ISI) 

17. 17.Instituto Brasileiro de Geografia e Estatística. Divisão Regional do Brasil em Regiões Geográficas Imediatas e [Internet]. [cited 2025 Oct 16]. Available from: [https://www.ibge.gov.br/](https://www.ibge.gov.br/)
    
    
18. 18.Ministério da Saúde. DATASUS – Departamento de Informática do Sistema Único de Saúde [Internet]. 2025 [cited 2025 Oct 19]. Available from: [https://datasus.saude.gov.br/](https://datasus.saude.gov.br/)
    
    
19. 19.Conselho Federal de Medicina. Observatório Demografia Médica – Painel de Demografia Médica [Internet]. 2024 [cited 2025 Dec 2]. Available from: [https://observatorio.cfm.org.br/demografia/dashboard/](https://observatorio.cfm.org.br/demografia/dashboard/)
    
    
20. 20.AFYA. Afya Whitebook Access and Engagement Data. Internal company platform. 2025.
    
    
21. 21.Instituto Nacional de Meteorologia (INMET). BDMEP — Banco de Dados Meteorológicos para Ensino e Pesquisa [Internet]. 2025 [cited 2025 Mar 3]. Available from: [https://bdmep.inmet.gov.br/](https://bdmep.inmet.gov.br/)
    
    
22. 22.Python Software Foundation. Python: A Programming Language [Internet]. 2021 [cited 2025 Jul 18]. Available from: Python Software Foundation
    
    
23. 23.Statsmodels Developers. statsmodels.tsa.statespace.sarimax.SARIMAX [Internet]. 2025 [cited 2025 Dec 2]. Available from: [https://www.statsmodels.org/](https://www.statsmodels.org/)
    
    
24. 24.Gers FA, Schmidhuber J, Cummins F. Learning to Forget: Continual Prediction with LSTM. Neural Comput. 2000 Oct 1;12(10):2451–71.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1162/089976600300015015&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=11032042&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

25. 25.Marques-Toledo C de A, Degener CM, Vinhal L, Coelho G, Meira W, Codeço CT, et al. Dengue prediction by the web: Tweets are a useful tool for estimating and forecasting Dengue at country and city level. PLoS Negl Trop Dis. 2017 Jul 18;11(7):e0005729.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28719659&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2026%2F01%2F15%2F2026.01.12.26343977.atom) 

26. 26.Codeco C, Coelho F, Cruz O, Oliveira S, Castro T, Bastos L. Infodengue: A nowcasting system for the surveillance of arboviruses in Brazil. Rev Epidemiol Sante Publique. 2018 Jul;66:S386.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.respe.2018.05.408&link_type=DOI)

 [1]: /embed/graphic-4.gif
 [2]: /embed/graphic-5.gif