Abstract
How do we forecast an emerging pandemic in real time in a purely data-driven manner? How can we leverage rich heterogeneous data based on various signals such as mobility, testing, and/or disease exposure for forecasting? How do we handle noisy data and generate uncertainties in the forecast? In this paper, we present DeepCovid, an operational deep learning framework designed for real-time COVID-19 forecasting. DeepCovid works well with sparse data and can handle noisy heterogeneous data signals by propagating the uncertainty from the data in a principled manner, resulting in meaningful uncertainties in the forecast. The framework also includes modules for both real-time and retrospective exploratory analysis to enable interpretation of the forecasts. Results from real-time predictions (featured on the CDC website and FiveThirtyEight.com) since April 2020 indicate that our approach is competitive among the methods in the COVID-19 Forecast Hub, especially for short-term predictions.
1 Introduction
Motivation
The devastating impact of the currently unfolding global COVID-19 pandemic has sharply illustrated our enormous vulnerability to emerging infectious diseases. Forecasting disease trajectories is a non-trivial and important task. Estimating various measures related to the epidemic gives policymakers valuable lead time to plan interventions and optimize supply chain decisions. Hence, accurate forecasts are critical in combating epidemic outbreaks including the current pandemic (Holmdahl and Buckee 2020).
To encourage research in this space and to provide a unified forecasting platform, the US Centers for Disease Control and Prevention (CDC) is currently organizing a collaborative forecasting task under the umbrella of the COVID-19 Forecast Hub. The forecasting targets include various COVID-related metrics including mortality and hospitalizations at various temporal and spatial resolutions. The initiative has attracted submissions from more than 58 teams from industry and academia (as of Sept 2020).
The majority of the participating approaches can be classified into two categories: (i) statistical and (ii) mechanistic. The mechanistic approaches model the underlying disease transmission over the population contact network under various assumptions using ordinary differential equations (Zhang et al. 2017) and/or agent-based models (Venkatramanan et al. 2018). While very valuable for long-term ‘what-if’ scenario generation, mechanistic approaches face several challenges in real-time forecasting of an emerging pandemic. The first drawback is that the complexity of the model rises with the number of data sources being used. To manage this complexity, most models use only one or two data signals (such as weather) along with the observed disease incidence (Shaman, Goldstein, and Lipsitch 2010). Hence, it is not trivial to extend these models to include various other data signals (like social media data).
Figure 1: Schematic of the DeepCovid framework for real-time COVID-19 forecasting. The data module is dedicated to data pre-processing, including imputation of missing values and aggregation at the right temporal and spatial resolution. The prediction module generates probabilistic forecasts based on the curated data. Finally, the explainability module (with interface) allows both real-time and retrospective analysis of forecasts to build an intuitive explanation of them.
On the other hand, statistical approaches are fairly new in this space. They exploit correlations between various data sources and the forecast targets to learn a functional dependence between the two, and use the learnt function to make predictions (Yuan et al. 2013). There are several challenges in designing statistical approaches as well, as we discuss later (see Sec 4). Most of the existing statistical approaches for epidemic forecasting (usually developed for influenza forecasting, including work by the authors (Adhikari et al. 2019)) cannot be readily adapted for COVID forecasting due to several issues such as lack of data and poor data quality. Indeed, there is hardly any work describing a completely data-driven approach for real-time forecasting.
Our goals
In this paper, we describe our framework DeepCovid (see Fig. 1), the first purely data-driven deep learning (DL) model for real-time pandemic forecasting in the COVID-19 Forecast Hub (Reich et al. 2020) and the official ensemble. Our real-time forecasts are showcased in the hub, the CDC website (CDC 2020) and the popular FiveThirtyEight website (538 2020). With this work, we aim to address a gap in the literature on purely data-driven approaches for emerging pandemics, with the following goals:
G1. Coping with heterogeneous, scarce and noisy data: A DL-based model can ingest many heterogeneous signals that are more sensitive to what is happening on the ground, without laborious feature engineering. To take full advantage of this, our framework is designed with careful consideration of the data and modeling challenges faced in robust real-time forecasting with principled uncertainty estimation.
G2. Bringing a complementary forecasting perspective: A good ensemble needs diverse perspectives (Reich et al. 2019; Ray et al. 2020). The overwhelming majority of the teams in the ensemble perform mechanistic modeling. By being the first data-driven DL method in the ensemble, DeepCovid brings a unique perspective closer to the observed data signals with minimal assumptions.
G3. Excellence in short-term forecasting: Statistical models are well known to be useful for short-term forecasting (Holmdahl and Buckee 2020), which helps to plan interventions and allocate resources. We demonstrate that DeepCovid excels in this task by comparing it to a strong baseline.
G4. Enabling communication to domain experts: Our framework gives explanations for its forecasts, which are very important for communication and interpretation by both the public and decision makers.
Contributions
Our contributions are as follows:
We propose DeepCovid, one of the first deep learning based real-time COVID forecasting frameworks, whose performance in the CDC COVID-19 Forecast Hub (since April 2020) demonstrates that it is consistently competitive, especially at short term forecasting.
We address several challenges of such real time forecasting including interpretability and uncertainty estimation in a principled fashion.
We provide valuable observations and ‘lessons learnt’ from our experience for modeling emerging infectious diseases using purely data-driven methods.
2 Related Work
Epidemic Forecasting: Modeling approaches for epidemic forecasting can be broadly categorized as mechanistic (Zhang et al. 2017; Shaman and Karspeck 2012) and statistical (Tizzoni et al. 2012; Osthus et al. 2019; Brooks et al. 2018), the latter closely related to time series analysis (Box et al. 2015; Jha et al. 2015). Past work has explored leveraging multiple sources of data, from search engine data (Ginsberg et al. 200; Yuan et al. 2013) to weather data (Shaman, Goldstein, and Lipsitch 2010; Tamerius et al. 2013; Volkova et al. 2017), for epidemic forecasting. There has been a recent spate of work on deep learning approaches that tackle the data sparsity problem by leveraging seasonal time series intra- and inter-similarity (Adhikari et al. 2019) and principled simulation (Wang, Chen, and Marathe 2019).
COVID-19 Forecasting
Other approaches adopted by the contributing models are mechanistic (Zou et al. 2020; Chinazzi et al. 2020; Baek et al. 2020) and statistical (Altieri et al. 2020; Murray et al. 2020). The official hub ensemble (Ray et al. 2020) combines forecasts from ours and other models.
3 Background
Forecasting requirements from CDC
Since April 2020, the CDC has requested probabilistic forecasts for COVID-19-associated mortality and hospitalizations at various temporal and spatial resolutions, to be used by policymakers to plan interventions and allocate resources. As mentioned before, our model has been part of the ensemble since its inception.
Forecasting Targets
The forecasting targets are:
T1. Incidence and cumulative weekly deaths: Reported incidence (new) and cumulative deaths for US states and the US overall. The data reported by Johns Hopkins University (JHU) (Dong et al. 2020) serve as the gold standard for the CDC.
T2. Incidence daily hospitalizations: Reported new hospitalizations for US states and the US overall. The CDC does not fix a gold standard for this target, but we found the data provided by the COVID Tracking Project (cov 2020c) to be the closest.
Problem formulation
We can state our real-time forecasting problem for a specific geography as follows.
Given
an observed multivariate time series of COVID-related signals $\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and corresponding values for the forecasting target $\mathcal{Y} = \{y_1, \dots, y_N\}$, where N is the size of the sequence until the current date.
Predict
the next k values of the forecasting target, i.e., $\{y_{N+1}, \dots, y_{N+k}\}$, where k = 4 for T1 (4 weeks ahead) and k = 28 for T2 (28 days ahead).
4 Our Framework: DeepCovid
DeepCovid is our framework for explainable real-time COVID-19 forecasting, which contains the three modules depicted in Figure 1: a data module, a prediction module, and an explainability module. By separating the data and prediction modules, our goal is to decouple the handling of noisy data from the learning process. For the prediction module, we use deep learning because it is flexible, scalable, and efficient, and an excellent choice for modeling non-linearities. As mentioned before, explainability is a challenge in data-driven models. However, we want to understand our forecasts and connect them with epidemiological reasons. Once we have insights about our predictions, we have a feedback loop to improve performance. Therefore, we have an explicit module for explainability, which helps to shed some light on our predictions in a dynamic situation.
4.1 Data Module
In this section, we describe our data collection and preprocessing for input to our deep neural network.
Challenges
Data pre-processing for real-time COVID forecasting brings several challenges, primarily because this is a novel, emerging scenario.
C1. Data collection: Collecting data in such a chaotic scenario is challenging because the data come from multiple sources (often collected by volunteers), arrive in different formats, some of which change over time (e.g. reported deaths from JHU), and suffer unexpected hiatuses.
C2. Selection of signals: We carefully select signals that describe the different facets of the disease spread. To enable epidemiological observations about our predictions, this selection has to be driven by an appropriate rationale.
C3. Temporal misalignment: Since the signals are collected from multiple sources, they are often temporally misaligned. Some signals present 1-2 weeks of lag due to delays in reporting from hospitals, public records and government officials; others have different temporal resolutions (days vs weeks).
C4. Spatial misalignment: Some records are reported at a specific spatial granularity, and their conversion to higher geographical levels is non-trivial.
C5. Missing values: Most prominently, our forecasting target, i.e., incidence hospitalizations, has not been reported in 11 states (CA, DC, TX, IL, LA, PA, MI, MO, NC, NV, DE).
Our approach
To address C1, we developed a data extraction program tailored to each data format, which converts them all to a standard format. For C2, we extensively searched for signals that are meaningful from an epidemiological perspective. The outcome of our search is summarized in Table 1 for the signals consistently used in our predictions. We collected the signals for 52 geographical regions: US National, the 50 US states and Washington D.C. We address the data misalignment in C3 as follows. For the lags in reporting, we downshift the signals. Our idea is to align all the signals based on their latest records, since it is safe to assume that the latest records are more indicative of future targets. To address weekly/daily inconsistency, signals require different treatment depending on their nature. For example, since the weekly hospitalization rate and CLI% ER visits are recorded as percentages, we use the same weekly value for each daily record, because transforming a percentage is not meaningful. For C4, signal 13 in Table 1 is recorded only at county level, while other signals (6 and 7) contain records at the HHS region level. To provide a state-wise forecast model, we transformed these records to state level by aggregating and de-aggregating, respectively. For this, it is crucial to consider the population of the geographic regions involved. Last, it is important to address C5 in a meaningful way. For the 11 states with missing values, we had related signals such as current hospitalizations and ICU patients. We inferred the missing values by leveraging these signals and making reasonable assumptions (such as the effect of people's recovery and death within one week).
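The alignment steps for C3 and C4 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, missing records are represented as `None`, and the population-weighted mean and proportional split are one plausible way to "consider the population", not necessarily the exact transformations used in DeepCovid.

```python
from typing import Dict, List, Optional


def shift_to_latest(values: List[Optional[float]]) -> List[Optional[float]]:
    """Shift a lagged signal so its newest record sits at the last position.

    Trailing None entries (dates not yet reported) are moved to the front.
    """
    lag = 0
    while lag < len(values) and values[len(values) - 1 - lag] is None:
        lag += 1
    return [None] * lag + list(values[: len(values) - lag])


def weekly_rate_to_daily(weekly: List[float]) -> List[float]:
    """Repeat each weekly percentage for the 7 days of that week."""
    return [value for value in weekly for _ in range(7)]


def aggregate_rate(rates: Dict[str, float], population: Dict[str, int]) -> float:
    """Population-weighted mean of county-level rates (assumed weighting)."""
    total = sum(population[county] for county in rates)
    return sum(rates[county] * population[county] for county in rates) / total


def disaggregate_count(region_total: float, population: Dict[str, int]) -> Dict[str, float]:
    """Split a region-level count across states in proportion to population."""
    total = sum(population.values())
    return {state: region_total * pop / total for state, pop in population.items()}
```

For instance, `shift_to_latest([1.0, 2.0, 3.0, None, None])` moves the two unreported days to the front, so that the most recent record of every signal shares the same position.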
Table 1: Overview of data signals used in DeepCovid. (ILI = Influenza-like Illness; CLI = COVID-like Illness)
More details about how we handle these challenges and detailed description of our data signals are in our appendix.
4.2 Prediction Module
Challenges
Real-time COVID forecasting is a difficult problem with challenges originating from both the data and the CDC requirements. Some of the challenges we encountered in designing the prediction module are:
C1. Data sparsity: One of the major challenges in forecasting an emerging disease, especially at an early stage, is data sparsity. Extracting enough information from the few available data points to ensure generalizable forecasts is a challenging problem.
C2. Robust point and probabilistic forecasting: Forecasting disease in real time is already challenging, and the problem is even more difficult in the presence of the data issues mentioned above. Hence, the questions we tackle in designing our framework are (1) how do we translate data noise to forecast uncertainty in a principled way? and (2) how do we ensure that our approach is robust to noise and other data issues, so that it produces reliable point and probabilistic forecasts?
C3. Temporal consistency between the forecasts: Due to data sparsity, we cannot hope to train deep networks that enforce temporal consistency, such as LSTM and GRU. So the question is how to design a neural architecture that has few enough parameters to train from sparse data while ensuring temporal consistency between the forecasts.
Our approach
See Alg. 1. When we started to participate in the CDC task (April 2020), there was only 1 month of meaningful (non-zero) data, so it was unclear how to effectively train a recurrent neural network. Hence, we opted to use a feedforward network with autoregressive inputs to incorporate short-term dependencies in the time series. To fully address C1, we also need to avoid overfitting, for which we empirically evaluated several choices for the number of layers and their sizes.
To address C2, we want to capture uncertainty from the data in a principled way. Note that we want data-based uncertainty, not model-based uncertainty (e.g. MC dropout). Thus, for each future incidence target, we obtain a single prediction per bootstrap sample, so that we obtain a set of M predictions representing uncertainty in the data (lines 7-10 in Alg. 1). The robustness of our point and probabilistic predictions is closely related to optimization. Optimizing the parameters of a neural network with sparse, noisy and heterogeneous data is challenging. In fact, we found that our optimization sometimes fell into local optima that returned absurd predictions. Hence, to improve robustness, we used batch normalization (Ioffe and Szegedy 2015) to alleviate initialization problems. In addition, we run the same optimization several times and select the run that leads to the lowest loss on the (training) data.
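The bootstrap idea can be sketched as follows. This is only an illustration under stated assumptions: a one-dimensional least-squares line stands in for the feedforward network, and the names `fit_line`, `bootstrap_predictions` and `central_interval` are hypothetical. What it shares with the description above is the structure: resample the training data with replacement, fit once per resample, and collect M predictions whose spread reflects uncertainty in the data.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept; a stand-in for the neural network."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: all inputs identical
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_predictions(xs: Sequence[float], ys: Sequence[float],
                          x_new: float, m: int = 200, seed: int = 0) -> List[float]:
    """Return M predictions, one per bootstrap resample of the training pairs."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        slope, intercept = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(slope * x_new + intercept)
    return predictions


def central_interval(predictions: Sequence[float], alpha: float) -> Tuple[float, float]:
    """Empirical central (1 - alpha) interval of the bootstrap predictions."""
    ordered = sorted(predictions)
    last = len(ordered) - 1
    return ordered[int((alpha / 2) * last)], ordered[int((1 - alpha / 2) * last)]
```

The point forecast can then be taken as the mean or median of the M predictions, and the interval endpoints feed directly into a probabilistic submission.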
To address C3, we capture temporal correlations between consecutive forecasts by using predictions from week k as part of the training data for week k + 1. In this way, we are also propagating uncertainty to future predictions (lines 4-5 in Alg. 1). This process can be regarded as a semi-supervised learning procedure because we use our model to create new labels that are used in training data for future targets.
Note: Cumulative counts are non-decreasing. Adding this restriction to our optimization problem would make it even more prone to getting stuck in local optima. Therefore, for predicting cumulative deaths, once we get incidence predictions, we convert them to cumulative ones by aggregating the point predictions. We found that this gives consistent and stable performance.
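A minimal sketch of this conversion is a running sum of the incidence point predictions added to the last observed cumulative count. The function name is hypothetical, and clipping negative incidence predictions at zero is an added assumption that guarantees the non-decreasing property; the text above does not say whether such clipping is applied.

```python
from itertools import accumulate
from typing import List, Sequence


def to_cumulative(last_cumulative: float, incidence_preds: Sequence[float]) -> List[float]:
    """Turn k-step incidence predictions into cumulative predictions.

    Negative incidence values are clipped at zero (an assumption of this
    sketch) so the resulting series can never decrease.
    """
    clipped = [max(0.0, p) for p in incidence_preds]
    return [last_cumulative + running for running in accumulate(clipped)]
```

Because the constraint is enforced after prediction rather than inside the loss, the optimization problem itself stays unchanged.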
4.3 Explainability Module
Challenges
Policymakers have been constantly placing and lifting travel restrictions and quarantine orders to balance the trade-off between the pandemic burden and economic costs. Moreover, these policies differ in each administrative region. As a result, the data signals that we collect have different meaning and usability at different times and in different regions. The challenge here is to ensure that our framework generates forecasts that are explainable with respect to the data signals, and that it enables analysis of signal strengths over different geographies and periods.
Our approach
See Alg. 2. We selected data ablation, accompanied by an interface, as the means for this purpose. It is a systematic and simple way to quantify the contribution of signals to the predictions. In Alg. 2, we train models with each group of signals removed and quantify the deviation with respect to a reference point. Then, we rank the groups of signals by this deviation, and check whether the contribution of each group is statistically significant with respect to the model with all signals (this model is denoted as s = ∅). Our reference can be of two types: (1) available ground truth values for target $y_{N+k}$ until time N (lines 8-9), useful to explain strengths of signals in the past; or (2) the mean of the predictions of s = ∅ (lines 11-12), useful to understand the role of signals in the current predictions. To test statistical significance, we use a two-sample t-test with null hypothesis H0 : E[I(s)] = E[I(∅)] (line 17). This test tells us whether removing the signals in s truly changes the predictions. If we fail to reject H0, then s is removed from the ranking output.
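The ranking and significance step can be sketched as below. This is an illustration under several assumptions, not Alg. 2 itself: the deviation I(s) is taken to be the absolute difference between each of the M predictions and the reference, the function names are hypothetical, and the p-value uses a normal approximation to the Welch two-sample t statistic (reasonable when M is large) so that the example needs only the standard library. Training the ablated models is outside the scope of the sketch; their predictions are passed in.

```python
import math
import statistics
from statistics import NormalDist
from typing import Dict, List, Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch two-sample t statistic for the difference in means of a and b."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    diff = statistics.fmean(a) - statistics.fmean(b)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_signal_groups(preds_without: Dict[str, Sequence[float]],
                       preds_full: Sequence[float],
                       reference: float,
                       alpha: float = 0.05) -> List[Tuple[str, float]]:
    """Rank signal groups by mean deviation, keeping only significant ones."""
    dev_full = [abs(p - reference) for p in preds_full]
    ranking = []
    for group, preds in preds_without.items():
        dev = [abs(p - reference) for p in preds]
        t_stat = welch_t(dev, dev_full)
        # Two-sided p-value under a normal approximation (assumes large M).
        p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
        if p_value < alpha:  # reject H0: removing the group changes predictions
            ranking.append((group, statistics.fmean(dev)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```

With few bootstrap samples, the exact t distribution should replace the normal approximation, for example through a statistics package that provides a two-sample t-test.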
Interface: To enable interaction in the process of understanding the predictions, we constructed a web-based graphical user interface using Vega, a tool for creating web-based interactive visualization designs. The user interacts with our interface (see Fig. 1) by setting which region to analyze. The system retrieves a ranking of the signals whose removal impacts the predictions the most (output of Alg. 2). Then, the user can visualize the actual predictions along with the input predictors for a selected geographical region.
Enabled Analysis: Together, Alg. 2 and the interface enable the following two-fold analysis, both parts of which drive insights that can inform our data module (feedback loop in Fig. 1).
Real-time insights: With our interface, a user can understand which signals are driving the predictive behavior (e.g. trends, slope). In addition, we can reason about what is happening with this group of signals. One important insight can be that an influential group of signals is displaying erratic or unreliable behavior. If such signals are found, we might choose not to include them in our model, or we might want to check for issues with their collection, cleaning or transformation (i.e. send them back to the data module).
Retrospective insights: Our interface also enables analysis of signal strengths in past predictions. We can evaluate how we would have done in the past had we removed some signals (using the training data available in that week). Therefore, we can understand which signals contribute positively to our performance and which contribute negatively, which ultimately informs our data module.
5 Empirical Results
We first present the metrics used to evaluate predictive performance, and then make quantitative and qualitative observations about our performance and properties of our forecasts.
Setup
All experiments are conducted on a Linux machine with 40 processors (Intel Xeon CPU E5-2698 v4 @ 2.20GHz) and 252 GB of RAM. Training and obtaining predictions is fast: a single target takes about 1.5 min. All results are based on the real-time forecasts submitted over three months (May to August 2020).
Metrics
In epidemic forecasting, predictive performance is usually measured for both the point estimates and the confidence intervals of the probabilistic distribution of predictions (Tabataba et al. 2017). Hence, we use one metric to evaluate each aspect of the predictive power of our method.
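For measuring the performance of our point estimates, we use the mean absolute percentage error (MAPE), which measures the average of the absolute percentage errors over n forecasts, i.e., $\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$, where $y_i$ is the ground truth and $\hat{y}_i$ is the corresponding point prediction. Its value describes how large the error is on average, compared with the actual value.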
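As a small illustration, the mean absolute percentage error can be computed as the average of |actual - predicted| / |actual| over all forecasts. The function name is hypothetical, the result is returned as a fraction rather than multiplied by 100, and raising an error on a zero ground-truth value is a choice made for this sketch.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when a ground-truth value is zero")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

For example, forecasting 110 and 180 against ground truth 100 and 200 misses each value by 10%, so the MAPE is 0.1.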
For measuring performance from a probabilistic perspective, we adopt the probabilistic interval performance metric used in the COVID-19 Forecast Hub, introduced in (Bracher et al. 2020). Given the central 1 − α prediction interval [l, u], the interval score Γα is computed as follows:
$$\Gamma_\alpha(y) = (u - l) + \frac{2}{\alpha}\,(l - y)\,\mathbb{1}(y < l) + \frac{2}{\alpha}\,(y - u)\,\mathbb{1}(y > u)$$
where y is the ground truth, 𝟙(·) is the indicator function, l corresponds to the α/2 quantile of the predictive distribution (lower end of the interval), and u corresponds to the 1 − α/2 quantile (upper end of the interval).
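A short sketch of the interval score of Bracher et al. (2020) follows: the width of the central 1 − α interval plus a penalty of 2/α times the distance by which the ground truth falls outside it. The function name and argument order are hypothetical.

```python
def interval_score(y: float, lower: float, upper: float, alpha: float) -> float:
    """Interval score for a central (1 - alpha) prediction interval [lower, upper].

    Lower is better: narrow intervals are rewarded, and misses are penalized
    in proportion to how far the ground truth y lies outside the interval.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    score = upper - lower
    if y < lower:
        score += (2 / alpha) * (lower - y)
    if y > upper:
        score += (2 / alpha) * (y - upper)
    return score
```

A forecast that covers the ground truth is scored by its width alone, so the metric balances sharpness against coverage.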
Questions
With the observations in this section, we aim to demonstrate that our framework DeepCovid achieves the goals introduced in Section 1. Specifically, our questions are:
Q1. Is DeepCovid able to anticipate trend changes? (G1: first goal from Section 1, G2)
Q2. Does DeepCovid capture finer grain patterns? (G1, G2)
Q3. How does DeepCovid perform in US National short-term forecasting? (G1, G3)
Q4. Does DeepCovid’s emphasis on short-term forecasting sacrifice longer-term performance? (G3)
Q5. Can DeepCovid explain its predictions to epidemiological experts for interpretation? (G4)
Recall that the main goal of the paper is to explore the utility of purely data-driven models for emerging pandemics. Questions Q1-Q4 directly address this, and Q5 is related to communication, an important asset in a forecasting model.
5.1 Observations
Observation 1 [Q1] DeepCovid is able to anticipate important changes in trends several weeks ahead
We are able to anticipate important changes in trends several weeks ahead, which suggests that we achieve goals G1 and G2 thanks to our capability of exploiting many heterogeneous signals that are sensitive to what is happening on the ground. Most notably, Fig. 2(a) shows that we predicted the value and timing of the second peak for US National three weeks early. Our method also predicted, three weeks in advance, that California, after a stable period, was going to suffer a new increase in deaths (epidemic onset) (Fig. 2(b)).
Figure 2: Two examples where we were able to anticipate an upcoming change of trend.
Observation 2 [Q2] DeepCovid is able to capture finer grained reporting patterns
One of the advantages of purely data-driven models is that they are able to capture micro patterns in time series data, which brings a complementary forecasting perspective (goals G1 and G2). In forecasting daily hospitalizations, we noticed that many regions show the following patterns: P1: a drop during weekends; P2: a rise on Monday, followed by stable values during the weekdays. DeepCovid is able to capture them, as shown in Fig. 3. In fact, our model is the only approach submitting forecasts to the CDC that captures these patterns (cdc 2020a).
Figure 3: Two examples of finer grained reporting patterns captured by DeepCovid. Note the dips in reporting on weekends.
Observation 3 [Q3] DeepCovid excels in US National short-term forecasting
Here we compare against the official ensemble of all contributing models in the COVID-19 Forecast Hub (including ours). The ensemble has been regarded as one of the best performing models by different independent assessments published on the Web. Needless to say, national-level forecasts are crucial for federal decision makers and are the most visible forecasts in national media. In Fig. 4(a), DeepCovid clearly outperforms this very strong baseline in 1- and 2-week ahead forecasts across three months. Fig. 4(b) shows the probabilistic performance of our confidence intervals. The fact that we are close to the ensemble suggests that we are propagating the uncertainties in the right way. These results show our success in goals G1 and G3.
Figure 4: (a) DeepCovid outperforms the official ensemble in US National short-term (1-2 wk ahead) forecasting in MAPE. (b) Our US National short-term confidence intervals are close to those of the ensemble in the probabilistic metric Γα with α = 0.7. (c) Our focus on short-term predictions does not compromise longer-term (1-4 wk ahead) performance in multiple geographical regions.
Observation 4 [Q4] DeepCovid does not compromise longer-term performance
In Fig. 4(c), we can see that our excellence in short-term forecasting is not compromised when we include longer-term predictions in the evaluation. We are very close to the ensemble in US National, but at state level our performance is mixed: we do well in states such as Texas (TX) and Vermont (VT), but in some other states such as California (CA) we are far from this strong baseline. This indicates that there are still open questions for forecasting at lower-level granularities, where some signals are more bursty. It is also a good example of where mechanistic models can help. This complements our success for goal G3.
Observation 5 [Q5] DeepCovid explains its predictions to epidemiological experts for interpretation
Our explainability module has been key to enabling the previously shown quantitative and qualitative performance in real-time forecasting, meeting goal G4.
For example, to predict the second US peak in Fig. 2(a), we were able to understand that mobility was the main signal driving this prediction with clear statistical significance (α = 0.05), followed by testing, whose predictions were partially statistically significant. By July, it was unclear whether mobility patterns were still capturing social distancing measures; however, the fact that testing signals (which were rapidly increasing in the US) were also contributing to predict the same peak time allowed us to have more confidence in this peak prediction. For predicting the uptrend in Fig. 2(b), using our explainability module, we found that the impact of line-list, mobility and exposure signals was statistically significant (with α = 0.05), and that removing them one at a time did not change the uptrend, which suggests that signals from different groups are being utilized to predict the uptrend. This gave us confidence in our predictions of such an important characteristic of the epidemic trajectory.
6 Discussions and Conclusion
In this paper, we introduced DeepCovid, an operational DL-driven framework for real-time COVID forecasting, whose predictions have been submitted to the CDC via the COVID-19 Forecast Hub on a weekly basis since April 2020. This was the first purely data-driven DL approach to be submitted to the COVID-19 Forecast Hub. DeepCovid exhibits interpretability, encouraging short-term and trend performance, principled uncertainty estimation, correlation between forecasts, and ingestion of several data sources, despite the chaotic and fast-moving pandemic scenario, which naturally brings several modeling and data challenges.
There are several lessons learnt from our experience apart from the observations we have already noted above. First is the lack of standard data reporting, even among traditional sources (e.g. several states do not report new hospitalizations, or report deaths in different ways, causing different lags). This means that while some artifacts can be handled statistically (lags), some mild assumptions are also needed to make meaningful and still useful predictions (as for hospitalizations). Second, due to the chaotic situation, a purely data-driven model needs to be updated regularly (weekly) to reflect the changing dynamics and data quality and to ensure good performance. Third, explainability is very important, as it can serve as a sanity check based on domain knowledge and also highlight why we are making particular forecasts at different points of the pandemic; e.g., we found that testing and mobility data signals were most important for predictive performance on average. Finally, we also found that data revisions are pervasive, and even the ground truth can be revised (see appendix), which implies that measures of performance may be unreliable until the data stabilize. As future work, we plan to extend this methodology to handle such revisions, and also to work more robustly at smaller scales (e.g. county level).
Data Availability
All data are publicly available:
https://github.com/CSSEGISandData/COVID-19
https://covidtracking.com/
https://cmu-delphi.github.io/delphi-epidata/
https://www.google.com/covid19/mobility
https://www.apple.com/covid19/mobility
Acknowledgments
This paper is based on work supported by the NSF (Expeditions CCF-1918770, CAREER IIS-1750407, RAPID IIS-2027862, Medium IIS-1955883, NRT DGE-1545362), CDC MInD-Healthcare U01CK000531-Supplement, funds from Georgia Tech Research Institute (GTRI) and funds/computing resources from Georgia Tech.
Appendix
A1: Data Module (More Details)
Detailed Description of Signals (C2)
Next, we give a brief description about what each type of signal represents, its time and spatial granularity, and rationale behind using such signals.
Line list signals (DS1): These signals are derived from records that report the who, when and where facets of an infected person; thus, they are directly related to the disease spread. This type includes the number of persons infected, hospitalized, and recovered (or deceased). Others report emergency room (ER) visits due to influenza-like illness (ILI) symptoms and COVID-like illness (CLI). Signal 8 represents the difference between the observed deaths (overall, not only COVID-related) and the expected deaths during specific time periods.
Testing signals (DS2): These signals reflect social policy (local leaders' efforts to escalate testing) and social behavior, e.g. people with fever caused by any other disease may request to be tested for COVID. Here, we have signals that report the number of tests (total and negative), and the number of emergency facilities and health care providers reporting data in DS1. The latter two are measures of the perceived importance of the disease by health care providers, as has been previously observed in influenza forecasting (Brooks et al. 2015).
Crowdsourced symptomatic signals (DS3): This signal is collected from individuals using Kinsa digital thermometers at home who present fever and influenza-like symptoms, which have a significant overlap with COVID-like symptoms. This signal is potentially an indirect measure of both reported and unreported COVID cases.
Mobility signals (DS4): These signals are collected from the record of people visits to points of interest (POI) in different regions. Google collects the daily change of visits in multiple categories of POI compared with the period January 3-February 6, 2020 (signals 16-21 in Table 1). We also collected a daily change of visits from Apple (signal 22), which indicates the relative volume of direction requests in the geographic map compared to January 13 across different US states. Mobility signals implicitly show the impact of different non-pharmaceutical interventions (NPI) and change of policies adopted by different states.
Exposure-based signals (DS5): These signals measure social contacts in different groups of individuals. The records are collected by tracking the overlapping locations of distinct smartphone devices in commercial venues. The signals are estimated both from the standard sample (signal 23) and after removing biased samples (signal 24), i.e., removing from the sample the devices that show less movement due to stay-at-home orders. The signals implicitly indicate the impact of social contacts and NPIs on COVID-19 cases.
Social media signals (DS6): Facebook collects the daily percentage of people with COVID-like illness (% CLI) and influenza-like illness (% wILI) (see signals 35-36 in Table 1) at the national level and for different US states. These percentages are estimated from voluntary surveys about disease symptoms.
Addressing missing values in incident hospitalizations (C5)
While collecting incident hospitalizations for forecasting target T2, we observed that the daily records of 11 states (CA, DC, TX, IL, LA, PA, MI, MO, NC, NV, DE) are missing. Instead, other signals are available for these states, such as the current numbers of hospitalized and ICU patients. However, using current hospitalizations as target T2 would bias our forecasts by over-estimating incidence. Hence, our goal was to find a way to estimate T2 from the current hospitalizations for these 11 states. We first tried a compartmental model based on recovered cases and deaths. However, we found this challenging with the available data because recovered cases do not only count people who were hospitalized, and deaths are reported with a different delay than hospitalizations (because they come from different sources). Since we could include neither recoveries nor deaths, we assumed that after a week only a fraction β of the current hospitalizations at time t − 7 would stay in the hospital. Therefore, we used
where t is time in days, hospinc are the incident hospitalizations, and hospcur are the current hospitalizations. We used grid search and found that β = 0.5 closely matches the ground-truth incident hospitalizations of the states that do not have missing values.
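The estimation and the grid search over β can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the relation hospcur(t) = β · hospcur(t − 7) + hospinc(t), which is one reading of the assumption stated above, and it assumes mean absolute error as the grid-search criterion. The function names, the clipping of negative estimates to zero, and the toy data are all hypothetical.

```python
def estimate_incident(hosp_cur, beta, lag=7):
    """Estimate incident hospitalizations from current hospitalizations.

    Assumed relation (not confirmed by the source):
        hosp_cur[t] = beta * hosp_cur[t - lag] + hosp_inc[t]
    Negative estimates are clipped to zero (an extra assumption).
    Returns estimates for t = lag, ..., len(hosp_cur) - 1.
    """
    return [
        max(hosp_cur[t] - beta * hosp_cur[t - lag], 0.0)
        for t in range(lag, len(hosp_cur))
    ]


def grid_search_beta(hosp_cur, hosp_inc_true, betas, lag=7):
    """Pick the beta whose estimates best match known incident values.

    Uses mean absolute error as the criterion (an assumption).
    """
    best_beta, best_err = None, float("inf")
    truth = hosp_inc_true[lag:]
    for beta in betas:
        est = estimate_incident(hosp_cur, beta, lag)
        err = sum(abs(e - t) for e, t in zip(est, truth)) / len(est)
        if err < best_err:
            best_beta, best_err = beta, err
    return best_beta


# Toy data for a state with both series available (values are made up).
hosp_cur = [20.0] * 21
hosp_inc = [10.0] * 21
candidate_betas = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
print(grid_search_beta(hosp_cur, hosp_inc, candidate_betas))
```

In practice the search would be run on the states that report both series, and the selected β would then be applied to the 11 states with missing incident records.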
A2: Additional Empirical Results
Observation 6 Mobility and testing signals were found to be the most important for predictive performance
Our explainability module also allows us to communicate to domain experts which signals are helping the most (goal G4), and to provide feedback to the data module for signal selection.
By analyzing several geographical regions over a span of three months, we noticed that the largest negative effect comes from removing mobility and testing signals, while removing line list signals had a positive effect for some states. However, each geographical region requires its own optimized set of data signals. In Table 2, when looking at individual geographical regions, the contribution to performance varies. For example, testing has a large positive contribution in US National, but a negative contribution in Texas. Therefore, each region requires a different treatment to optimize its performance based on the input data signals.
Contribution of signals in 1-4 wk ahead forecasting performance (retrospective analysis) for US National and four states. We present the t-statistic to measure the contribution (higher t-stat, higher contribution). Green indicates positive contributions, red negative contributions, and black non-statistically significant contributions.
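As an illustration of how a t-statistic can summarize a signal's contribution, the sketch below compares forecast errors obtained with and without a signal group. The text does not specify which test was used; a paired t-test over matched forecast weeks is assumed here, and the function name and error values are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_without, errors_with):
    """Paired t-statistic of (error without signal) - (error with signal).

    A positive value means that removing the signal increased the error,
    i.e., the signal contributed positively. Assumes both lists refer to
    the same forecast weeks, in the same order.
    """
    diffs = [a - b for a, b in zip(errors_without, errors_with)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n))


# Made-up weekly forecast errors for one region.
errors_without_mobility = [3.0, 4.0, 5.0, 6.0]
errors_with_mobility = [1.0, 3.0, 3.0, 5.0]
print(paired_t_statistic(errors_without_mobility, errors_with_mobility))
```

Whether a given value counts as statistically significant depends on the number of paired weeks and the chosen significance level.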
Observation 7 Importance of signals changes as the spread of the disease progresses
With our explainability module, we can communicate temporal insights about the importance of signals (goal G4).
For instance, we analyzed mobility in predictions for US National and California, two regions where mobility was found to have a positive contribution (see Table 2). We considered two periods: (1) May to June, when stay-at-home orders were lifted in most states, businesses reopened, and mobility signals increased; (2) July to August, when most mobility signals had already stabilized. We noticed a high contribution of mobility during (1) in both US National and California, and a low or non-existent contribution in (2). This observation suggests that data-driven models need to be regularly updated to reflect the changing dynamics of the disease spread.
A3: Data Revisions
Several of our data sources, especially disease surveillance data from health agencies, report an initial value that undergoes several rounds of revisions before reaching a stable value, a process that typically lasts several weeks. As also noted by Reich et al. (2019), our experiments suggest that these data revisions have an impact on real-time forecasting performance.
To illustrate this issue, we computed the revision error |vw − vs|/vs, where vw is the value at revision week w (w = 0 denotes the first release) and vs is the stable value. When we average this error over several past observations, we get a time series over revision weeks, which allows us to see how many revision weeks it takes on average for a signal to stabilize. In Fig. 5, we can see that even our ground truth target, reported death incidence (JHU), exhibits this issue, which implies that measures of performance may be unreliable until the data stabilizes.
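The averaged revision error can be computed as in the following sketch. It assumes each observation comes with its full revision history and that the last reported value is the stable value vs; the function name and the toy histories are hypothetical.

```python
def average_revision_error(histories):
    """Average |v_w - v_s| / v_s over observations, per revision week w.

    histories: list of lists; histories[i][w] is the value reported for
    observation i at revision week w (w = 0 is the first release). The
    last entry of each history is treated as the stable value v_s (an
    assumption). Observations with v_s == 0 are skipped.
    """
    n_weeks = max(len(h) for h in histories)
    averages = []
    for w in range(n_weeks):
        errors = [
            abs(h[w] - h[-1]) / h[-1]
            for h in histories
            if len(h) > w and h[-1] != 0
        ]
        averages.append(sum(errors) / len(errors) if errors else float("nan"))
    return averages


# Two made-up observations, each revised over three weeks.
histories = [
    [80.0, 90.0, 100.0],
    [50.0, 100.0, 100.0],
]
print(average_revision_error(histories))
```

The index at which the returned series approaches zero gives a rough estimate of how many revision weeks a signal needs to stabilize.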
Average revision error for ground truth reported deaths from JHU (our forecasting target). We can see a few clusters of geographical regions. Arizona (AZ) and Michigan (MI) have no or only minor data revision problems, which are resolved promptly. Montana (MO) has a large revision error, but it stabilizes rapidly. US National, Texas (TX), and New York have large revision errors and are slow to stabilize, taking up to 15 weeks.
Footnotes
arodriguezc{at}gatech.edu, jiamingcui1997{at}gatech.edu, jxie{at}gatech.edu, jho67{at}gatech.edu, pulak.agarwal{at}gatech.edu, badityap{at}gatech.edu, anikat1{at}vt.edu, bijaya-adhikari{at}uiowa.edu
↵1 Indeed, when we started participating, there were only 11 teams in the hub. We were one of the first two teams to predict hospitalizations.
↵2 https://www.hhs.gov/about/agencies/iea/regional-offices/index.html