Recurrent Neural Reinforcement Learning for Counterfactual Evaluation of Public Health Interventions on the Spread of Covid-19 in the world

As the Covid-19 pandemic surges around the world, there is an urgent need to forecast the expected number of cases worldwide and the duration of the pandemic before it recedes, and to implement public health interventions that significantly curb the spread of Covid-19. The statistical and computational methods most widely used for modeling and forecasting the trajectory of Covid-19 are epidemiological models. Although these models are useful for estimating the transmission dynamics of epidemics, their prediction accuracy is quite low. As an alternative to epidemiological models, reinforcement learning (RL) and causal inference are emerging as powerful tools for selecting optimal interventions for worldwide containment of Covid-19. We therefore formulated real-time forecasting and evaluation of multiple public health interventions as off-policy evaluation (OPE) and counterfactual outcome forecasting problems, and integrated RL with recurrent neural networks (RNN) to explore public health intervention strategies for slowing the spread of Covid-19 worldwide, given historical data that may have been generated by different public health intervention policies. We applied the developed methods to real data collected from January 22, 2020 to June 28, 2020 for real-time forecasting of the confirmed cases of Covid-19 across the world. We forecast that the number of laboratory-confirmed cumulative cases of Covid-19 will pass 26 million by August 14, 2020.


Introduction
As of June 30, 2020, Covid-19 had spread to 213 countries and global confirmed cases had passed 10,475,817, including 511,251 deaths, causing an immense public health crisis. Government officials and people around the world have implemented various nonpharmaceutical interventions to slow the spread of Covid-19 [1]. These public health interventions include cessation of public gatherings, traffic restrictions, stay-at-home orders, closures of schools and nonessential businesses, face mask ordinances, social distancing, quarantine, isolation and expanded virus testing. However, implementing public health interventions causes substantial economic losses and social damage. The critical question now is how to reopen the economy while containing the Covid-19 pandemic. A key to answering this question correctly is to reconstruct the complex epidemic dynamic systems from the data, precisely predict the extent and duration of Covid-19, develop algorithms to evaluate the effects of public health interventions on the transmission dynamics of Covid-19, and devise practical, implementable public health interventions to control the spread of Covid-19 in the world.
Widely used statistical and computational methods for modeling Covid-19 simulate the transmission dynamics of epidemics to understand their underlying mechanisms, forecast the trajectory of epidemics, and assess the potential impact of public health measures on curbing the spread of Covid-19 [2][3][4][5][6][7][8]. The Covid-19 Forecast Hub collected 48 models for Covid-19 forecasts [9]. The majority of these models are epidemiological models. Although these epidemiological models are useful for estimating the dynamics of transmission, they have some critical limitations [10,11]. First, most epidemiological models assume that the reproduction number is constant. However, in the real world, the reproduction number is affected by various interventions and factors such as lockdown of the epidemic areas, travel restrictions, population mobility, social distancing, and climate factors [12]. Therefore, the reproduction number R often changes over time. Assuming that the parameters in the model are constant dramatically limits our ability to simulate interventions and improve prediction accuracy. Second, the epidemiological models consist of ordinary differential equations that have many unknown parameters and depend on many assumptions. Most analyses used hypothesized parameters, which often leads to a poor fit to the data. Third, successful public health intervention planning depends heavily on the identifiability of the model parameters. However, some researchers have shown that the parameters in complex compartmental dynamic models are unidentifiable [13]: the values of the parameters cannot be uniquely determined from the real data [14], and the variances of the estimators of these parameters are very high. Fourth, the intervention measures are not explicitly included in the epidemiological models. These models lack mechanisms to evaluate the actual effects of public health interventions on infection rates in the ongoing Covid-19 pandemic [2].
medRxiv preprint doi: https://doi.org/10.1101/2020.07.08.20149146; this version posted July 10, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
An essential step toward overcoming these limitations is to explicitly incorporate counterfactual evaluation mechanisms into the models. Reinforcement learning (RL) and counterfactual outcome forecasting can be used as a general framework for evaluating the dynamic response of Covid-19 to intervention measures and for optimizing the intervention strategy [15][16][17][18][19][20][21][22]. RL learns actions or interventions. It arises from solving optimal control problems of partially observed Markov Decision Processes by learning an intervention policy [23]. The control problem consists of identifying the dynamic system and designing the optimal control. We can view the transmission dynamics of Covid-19 as a dynamic system or Markov Decision Process. A typical dynamic system is usually modeled by nonlinear state space equations, which can in turn be transformed into recurrent neural networks (RNN) [24]. The RNN is an ideal tool for learning a partially observed Markov Decision Process. After the dynamic system or Markov Decision Process is learned from historical data, we can use RL or optimal control theory (dynamic programming for a discrete system or Pontryagin's maximum principle for a continuous system) to infer the control signals or actions that drive the system to the desired state [25]. RL provides a wealth of information about the consequences of actions, that is, information about cause and effect.
The goal of public health interventions is to contain Covid-19 as soon as possible. However, the set of actions or health interventions available for stopping the spread of Covid-19 is limited. The environments that determine the transition dynamics of Covid-19 may change rapidly over time, and future environments may be substantially different from previous ones. The actions or interventions therefore cannot be inferred from the historical data alone, and fully designing optimal actions or interventions within RL may not be feasible. We therefore formulated real-time forecasting and evaluation of multiple public health interventions as off-policy evaluation (OPE) and counterfactual outcome forecasting problems within the RL framework, where the aim is to estimate the response to a new public health intervention policy, given historical data that may have been generated by different public health intervention policies [26].
We interpreted the interventions as treatments, where multiple interventions were implemented at different time points, and the number of new cases as the treatment response. Accurate estimation of the effects of public health interventions over time would allow health officials to plan which intervention strategies to use and when to implement them [27].
Public health interventions including virus testing, isolation and contact tracing, travel restriction, strict self-quarantine for families, maintaining social distancing, stopping mass

RNRL as a framework for modeling and evaluating the effect of the interventions on the spread of Covid-19
The Markov Decision Process (MDP) is the theoretical framework for RL. RL has three components (state, action and reward) and consists of system identification and optimal control design [28].
The RNRL combines RL with RNN [23]. RL can be viewed as an open dynamic system with a corresponding reward function (or loss function). The dynamic system can be a discrete-time or continuous-time dynamic system. Here we focus on discrete-time dynamic systems and partially observed MDPs.
The system is described by
$h_{t+1} = f(h_t, a_t)$, (1)
$y_t = g(h_t)$, (2)
where equation (1) is the system equation, equation (2) is the observation equation, and $f$ and $g$ are two nonlinear functions. System equation (1) states that the next hidden state $h_{t+1}$ is transitioned from the current hidden state $h_t$ and influenced by the current action or intervention $a_t$.
The corresponding reward function is defined as $R: \mathcal{A} \to \mathbb{R}$, which is a function of the current action. The reward at time $t$ is defined as $r_t = R(a_t)$. Since the current reward may make only a small contribution to the total reward in the long run, an accumulated reward over time with a possible discount factor $\gamma \in [0,1]$ is defined as $\sum_{t} \gamma^{t} r_t$. The MDP and agent (learner) generate a sequence: $h_0, a_0, r_1, h_1, a_1, r_2, \ldots$. RL consists of two learning steps: (1) system identification and (2) optimal intervention policy learning. The reward functions in the two steps are different.
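The accumulated reward with a discount factor can be illustrated with a short Python sketch. It assumes a finite sequence of rewards and the common convention that the reward at step t is weighted by the discount factor raised to the power t; the function name `discounted_return` and the numbers are hypothetical and serve only as an example.

```python
def discounted_return(rewards, gamma):
    """Accumulated reward sum_t gamma**t * r_t for a finite reward sequence.

    Assumption: rewards[0] is weighted by gamma**0, rewards[1] by gamma**1, etc.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Work backwards so each step needs one multiplication and one addition.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Made-up daily rewards (e.g. numbers of new cases), for illustration only.
example_rewards = [100.0, 80.0, 60.0]
print(discounted_return(example_rewards, 0.5))  # 100 + 0.5*80 + 0.25*60 = 155.0
```

With a discount factor of 1 all time points count equally, whereas a discount factor of 0 keeps only the first reward.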

Reward function for system identification
System identification serves two purposes. First, since the dynamics of Covid-19 are partially observed, the hidden states must be estimated from the historical data. Second, to learn the optimal control (intervention) policy, we need to identify the system underlying the dynamics of Covid-19. System identification thus serves as the basis for the second step, optimal intervention policy learning. For convenience of discussion, equation (2) is modified, and our goal is to minimize the resulting reward (loss) function, where $a = [a_0, a_1, \ldots, a_{T-1}]$ are estimated from the data and the nonlinear functions are implemented by

Reward function for optimal intervention policy learning
Inferring the optimal intervention (control) policy depends on the model identified in the previous step. In the second step, we search for an optimal intervention (control) policy that minimizes the cumulative number of cases or the number of deaths. Therefore, the reward function at time $t$ is defined through the number of new cases; in other words, we want to make the number of new cases at time $t$ as small as possible.
Let $\pi$ be the action selection policy, which determines the model's next action $a_t$. The action selection policy depends on the hidden state, the observed data and the covariates. We attempt to minimize the reward function.

RNN for system identification
System identification learns a model of the dynamics underlying Covid-19 from available historical data. The historical data include the number of cases (new or cumulative) $y_t$, covariates $x_t$ such as age, sex and race, and the action or intervention $a_t$. The model captures the main developments of the underlying system and explains the evolution of the system beyond the observed data region. Recurrent neural networks (RNN) are a powerful tool for system identification [29]. The RNN can learn the complex dynamics within the temporal ordering of the input Covid-19 time series and uses an internal memory to retain information from earlier time steps.
The RNN has two types of inputs and outputs: (1) internal input and output and (2) external input and output (Figure 1). The internal output of the RNN can be viewed as the "system state" $h_t$, which is passed to the next time step. An RNN cell receives a prior internal state $h_{t-1}$ and a current external input (the number of cases $y_t, \ldots, y_{t-k+1}$, the action (intervention) $a_t$ and the covariates $x_t$), and generates a current internal state $h_t$ and a current external output $y_{t+1}$ (the number of cases) at time $t+1$. The RNN model takes a time series as input (the past history of the number of cases of Covid-19 over time) and predicts the future response time series (the number of cases of Covid-19 in the future under a planned sequence of interventions).
Define the input vector as $Y_t = [y_t, \ldots, y_{t-k+1}]^T$.
The RNN models the state transition and the output equation of the dynamic system underlying Covid-19 as follows:
state transition: $h_t = f_h(W_{hh} h_{t-1} + W_{hy} Y_t + W_{ha} a_t + W_{hx} x_t + b_h)$, (8)
output equation: $y_{t+1} = f_y(w_{yh}^T h_t + b_y)$,
where $W_{hh}$ is an $l \times l$ dimensional weight matrix that connects the previous state to the current state, $x_t$ is an $m$ dimensional vector of covariates, $b_h = [b_h^1, \ldots, b_h^l]^T$ is an $l$ dimensional bias vector that corrects the bias, and $f_h$ is an element-wise nonlinear activation function. $w_{yh}$ is an $l$ dimensional weight vector, $f_y$ is an activation function and $b_y$ is the bias of the output neurons.
In summary, using the RNN to identify the system underlying the dynamics of Covid-19 can be formulated as an optimization problem in which $\hat{a}_t^{(j+1)}$ is the $(j+1)$th iteration of the intervention measure at time $t$, the activation function is nonlinear, the associated weight matrix is $1 \times l$ dimensional, and the parameters are the weight matrices and bias vectors. The minimization problem is solved by backpropagation and forward dynamic programming [27]. The detailed training algorithm is summarized in Supplementary Note A.
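As an illustration of the state transition and output equation described above, the following minimal Python sketch evaluates a single RNN time step. The tanh hidden activation, the identity output activation, the stacking of lagged case counts, intervention and covariates into one input vector `u`, and all names (`rnn_step`, `W_hu`, `w_yh`) are assumptions made for this example; they are not taken from Supplementary Note A.

```python
import math


def rnn_step(h_prev, u, W_hh, W_hu, b_h, w_yh, b_y):
    """One step of a simple (Elman-style) RNN cell.

    h_prev : previous hidden state, list of length l
    u      : external input (lagged cases, intervention, covariates stacked)
    Returns the new hidden state h_t and the scalar output y_{t+1}.
    """
    l = len(h_prev)
    h = []
    for i in range(l):
        pre = b_h[i]
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(l))    # recurrent part
        pre += sum(W_hu[i][k] * u[k] for k in range(len(u)))    # external input part
        h.append(math.tanh(pre))                                # assumed activation
    # Assumed identity output activation: a linear read-out of the hidden state.
    y_next = sum(w_yh[i] * h[i] for i in range(l)) + b_y
    return h, y_next


# Toy call with made-up weights: two hidden units, three input components.
h_t, y_next = rnn_step(
    h_prev=[0.0, 0.0],
    u=[0.5, -0.5, 2.0],
    W_hh=[[0.0, 0.0], [0.0, 0.0]],
    W_hu=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    b_h=[0.0, 0.0],
    w_yh=[1.0, 1.0],
    b_y=0.0,
)
print(h_t, y_next)
```

Using one stacked input vector with a single input weight matrix is equivalent to using separate weight matrices for the cases, the intervention and the covariates; it only shortens the example.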

RNN for learning actions
The main purpose of RL is to make the best decision from historical data. The second part of a typical RL procedure is to learn the optimal control policy (Figure 2). Learning the optimal control policy is usually formulated as an optimal control problem. If the state space is discrete, dynamic programming is used to find the optimal control policy [27]. If the state space is continuous, the Hamilton-Jacobi-Bellman (HJB) equation is used to solve the optimal control problem [29].
Choices of public health interventions are restricted by multiple political, cultural, technological and economic factors, so policy optimization is often practically infeasible. Therefore, we do not attempt to design optimal control actions.
Instead, we use off-policy methods, which evaluate or improve a policy different from the one used to generate the data, to select suitable actions (interventions) from a set of feasible actions (interventions). We propose to use RNN-based counterfactual action evaluation as a general framework for modeling and forecasting the spread of Covid-19 over time with multiple interventions [30]. A second RNN is used for learning counterfactual actions (interventions). The RNN encoder was explained in the previous section. Here, we focus on the RNN decoder. Unlike
where $y_{t+\tau}$ is defined as before, $\tau \geq 1$.
The algorithms for action (intervention) evaluation and selection are summarized in Supplementary Note A.

Data Collection
The

Data Pre-processing
Data were split into a training dataset (01/22/2020-06/21/2020) and a validation dataset (06/22/2020-06/28/2020). All the input numbers of lab-confirmed cumulative cases were pre-processed by

Minibatches, Normalization and RNRL Flowchart
The RNRL algorithm flowchart is shown in Figure S1. We first randomly picked

Forecasting Procedures
The trained RNN decoder was used to forecast the future number of new or cumulative cases of Covid-19 worldwide and for each country. Recursive multiple-step forecasting uses a one-step model multiple times: the prediction for the preceding time step, together with the intervention strategy, is used as input for predicting the following time step. For example, to forecast the number of new confirmed cases two days ahead, the number of new cases predicted by the one-step forecast and the corresponding intervention measure are used as observational input to predict day 2, which gives the two-step forecast. Repeating this process yields forecasts further ahead. The sum over all countries of the final forecasted numbers of new or cumulative confirmed cases was taken as the prediction of the total number of new or cumulative confirmed cases of Covid-19 worldwide.
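The recursive multiple-step procedure can be sketched in a few lines of Python. The one-step model below (`toy_model`) is a hypothetical stand-in for the trained RNN decoder, and the country names, histories and intervention plan are made-up values; only the feedback loop and the summation over countries reflect the procedure described in the text.

```python
def recursive_forecast(one_step_model, history, interventions):
    """Roll a one-step model forward, feeding each prediction back as input."""
    window = list(history)
    forecasts = []
    for a in interventions:  # one planned intervention value per future day
        y_next = one_step_model(window, a)
        forecasts.append(y_next)
        # Slide the input window: drop the oldest value, append the prediction.
        window = window[1:] + [y_next]
    return forecasts


def toy_model(window, a):
    # Hypothetical one-step model: cases grow by 10% per day without
    # intervention (a = 0) and stay flat under full intervention (a = 1).
    return window[-1] * (1.0 + 0.1 * (1.0 - a))


country_histories = {"A": [100.0, 110.0], "B": [50.0, 55.0]}  # made-up data
plan = [1.0, 1.0, 1.0]  # planned intervention for the next three days

per_country = {
    name: recursive_forecast(toy_model, hist, plan)
    for name, hist in country_histories.items()
}
# Worldwide forecast for each day: sum of the per-country forecasts.
world = [sum(day) for day in zip(*per_country.values())]
print(per_country, world)
```

Because every step consumes earlier predictions, forecast errors can accumulate with the horizon, which is one reason short-horizon forecasts are usually more reliable than long-horizon ones.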

Prediction accuracy of the dynamics of Covid-19 using RNRL
Accurate prediction of the transmission dynamics of Covid-19 is important for health decision making. To demonstrate that the RNRL was an accurate forecasting method, the RNRL was
To further evaluate the forecasting accuracy reliably, we reported 7-step-ahead forecasts of the number of cumulative cases of Covid-19, together with their errors, worldwide and in 15 countries in Table 1, starting from June 22, 2020. The average forecasting error was 0.0197, ranging from 0.000016 to 0.087.
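The text does not state how the forecasting error is computed. As a hedged illustration, the sketch below assumes a relative absolute error, |forecast - observed| / observed, averaged over the forecast days; the function names and the numbers are hypothetical and do not reproduce Table 1.

```python
def relative_errors(forecast, observed):
    """Relative absolute error per day (assumed error definition)."""
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    return [abs(f - o) / abs(o) for f, o in zip(forecast, observed)]


def average_error(forecast, observed):
    """Mean of the per-day relative errors."""
    errs = relative_errors(forecast, observed)
    return sum(errs) / len(errs)


# Made-up cumulative case counts for two forecast days.
observed = [100.0, 100.0]
forecast = [110.0, 95.0]
print(relative_errors(forecast, observed), average_error(forecast, observed))
```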

Table S3. The number of new cases in ten countries decreased; the average rates of decrease ranged from -175 (US) to -4 (Mexico). Although the average rates of increase in Chile and Brazil were 148.946479 and 462.9452744, respectively, the numbers of new cases there decreased quickly from the peak.

Outbreak of Covid-19 worldwide continues to grow exponentially
Although most European countries have almost stopped the spread of Covid-19 infections, outbreaks in Brazil, Chile, Russia, India, Peru, Mexico, and Pakistan are still growing fast. The spread of Covid-19 worldwide has not slowed down. The reported and forecasted curves of the number of cumulative cases of Covid-19 worldwide are shown in Figure 8. Table S4 summarizes the numbers of cumulative and new cases of Covid-19 worldwide from June 16, 2020 to August 14, 2020. We observed that the outbreak of Covid-19 worldwide is growing exponentially.
By August 14, 2020, the number of new cases of Covid-19 worldwide is forecast to increase from 192,343 to a frightening 313,380, and the number of cumulative cases of Covid-19 is

(Table 2).

Clustering Intervention Patterns of the Countries across the World
Clustering algorithms and a geographical information system (GIS) were used to analyze the intervention strategies of all 187 countries across the world. The clustering results provide information about the spread pattern of the coronavirus across countries and about how best to combat Covid-19. All 187 countries were grouped into 10 clusters by applying the k-means clustering algorithm to the intervention measure time curves of the 187 countries (Figure 9 and Table S6). Africa. Tables S7 and S8 showed that the number of cumulative cases of Covid-19 in these countries recently surged. Countries in the second and fifth clusters were less affected.
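To make the clustering step concrete, the following Python sketch applies a plain k-means (Lloyd's algorithm) to a handful of intervention measure curves. The curves are made-up, the example uses 2 clusters rather than 10, and the initialization with the first k curves is an assumption chosen to keep the example deterministic; the implementation used in the study is not specified in the text.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equally long curves."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(curves, k, n_iter=20):
    """Plain k-means on time curves; returns (labels, centers)."""
    # Assumption: the first k curves serve as initial cluster centers.
    centers = [list(c) for c in curves[:k]]
    labels = [0] * len(curves)
    for _ in range(n_iter):
        # Assignment step: each curve joins its nearest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(c, centers[j])) for c in curves
        ]
        # Update step: each center becomes the pointwise mean of its members.
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up intervention measure curves for four hypothetical countries.
curves = [
    [0.9, 0.8, 0.7],
    [0.1, 0.2, 0.3],
    [1.0, 0.9, 0.8],
    [0.0, 0.1, 0.2],
]
labels, centers = kmeans(curves, k=2)
print(labels)  # [0, 1, 0, 1]
```

Countries whose curves share a label are treated as following a similar intervention pattern over time.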

Discussion
While cases of Covid-19 still surge worldwide and the coronavirus gains steam in some countries, planning and implementing strong public health interventions are urgently needed. As an alternative to epidemiologic transmission models, we developed the RNRL method to help health officials plan public health interventions and combat the spread of Covid-19. We viewed interventions to stop the spread of Covid-19 as actions that control the states of a dynamic system, and the intervention plan as the design of an optimal control. A key step in optimal control design was identification of the dynamic system. Therefore, we integrated the identification of
In this study, we presented a new concept of intervention measure. To improve the interpretation of the intervention measure, we compared it with the reproduction number. In general, the correlation coefficients between the intervention measure and the reproduction number were high, except for the less controlled countries. The intervention measure quantifies the strength of the intervention (control action), while the reproduction number measures the state of the spread of Covid-19 being controlled, i.e., how well the spread of Covid-19 is curbed. In other words, the intervention measure quantifies how strong the action is, while the reproduction number captures the effect of, or response to, the intervention. The intervention measure is complementary to the reproduction number.
The world is at a crossroads in combating the rapid spread of Covid-19. The RNRL provides a powerful tool for fighting the surge of Covid-19 worldwide. The dynamic system consists of two essential components: the state of the system and the action taken. The evolution of the dynamic system depends heavily on the sequence of actions. Actions influencing the dynamics of Covid-19 cannot be directly measured or observed. In this report, we proposed an intervention measure to quantify the actions. The intervention measure was estimated.
The intervention measure curve characterizes the dynamics of Covid-19 and can be used to assess the stage of the spread of Covid-19 and the strength of the control. The intervention measure curves were used to cluster 187 countries into five basic groups: the well-controlled group (31 countries), being controlled group (4 countries), newly surged group (34 countries), and less affected group (119 countries). Although the number of cumulative cases of Covid-19 worldwide has passed 10 million, our analysis demonstrated that the spread of Covid-19 worldwide will eventually be stopped if the less controlled and newly surged groups of countries continue to strengthen their interventions. We are confident that the fight to contain Covid-19 will be won.
Since politics and economics strongly affect the dynamics of Covid-19, the evolutionary trajectories of Covid-19 in most countries are uncertain. The accuracy of long-term forecasting of Covid-19 may not be very high. However, the accuracy of short-term estimation of the number of new cases can be quite good. We suggest updating the data every 10 days and rerunning the RNRL to forecast the trajectory of Covid-19 over the next 15 days or one month.

Conflict of interest
We have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Many thanks to Ms. Sara A. Barton for editing and the Texas Advanced Computing Center for computation support.
Figure 9. All 187 countries were grouped into ten clusters.