Abstract
Background National governments have implemented non-pharmaceutical interventions to control and mitigate against the COVID-19 pandemic. A deep understanding of these interventions is required.
Objective We investigate the prediction of future daily national Confirmed Infection Growths – the percentage change in total cumulative cases across 14 days – using metrics representative of non-pharmaceutical interventions and cultural dimensions of each country.
Methods We combine the OxCGRT dataset, Hofstede’s cultural dimensions, and COVID-19 daily reported infection case numbers to train and evaluate five non-time series machine learning models in predicting Confirmed Infection Growth. We use three validation methods – in-distribution, out-of-distribution, and country-based cross-validation – for evaluation, each applicable to a different use case of the models.
Results Our results demonstrate high R2 values between the labels and predictions for the in-distribution, out-of-distribution, and country-based cross-validation methods (0.959, 0.513, and 0.574 respectively) using random forest and AdaBoost regression. While these models may be used to predict the Confirmed Infection Growth, the differing accuracies obtained from the three tasks suggest a strong influence of the use case.
Conclusions This work provides new considerations in using machine learning techniques with non-pharmaceutical interventions and cultural dimensions data for predicting the national growth of confirmed infections of COVID-19.
Introduction
Background
In response to the COVID-19 pandemic, national governments have implemented non-pharmaceutical interventions (NPIs) to control and reduce the spread in their respective countries [1–4]. Indeed, early reports suggested the potential effectiveness of the implementation of NPIs to reduce the transmission of COVID-19 [2,5–9] and other infectious diseases [10–12]. Many epidemiological models forecasting future infection numbers have therefore suggested the role of NPIs in reducing infection rates [2,6,9,13], to aid with implementing national strategies and making policy decisions. In this paper, we also include the implementation of NPIs at the national level as features in predicting the national growth of the number of confirmed infection cases. Prior work has focused on the NPI variations in different regions of specific countries [2,3,5,8,14].
Various metrics may provide different perspectives and insights on the pandemic. In this study, we focus on one: Confirmed Infection Growth (CIG), which is the 14-day growth of the cumulative number of reported infection cases. Other common metrics to measure the transmission rates of an infectious disease are the basic reproduction number, R0, which measures the expected number of direct secondary infections generated by a single primary infection when the entire population is susceptible [4,15] and the effective reproduction number, Rt [2], which accounts for the immunity within the specified population. While such metrics are typically used by epidemiologists as measures of the transmission of an infectious disease, these metrics are dependent on estimation model structures and assumptions, making them application-specific and potentially misapplied [15]. Furthermore, the public may be less familiar with such metrics, as opposed to more practical and observable metrics, such as the absolute or relative change in cumulative reported cases.
Related Work
Mathematical modelling of the transmission of infectious disease has been a common method to simulate infection trajectories. A common technique for epidemics is the SIR model, which separates the population into three sub-populations (susceptible, infected, and removed) and iteratively models the interaction and shift between these sub-populations, which change throughout the epidemic [16,17]. Variations of this model have since been introduced to reflect other dynamics expected of the spread of infectious diseases [18–20]. These variations of the SIR model have also been applied to the ongoing COVID-19 pandemic [21–24].
Recent studies have also used various statistical and machine learning techniques for short-term forecasting of infection rates for the COVID-19 pandemic [23,25,26], using reported transmission and mortality statistics, population geographical movement data, and media activity. Machine learning has also been used for other applications to combat the pandemic, such as in-patient monitoring and genome sequencing [27–30].
Description of Study
The CIG because it is a verifiable metric that, due to its direct inference from the number of reported cases, may have a greater impact on the public’s perception of the magnitude of the pandemic compared to the actual transmission rate. CIG reflects the growth in the total number of reported cases within a country in 14 days relative to the total number of previously reported infections, including recoveries and mortalities. We emphasize that the reported number of infections may not necessarily be correlated with the actual transmission rate, due to factors such as different testing criteria and varying accessibility in testing over time.
To predict the CIG for individual countries, we deploy five machine learning models. We use features representing the implementation levels of NPIs and the cultural dimensions of each country. We obtain daily metrics of the implementation of NPIs at the national level from the Oxford COVID-19 Government Response Tracker (OxCGRT) dataset [1]. Although different countries may implement similar NPIs, some have suggested that cross-cultural variations across populations may lead to different perception and responses towards such NPIs [31–33]. We intend to capture any effects due to national cross-cultural differences by complementing the OxGCRT dataset with national cultural norm values from Hofstede’s cultural dimensions [34]. Our non-time series models predict the expected future national CIG using both NPI implementation and cultural norm features. While time-series deep learning models (e.g., RNNs or transformers) may also provide CIG predictions, such models generally require greater amounts of accurately labeled trajectory data and assume that past trajectory trends are readily available and representative of future trajectories. Instead, our non-time series models train on more granular data which does not necessarily need to be temporally concatenated into a trajectory. We also opt for less complex non-time series models due to indeterminacies in acquiring and verifying sufficient trajectory data, especially due to the lack of reliable data at the onset of the COVID-19 outbreak.
Our results suggest that non-time series machine learning models can predict future CIG according to multiple validation methods, depending on the user’s application. While we do not necessarily claim state-of-the-art performance for infection rate prediction given the rapidly growing amount of parallel work in this area, to the best of our knowledge, our work is the first to use machine learning techniques to predict the change in national cumulative numbers of reported COVID-19 infections by combining NPI implementation features with national cultural features. Our implementation uses publicly available data retrieved from the Internet and relies on the open-sourced Python libraries Pandas [35] and Scikit-Learn [36].
Methods
Data and Pre-Processing
Candidate features at the national level are extracted from three datasets for input into our machine learning models: NPIs, cultural dimensions, and current COVID-19 confirmed case numbers.
The Oxford COVID-19 Government Response Tracker (OxCGRT) provides daily level metrics of the NPIs implemented by countries [1]. This dataset categorizes NPIs into 17 categories, each with either an ordinal policy level metric ranging from (not implemented) to 3 (strictly enforced) or a continuous metric, representing a monetary amount (e.g., research funding). We limit our candidate features to the 13 ordinal policy categories, as well as 4 computed indices, which represent the implementation of different policy types taken by governments, based on the implemented NPIs. This dataset contains data starting from 1 January 2020.
To represent cultural differences across populations of different countries, the 2015 edition of Hofstede’s cultural dimensions [37] are tagged to each country. While rarely used in epidemiology studies, these dimensions have been used frequently in international marketing studies and cross-cultural research as indicators of the cultural values of national populations [38]. Because the 2015 edition of this dataset groups certain geographically neighboring countries together (e.g., Ivory Coast, Burkina Faso, Ghana, etc. into Africa West), we tag all subgroup countries with the dimension values of their group. While we recognize this is far from ideal and will likely lead to some degree of inaccurate approximation in these subgroup countries, we perform this pre-processing step to include those countries in our study. The dimension values for each country are constant across all samples. Six cultural dimensions are presented for each country/region [39]:
Power distance index: The establishment of hierarchies in society and organizations, and the extent to which lower hierarchical members accept inequality in power.
Individualism v. collectivism: The degree to which individuals are not integrated into societal groups (e.g., individual, immediate family (individualistic) v. extended families (collectivistic))
Uncertainty avoidance: Society’s tendency to avoid uncertainty and ambiguity through use of the societal disapproval, behavioral rules, laws, etc.
Masculinity v. femininity: Societal preference towards assertiveness, competitiveness, and division in gender roles (masculinity), compared to caring, sympathy, and similarity in gender roles (femininity)
Long-term v. short-term orientation: Societal values towards tradition, stability, and steadfastness (short-term) v. adaptability, perseverance, and pragmatism (long-term)
Indulgence v. restraint: The degree of freedom available to individuals for fulfilling personal desires by social norms (e.g., free gratification (indulgence) v. controlled gratification (restraint))
We extract the daily number of confirmed cases, nt, for each country from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [40]. We use a rolling average of the previous 5-day window to smooth fluctuations in nt, which may be caused by various factors, such as inaccurate case reporting, no release of confirmed case numbers (e.g., on weekends and holidays), and sudden infection outbreaks. We refer to the smoothed daily number of confirmed cases for date as .
We compute the Confirmed Infection Growth (CIG) for a specified date, τ, as:
The CIG represents the expected number of new confirmed cases from date τ − 13 to date as a percentage of the total number of confirmed infection cases up to date τ − 14.
Our goal is to predict the CIG 14 days in advance (i.e., CIGτ + 14) given information from the current date τ for each country. Candidate features available include all ordinal policy metrics and the 4 computed indices from OxCGRT, the six dimension values from Hofstede’s cultural dimensions, the CIG of the current date CIGτ, and the smoothed cumulative number of confirmed cases , for a total of 25 feature candidates. Neither the date nor any other temporal features are included.
We trim samples with fewer than 10 cumulative confirmed infection cases and with the highest 2.5 % and the lowest 2.5 % of CIGτ + 14 to remove outliers in the data. Because the lowest 2.5 % of CIGτ + 14 are all 0.0%, we remove the samples with CIGτ + 14=0.0% by ascending date.
Our data ranges from 1 April 2020 to 30 September 2020 inclusively. We exclude all countries in our combined dataset that have missing feature values. In total, our combined dataset and our experiments apply to 114 countries.
Feature Selection and Processing
We select features to input into our machine learning models from our candidate feature pool using mutual information [41]. Mutual information is a measure of the dependency between the individual feature (i.e., independent variable) and the label (i.e., dependent variable) and captures both linear and non-linear dependencies. However, mutual information does not capture multivariate dependencies nor indicates collinearity between features. To include both linear and non-linear dependencies, features are selected if they achieve a substantially non-zero mutual information, i.e., greater than 0.10. Feature selection is conducted prior to training with the training set in all validation methods. Similar feature filtering and selection techniques have been used in other machine learning applications [36,42]. The candidate features considered for input and their respective mutual information are listed in Table 1 for the in-distribution and out-of-distribution validation methods. Mutual information is also computed for each of the ten folds of the cross-validation method.
Mutual information of candidate features for the in-distribution and out-of-distribution validation methods. In the cross-validation method, the ten folds have varying mutual information. Selected features are indicated with (*) for the in-distribution method and with (†) for the out-of-distribution method.
All selected features are then normalized to the range [0, 1]using standard min-max normalization.
Model Training and Validation
We train the machine learning models with the combinations of hyperparameters listed in Table 2 [36,43–46]. We optimize the models using the mean squared error (MSE) criterion and select the model hyperparameters with the lowest mean absolute error (MAE) as the optimal configuration of a model. While the MSE heavily penalizes large residual errors disproportionately, the MAE provides an absolute mean of all residual errors [47]. The MAE of the training data acts as a measure of the goodness-of-fit of the model, while the MAE of the validation and testing data acts as a measure of the predictive performance [48].
Machine learning models and hyperparameter combinations.
To validate in-distribution and out-of-distribution, we split our samples into 70-15-15 training-validation-test sets. For cross-validation [49,50], we split our samples into 10 folds (i.e., 90-10). These three methods of validation each represent a different definition of performance for the machine learning models.
In-distribution validation
We randomly split the samples into training, validation, and test sets. Consequently, the models are trained from samples distributed across the entire date range available in our data. This is critical - it is generally expected that model performance is best when training and test data are drawn from the same distribution. Because the COVID-19 infection numbers are naturally time series, this method ensures that validation and test samples are indeed from the same distribution as training samples. Because samples are disassociated from their dates and all other known temporal features, the prediction of the validation and test samples using the training samples are unordered. This method may be applicable to use cases where the date-to-predict is expected to be in a similar distribution as the training samples, such as predicting CIGτ + 14when data up to the current date is available.
Out-of-distribution validation
While the in-distribution method may ensure that the training, validation, and test data are all sampled from the same distribution, it may not necessarily be the most practical. Generally, the goal of long-term infection rate forecasting is to anticipate for future infection rates and should not be represented as an in-distribution task, where we have trained with data from near or later than the date-to-predict. Therefore, we also validate the performance of our models by training on the earliest 70% of samples. The validation and test sets are then randomly split between the remaining 30% of samples. This setup ensures that all training samples occurred earlier than validation and testing samples and no temporal features (known or hidden) are leaked. However, due to the changing environment related to COVID-19 infections (e.g., introduction of new NPIs, seasonal changes, new research), the validation and testing distributions is likely different from that of the training set. This method may be applicable for use cases where the date-to-predict is in the far future and not all data up to 14 days prior to the date-to-predict are available.
Country-based cross-validation
As a compromise between the above two methods, we also use a cross-validation method by splitting the available countries into ten folds. The aim is to evaluate validation samples from the same date range as training samples, but not the same country trajectory. That is, only data from countries not in the validation set are included in the training set. Although the samples from the training and validation sets are therefore sampled from different distributions (i.e., different countries), we anticipate that features from Hofstede’s cultural dimensions [34] may assist in identifying similar characteristics between countries, thus reducing the disparity between the training and validation distributions. This method may be applicable in predicting the CIG of countries where associated previous data is unavailable or unreliable.
Results
Feature Selection
For both the interpolation and extrapolation training sets, we observe that most candidate features meet our requirement of a non-zero mutual information (≥ 0.10) (see Table 1).
In both training sets, the candidate features that do not meet the requirements are international travel control (0.099, 0.099), debt/contract relief (0.043, 0.053), public information campaigns (0.020, 0.023), testing policy (0.056, 0.064), and contact tracing (0.030, 0.038). Additional candidate features which do not meet the requirements for interpolation training set are workplace closing (0.098) and cancel public events (0.089). Overall, the in-distribution and out-of-distribution datasets contain 17 and 20 features, respectively.
CIGτ has the highest mutual information out of all features, suggesting similarities between the feature CIGτ and the label CIGτ + 14. Further analysis shows a correlation of r = 0.309 between CIGτ and CIGτ + 14. This may be due to similar trends in the CIG when the implementation of NPIs is consistent within a 14-day period.
Comparison of Machine Learning Models
Out of all available configurations (i.e. hyperparameter combinations) of each model, we select the model configuration with the lowest validation error and compute the test error. The parameters for these selected models are listed in Table 3. The mean training, validation, and test errors are included in Table 4, Table 5, and Table 6, respectively, for the in-distribution, out-of-distribution, and cross-validation methods. We also include the median percent error [51], which is the percentage difference of the prediction f(x(i)) and the label y(i) for each instance {x(i), y(i)}, computed as:
Hyperparameters of the optimal configuration (lowest validation MAE) for each model for each validation method.
Optimal mean absolute error and median percent error for in-distribution validation method. The model indicated with the lowest test MAE is indicated with (*).
Optimal mean absolute error and median percent error for out-of-distribution validation method. The model indicated with the lowest test MAE is indicated with (*).
Optimal mean absolute error and median percent error for cross-validation method. (Validation error is equivalent to test error for cross-validation.) The model indicated with the lowest test MAE is indicated with (*).
We observe that random forest regression has the lowest mean test error in the interpolation method (0.031) and AdaBoost regression has the lowest mean test errors in the extrapolation and cross-validation methods (0.089 and 0.167 respectively) (see Table 4, Table 5, and Table 6). For all models aside from ridge regression, the in-distribution method has the lowest mean test errors and the lowest median percent error.
Analysis of Best Performing Models
Intercepts near 0.0 and slopes near 1.0 are the linear calibration measures that would indicate a perfect calibration relationship between the predictions and the labels [48]. For the optimal models in all validation methods, we observe slopes close to 1.0 and intercepts close to 0.0 (see Table 7). Due to the large sample sizes, statistical significance testing indicates several slopes and intercepts as significantly different from 1.0 and 0.0, respectively. However, the small mean differences (standardized to the standard deviation, i.e., z-score) indicate these differences have no practical significance. High correlations and R2 values between the predictions and labels are observed in all three validation methods (see Figure 1, Figure 2, and Figure 3).
Linear calibration measures of the models with the lowest test MAE for each validation method.
Calibration plot between labels and predictions for the interpolation validation method, with the means of each prediction bin of size 0.25.
Calibration plot between labels and predictions for the extrapolation validation method, with the means of each prediction bin of size 0.25.
Calibration plot between labels and predictions for the cross-validation method, with the means of each prediction bin of size 0.25.
From observing the distributions of predictions and labels in all three validation methods (see Figure 4, Figure 5, and Figure 6), we see that the distributions of predictions and labels in the in-distribution method are similar. In the cross-validation method, predictions are slightly higher than the labels for the label range from 0.0 to 1.0, showing overestimation of the CIG within this range.
Distributions of test labels and predictions (N = 2847) for the interpolation validation method.
Distributions of test labels and predictions (N = 2811) for the extrapolation validation method.
Distributions of test labels and predictions (N = 19,669) for the cross-validation method.
Further analysis shows that the performance of the models varies with the values of the labels. In both the in-distribution and cross-validation methods, the test MAE is lowest for samples with labels 0.0 (see Table 8 and Table 10), followed by label range 0.0 − 0.5. In the out-of-distribution method, the test MAE is lowest for samples with labels from 0.0 − 0.5 (see Table 9). For all validation methods, the mean MAE and median percent errors also increase with label bins greater 1.0, showing a decrease in accuracy for larger CIG.
Test errors and median percent errors of label bins of size 0.5 for the in-distribution validation method.
Test errors and median percent errors of label bins of size 0.5 for the out-of-distribution validation method.
Test errors and median percent errors of label bins of size 0.5 for the cross-validation method.
Discussion
Principal Results
Our results suggest that traditional, non-time series machine learning models can predict future CIG to an appreciable degree of accuracy, as suggested by the high values and strong linear calibration relationships between the labels and predictions in all validation methods.
A comparison of our results for all validation methods suggests differences in the predictive performance of machine learning models across the varying use cases. The in-distribution method has the lowest test mean error and median percent error, which is to be expected as the test samples are obtained from the same distribution as the training samples. Intuitively, while samples in the in-distribution method are unordered (i.e., no temporal features are included), the availability of samples across the entire temporal range in the training set allows the validation and test samples to interpolate between these training samples.
The out-of-distribution method achieves a test mean error which is higher than that of the in-distribution method. This is expected, as evolving COVID-19 infection trajectories observed most countries give distributions of training samples from earlier dates that may differ greatly from those of validation and test samples from later dates (i.e., data shift), which often leads to poor generalization (e.g., over-fitting) of machine learning models.
Conversely, even though the cross-validation method contains the training and validation sets within the same date range, the cross-validation method also separates countries across these sets (i.e., the 10 folds), such that there is no country which has samples in both the training and validation sets. This difference leads to higher test mean errors and median percent errors than both other methods, which suggests that including training samples from the same country as the validation samples is more important than ensuring temporal overlap. We speculate that this occurs because the unique cultural dimensions per country may potentially act as a categorical rather than continuous features, for each country. In such cases, the cultural dimensions observed in the training set would be considered irrelevant to cultural dimensions within the validation set.
Performance also varies depending on the value of the label (see Table 8, Table 9, and Table 10), which may be due to the imbalanced frequency of training samples. That is, the rareness of samples with higher CIG compared to lower CIG in the training set may be the cause of their comparatively poorer performance.
In Figure 2 and Figure 3, we also observe constraints of the trained AdaBoost regression models. Discretization of the prediction values may be due to the low number of estimators used in the lowest mean test error configuration, as seen in Table 3. The low number of estimators in these configurations may also restrict the predictions to a maximum of 1.5 selected to the relatively lower number of samples with labels greater than 1.5 (see Figure 5 and Figure 6). Label ranges with the most samples are selected over underrepresented ranges as candidates for prediction values in the discretized AdaBoost regression models. While additional estimators in the AdaBoost regression models may result in less discrete prediction values, they may also cause over-fitting by increasing the complexity of the models.
Limitations
First, the scores in the OxCGRT and Hofstede’s cultural dimensions datasets are imprecise. NPI enforcement levels and definitions may vary even between countries with the same scores, while countries sharing similar cultural dimension scores may have unobserved differences in terms of cultural practices. Second, by predicting the CIG 14 days in advance of the current date, the models do not account for information regarding changes in NPIs between the current date and the date-to-predict. Third, the CIG is a measure of the change in the cumulative number of confirmed infections and may not necessarily be correlated with the change in the daily number of confirmed infections nor the actual transmission rate of COVID-19. For example, changing biased and/or unreliable testing policies (e.g., prioritizing high-risk patients) may lead to a misleading representation of the infection growth within the general population.
Conclusion
In this study, we train five non-time series machine learning models to predict the CIG 14 days into the future, using NPI features extracted from the OxGCRT dataset [1], as well as cultural norm features extracted from Hofstede’s cultural dimensions [34]. Together, these features enable the prediction of near-future CIG in multiple machine learning models. Specifically, we observe that random forest regression and AdaBoost regression result in the most accurate predictions out of the five evaluated machine learning models.
We observe differences in the predictive performance of the machine learning models across the three validation methods, with the highest accuracy with the in-distribution method and the lowest with the cross-validation method. These differences in performance suggest that the models have varying levels of accuracy depending on use case. Specifically, predictions are expected to have higher accuracies when existing data from the same country in nearby dates are available (i.e., in-distribution method). This enables use cases such as predicting the CIG over the upcoming 14 days from the current date. The decrease in accuracy when data from nearby dates are unavailable (i.e., out-of-distribution method) suggest weaker performance in predicting the CIG over 14 days for relatively distanced future dates. We observe the greatest decrease in performance when data from the same country is unavailable (i.e., cross-validation method). However, with all validation methods, we observe appreciable calibration measures between the predictions and labels of the test set.
Due to the rapidly growing body of work related to predicting COVID-19 infection rates, one cannot claim state-of-the-art results. However, this work provides new considerations in the use of NPIs and cultural dimensions for predicting the national growth of confirmed infections of the COVID-19 pandemic and other infectious diseases using non-time series machine learning models. These experiments also provide insight into validation methods for different applications of the models. As the availability of such data increases and the nature of the data continues to evolve, we expect that simple and straightforward models such as these may prove to be more accurate and generalizable, improving overall predictive performance.
Data Availability
All data used in this manuscript is open-sourced and readily accessible online.
Acknowledgments
Frank Rudzicz is supported by a CIFAR Chair in Artificial Intelligence.