ABSTRACT
Using data from 50 very different countries (which represent nearly 70% of world’s population) and by means of a regression analysis, we studied the predictive power of different variables (mobility, air pollution, health & research, economic and social & geographic indicators) over the number of infected and dead by SARS-CoV-2. We also studied if the predictive power of these variables changed during a 4 months period (March, April, May and June). We approached data in two different ways, cumulative data and non-cumulative data.
The number of deaths by Covid-19 can always be predicted with great accuracy from the number of infected, regardless of the characteristics of the country.
Inbound tourism emerged as the variable that best predicts the number of infected (and, consequently, the number of deaths) happening in the different countries. Electricity consumption and air pollution of a country (CO2 emissions, nitrous oxide and methane) are also capable of predicting, with great precision, the number of infections and deaths from Covid-19. Characteristics such as the area and population of a country can also predict, although to a lesser extent, the number of infected and dead. All predictive variables remained significant through time.
In contrast, a series of variables, which in principle would seem to have a greater influence on the evolution of Covid-19 (hospital bed density, Physicians per 1000 people, Researches in R & D, urban population…), turned out to have very little - or none- predictive power.
Our results explain why countries that opted for travel restrictions and social withdrawal policies at a very early stage of the pandemic outbreak, obtained better results. Preventive policies proved to be the key, rather than having large hospital and medical resources.
INTRODUCTION
COVID-19 is a global pandemic caused by SARS-CoV-2 virus1,2. The virus rapidly spread and is happening at the same time all around the world, to different people living in different countries (with different mobility, pollution, health, economic, geographic and social characteristics). Infection and death counts vary widely not only from country to country but also through time. In some countries the virus is in the remission phase, with chances of new outbreaks, while in others is yet uncontrolled.
The main feature of COVID-19 pandemic is the lack of knowledge, SARS-CoV-2 is a new coronavirus and little is known yet, despite the effort of scientists around the world. Under this setting, regression analysis is a very useful tool to find out which variables can predict number of infected and dead, and which variables cannot. Consequently, numerous studies have been performed since the outbreak of the pandemic using regression analysis 3–7, most of them focus on the clinical aspects of the pandemic and its implications on health care settings 5,8–13, such as cardiac injury, kidney disease, respiratory conditions, diabetes, comorbidities 8,9,14–19 gender and underlying comorbidities 20, severe COVID-19 condition and risk factors 21–24, number of CD8 +T cells as risk factors for the duration of SARS- CoV-2 viral positivity 25,26, clinical symptoms as smell loss 27, smoking habits 28. And not as many, on other aspects like looking into, for example, temperature as an affecting parameter to transmission rates3,29–31. Some have used GIS modelling to understand what is happening with SARS CoV-2 in this pandemic 32. Some others try to unveil the relevance of temperature, humidity, climate, air quality, socioeconomic factors, demographics…3,4,38–40,5,30,31,33–37.
The relationship of COVID-19 with socioeconomic variables has also been investigated. For example, Mollalo et al. (2020) have performed a study in the U.S.A. compiling 35 environmental, socioeconomic, topographic and demographic variables and create a geodatabase to explain the spatial variability of the COVID incidence 37. GIS-based spatial modelling studies have also been done.32 Guha et al (2020) studied community and socioeconomic characteristics of people in the United States and its possible associations with COVID-19 cases and deaths6, Whittle et al (2020) studied socioeconomic predictors across neighbourhoods in New York 7, Pirouz et al (2020) mixed climate and urban parameters for three regions in Italy 4
In this study we used a regression analysis between the number of COVID-19 cases and the number of deaths, the dependent variables; and a series of independent variables, such as Mobility, Air Pollution, Health & Research, Economic, Geographical and Social Indicators, to estimate the predictive value of those independent variables and if it changed with time. We also analysed how the predictive value of these variables changed through May and June but using non-cumulative data.
Interestingly, variables related to mobility and air pollution better explain number of infected and dead from COVID-19 than other variables that estimate countries health system’s quality or economic strength.
MATERIALS AND METHODS
We carried out a regression analysis, trying to unveil which characteristics make a country more vulnerable to Covid19.
The chosen dependent variables correspond to two of the most significant and worrisome characteristics of any outbreak: total number of infected people and total death toll. Data was collected from the World Health Organisation (WHO) Coronavirus Disease (COVID-19) Dashboard41 (retrieved from https://covid19.who.int the 21st of: March, April, May and June).
This study focuses on the predictive ability -or not- of the chosen variables, regarding infected and death counts. Calculations were performed using data in two different ways: cumulative count and non-cumulative count. Cumulative count data implies that figures from one period include previous ones. Non-cumulative count represent exclusively figures from the period of time studied (i.e. May period starts on the 21st of April and ends on the 21st May; June period starts on the 21st of May and ends on the 21st of June).
We first analysed how number of infected predicts number of deaths and its evolution through time. If there was a significant improvement in treatment protocols, its predictive value should change.
The predictive variables were selected searching for relations that may explain the differences within countries. Mobility easiness also implies an easy way to spread pathogens, we selected Inbound Tourism as an international mobility indicator. One Health speaks about Human Health, Animal Health and Environmental Health: we selected Air Pollution indicators such as CO2, NO2 and methane emissions, as they may show an influence on the disease incidence. Health indicators (e.g. Hospital Bed Density -beds per 1000- or Physicians per 1000 people, Researches in R & D). Economic indicators as different outcomes could be influenced by the countries’ wealth. Social and geographic indicators to study any pattern that explained countries differences (e.g. population, surface area, urban population…).
All data was obtained via internet repositories open to public access, such as The World Health Organization, The Central Intelligence Agency or The World Bank. The complete list of variables, as well as the link to each data set is presented in Table 1.
The territories for this study were the first 50 countries (which represent nearly 70% of world’s population) with more inbound tourism (last data available from 2018), according to the World Tourism Organization’s webpage42 (https://www.unwto.org/country-profile-inbound-tourism). In these countries, very different to each other but with lots of people entering from everywhere, the likelihood of a quick SARS-CoV-2 arrival was very high.
Regression analysis allows to establish a relation between two variables, a dependent variable yi and an independent variable xi. This linear regression follows the model: and was performed using the statistic package StatPlus:mac (AnalystSoft Inc.)
RESULTS & DISCUSSION
Before any further analysis we studied the relation between the number of dead by SARS-COV-2 (yi= dependent variable) and the number of infected people (xi= predictive variable). Among all the studied parameters, these had the highest significant regression (p<0.00001) consistently through the period studied (Cumulative March: R2=0.80; C. April: R2=0.76; C. May: R2=0.99; C. June: R2=0.85).
The coefficient of determination (R2) is the proportion of the variance in the dependent variable that is predictable from the independent variable (e.g. in C. May 99% of the deaths were explained by the number of infected people).
Furthermore, when non-cumulative data was used results were as significant and surprising as before (Non-Cumulative May: R2=0.85; N-C. June: R2=0.85). We were expecting some change as health care systems learnt how to fight the virus. This may appear as an obvious conclusion, but it gains relevance when you consider that other parameters studied (i.e. health and research indicators, country richness…) had no significance, this is in line with Walach et al (2020) findings, that none of the variables related to Health or other population parameters were predictive43. No matter what country you choose (rich, poor, with a good or not so good health system, etc.) the number of infected people is going to rule the number of deaths. This drives the idea that what really matters to fight COVID-19 pandemic is to prevent infections among their citizens rather than better health care systems or richness or development. This may be the reason behind why poorer countries with less developed health care systems did so well fighting this pandemic outbreak (e.g. Viet Nam).
We present variables divided into clearly predictive and not predictive. As interesting is to know which ones can predict as it is to know the ones that do not predict at all.
1. Variables that are clearly predictive of the covid-19 pandemic (statistically significant linear regression p < 0.05)
Inbound tourism was the second variable with greater significance. It remains significant with the number of infected people during the cumulative four-month period studied (C. March: p<0.00001, R2=0.34; C. April: p<0.00001, R2=0.42; C. May: p<0.00006, R2=0.29; C. June: p<0.0003 R2=0,17) and with non-cumulative data (N-C May: p<0.005, R2=0.15; N-C June: p<0.0003 R2=0.17). Our results are concordant with the results obtained by Aldibasi et al (2020)44. Daon et al (2020) suggest that an additional prominent reason for the especially wide and rapid spread of COVID-19 is likely to be the current high prevalence of international travel45. According to Papadopoulus et al. (2020) early international travel restrictions may help control the pandemic46, and Walch et al (2020) found that border closure had the potential of preventing cases43. For Chinazzi et al. (2020) most of the imported cases outside China have likely originated from air travel47 and Linka et al. suggest that the spread of COVID-19 in Europe closely followed air travel patterns and that the severe travel restrictions implemented there resulted in substantial decreases in the disease’s spread48. Kubota et al (2020) found that relative frequency of foreign visitors per population was positively correlated49 and Coelho et al. (2020) emphasized the role of the air transportation network in this epidemic50. Gross et al studying spatial dynamics of the COVID-19 in China found a strong correlation between the number of infected individuals in each province studied and the population migration from Hubei to those provinces51 with a slight decay along time, which agrees with our results although remaining highly significant. Tuli et al (2020) show in their model that countries with higher air passenger traffic have higher likelihood of getting higher level of infections52. Lockdown-type measures had the largest effect on limiting viral transmission, followed by complete travel ban53
Inbound tourism also has a significant association with the other dependent variable, number of dead people and stays significant with both cumulative (C. March: p<0.0003, R2=0.24; C. April: p<0.00001, R2=0.63; C. May: p<0.00001, R2=0.46; C. June: p=0.00001, R2=0.35) and non-cumulative data (N-C May: p<0.0001, R2=0.27; N-C June: p<0.00001, R2=0.35).
This means that the number of people arriving to a country is linked to number of infected or dead by SARS-CoV-2. This is concordant with the results obtained by Anzai et al.54. Merler et al. conducted a study to see the effect of travel restrictions on the spread of the virus, and they found that the travel quarantine of Wuhan delayed the overall epidemic progression by only 3 to 5 days in mainland China but had a more marked effect on the international scale, where case importations were reduced by nearly 80% until mid-February 55.
Other results that we estimated relevant was the strong regression found between infected and dead by COVID-19 and independent variables that talk about air pollution.
Electricity consumption had high statistical significance with both dependent variables, number of infected people and number of deaths, and were statistically significant with cumulative data and non-cumulative data. Other environmental indicators (CO2 emissions, NO2 emissions and methane emissions) also showed high predictive power since they had statistical significance (with p<0.001 values) Countries with more emissions and more energy usage had higher counts of infected or dead people (Results are shown on Table 2 and Table 3).
Although Travaglio et al (2020) worked only with data from England their findings are similar to ours since they conclude that the levels of some air pollutants are linked to COVID-19 cases and morbidity40.
Similarly, Liu et al. (2020) studied the association of NO2 atmospheric level over the spread of the COVID-19 in Chinese cities, suggesting that ambient NO2 may contribute to the spread of COVID-195. Yao et al. (2020) although studying the effect of other pollution indicators (PM2.5 and PM10) found that COVID-19 held higher death rates with increasing concentration of PM2.5 and PM1056.We cannot forget that pollution variables are somehow related with transport (e.g. airplanes, trains and cars)
Population and Area of the countries studied had different predictive power along the studied period. On the 21st of March (C. March), both, Population and Area had high correlation with number of infected people (p=0.00007 and p=0.0043 respectively) but on the consecutive months (C. April, C. May and C. June), while Population had no statistical significance, the Area variable retained predictive power for the number of infected people. The p values for Area were p=0.009 on C. April, p=0.0002 on C. May and p=0.0004 on C. June. Similarly, during N-C May and N-C June Population variable lost its predictive power and had no significance but Area did (N-C May: p=0.00002, R2=0,31; N-C June: p=0.00004, R2=0,30).
When we considered the relation of these two variables with the number of deaths, we found similar outputs, on the 21st of March (C. March) both had statistical significance (Population: p=0.014; Area: p=0.0053), but on the following cumulative months the Population variable was no longer significant while Area still was (p=0.04 on C. April; p=0.007 on C. May; p=0.0015 on C. June). Again, during N-C May and N-C June, Population loses its significance while Area retains it (N-C May: p=0,004; N-C June: p=0.001)). Aldibasi (2020) found a statistically significant positive association between Area(km2) and mortality rate and a negative association between Population size and incidence rate44 (Results are shown on Table 2 and Table 3).
2. Variables that are not predictive of the covid-19 pandemic (non-statistically significant linear regression p > 0.05)
None of the variables included in this group had statistical significance during the time period studied, from 21st of March to the 21st of June. The regression analysis with Hospital Bed Density (beds per 1000 people), Number of Physicians (per 1000 people) had no correlation with number of infected nor number of deaths at any moment in our study, either using cumulative data or non- cumulative data. When confronted with this new pandemic, health care systems showed little efficiency in preventing infections and death counts, due to probably lack of knowledge. We thought this should change with time, their predictive power had to grow as health care systems dealt with SARS-CoV-2, but astonishingly, after several months since the outbreak, all this relevant health parameters remained irrelevant. What proved to be more effective was early preventive measures. Lowental et al (2020) point in the same direction, countries that implemented early measures to limit population mixing limited the number of deaths, in this same study they explain that a delay of 7,49 days in initialling social distancing would double the number of deaths.57. For Papadopoulus et al (2020) early introduction of these policies are of greater importance than its stringency.46
Researches in R&D (per million people) or Infant Mortality Rate (indicator of a countries health and care system) showed similar behaviour. In the same line, Yao et al. (2020) did not find significance in the association between hospital beds per capita and COVID-19 death rate56. Sorci et al. (2020) found that the Number of hospital beds per 1,000 inhabitants was negatively associated with Case Fatality Rate 58. Qiu et al (2020) in their study in China, found that COVID-19 transmission is negatively moderated by the number of doctors at the city level59
The logic of any disease is that good health and research parameters of a country should have a positive effect over the number of infected or the number of deaths, but our results indicate no significance at all. It is necessary to highlight that the best predictor we found to death count by COVID-19 is the number of infected by COVID-19 (p>0,00001), either with cumulative data or with non- cumulative data, there is a cause-effect relation. Despite the increase in knowledge, despite improvements in health care systems… the percentage of deaths for a given number of infected could not be reduced. What it really shows is that prevention, and not treatment, is the best way to face the COVID-19 pandemic.
Unemployment rate, Inflation rate and Military expenditure, categorised as economic indicators fell also in this group of non-predictive variables. We did not find statistically significant regressions between the different economic indicators we selected and the dependent variables (number of infected and number of deaths). Richness of a country does not predict its infected or dead by COVID- 19, with the exception of those that showed easiness to travel across the country (i.e. Inbound tourism). Yao et al. among other variables they also studied Gross Domestic Product (GDP) per capita as a measurement of a country’s wealth and in their work, GDP per capita was found not to have significance in the association with COVID-19 death rate56. According to Qiu et al. (2020) COVID- 19 transmission is positively moderated by GDP per capita59
In our study, Urban population (%) and Population Growth rate were also not statistically significant in relation with number of infected people and number of deaths by SARS-CoV-2.
Government expenditure on education (as % of GDP) had some correlation with number of infected and number of deaths by SARS-CoV-2 on the 21st of March but not the following two months.
The Population ages 65 and above in our study had no relation with the number of infected on the 21st of March, gained significance on the 21st of April (p=0.06) and went back on the 21st of May to be not statistically significant. When we studied Population ages 65 and above against number of deaths it showed no correlation in March or April, but it did in May (p=0.09).
Messner also studied the effect of education and countries’ median age on the COVID-19 outbreak and found that the quality of a country’s education system is positively associated with the outbreak and that countries with an older population are more affected60
Shagam (2020) using multivariate linear regression found significant correlation between incidence and mortality rates with GDP per capita (p = 2.6 × 10−15 and 7.0 × 10−4, respectively), country-specific duration of the outbreak (2.6 × 10−4 and 0.0019), fraction of citizens over 65 years old (p = 0.0049 and 3.8 × 10−4) and level of press freedom (p = 0.021 and 0.019). Table 4 and table 5 can be found in complementary material.
Statistically, with these 50 countries (nearly 70% world’s population) number of dead depends only on number of infected according to:
With α values ranging from -35423 to 6197 and β from 0.02910 to 16.35. The best predictor to death count by SARS-CoV2 was the number of infected (p>0,00001), either with cumulative data or with non-cumulative data, there is a cause effect relation.
The best predictive variable was inbound tourism, and it did not change throughout the time period studied, it retained its predictive strength using cumulative data or non-cumulative data.
Air pollution predictive variables are of great significance, their ability to predict number of infected and dead should make us think about the importance of One Health. Environmental health and Animal health are essential to Human health.
Despite every effort done, the increase in knowledge, the improvements implemented in health care systems, etc. yet is not enough. None of the health & research indicators had significance at all during the four-month cumulative data regressions, nor when non-cumulative data was used. As stated previously, what it really shows is that prevention, and not treatment, is the best way to confront SARS-CoV-2 outbreak, and by extension any future epidemics that may arise unknown to medicine. Some lessons must be learnt from this pandemic. Those countries that conducted more tests to isolate infectious promptly and impose to their citizens policies to prevent the spread of the virus at early stages were, by far, least affected.
According to Loewenthal et al. (2020), to reduce the number of deaths is essential to implement confinement policies at a very early stage. If social distancing is not adopted, death incidents are doubled every 7.49 days.57 Piguillem et al. using Italy’s early figures to calibrate their model they estimate that without lockdowns there could be as much as 800, 000 fatalities. With a very early intervention that number reduces to about 120 – 230 fatalities, while if the intervention is started 60 days later, the number of fatalities is between 7, 200 and 9, 00061. Italy had, as to 21st of June, 238,275 infected people and 34,610 dead according to the World Health Organisation COVID-19 dashboard (https://covid19.who.int).
Early detection of cases through surveillance and aggressive contact tracing around known cases has helped to contain spread of the outbreak in Singapore. Together with other healthcare, border and community measures, they allow the COVID-19 outbreak to be managed without major disruption to daily living. Countries could consider these measures for a proportionate response to the risk of COVID-1962.
Our study shows, with very high significance, what variables predict SARS-CoV- 2 infected and dead counts In order to minimise the impact of future unknown pathogens arising, is paramount that we learn the lesson from this pandemic. Prevention is the key.
Data Availability
All Data and links are available in the text
Competing Interest Statement
The authors have declared no competing interest.
ACKNOWLEDGEMENTS
We thank to Dr. C. García-Balboa, H. Díaz-Alejo, P. Martinez-Alesón for their help and advice on this paper.