An in-depth statistical analysis of the COVID-19 pandemic's initial spread in the WHO African region

Objective: To quantify the initial spread of COVID-19 in the WHO African region, and to investigate the possible drivers responsible for variation in the epidemic among member states. Design: A cross-sectional study. Setting: COVID-19 daily case and death data from the initial case through 29 November 2020. Participants: 46 countries comprising the WHO African region. Main outcome measures: We used five pandemic response indicators for each country: speed at which the pandemic reached the country, speed at which the first 50 cases accumulated, maximum monthly attack rate, cumulative attack rate, and crude case fatality ratio (CFR). We studied the effect of 13 predictor variables on the country-level variation in them using a principal component analysis, followed by regression. Results: Countries with higher tourism activities, GDP per capita, and proportion of older people had higher monthly (p < 0.001) and cumulative attack rates (p < 0.001) and lower CFRs (p = 0.052). Countries having more stringent early COVID-19 response policies experienced greater delay in arrival of the first case (p < 0.001). The speed at which the first 50 cases occurred was slower in countries whose neighbors had higher cumulative attack rates (p = 0.06). Conclusions: While global connectivity and tourism could facilitate the spread of airborne infectious agents, the observed differences in attack rates between African countries might also be due to differences in testing capacities or age distribution. Wealthy countries managed to minimize adverse outcomes. Further, careful and early implementation of strict government policies, such as restricting tourism, could be pivotal to controlling the COVID-19 pandemic. Evidently, good quality data and sufficient testing capacities are essential to unravel the epidemiology of an outbreak. We thus urge decision-makers to reduce these barriers to ensure rapid responses to future threats to public health and economic stability.


Introduction
The first confirmed COVID-19 case on the African continent occurred in Egypt on 14 Feb 2020 (Massinga et al., 2021). Following that introduction, along with others elsewhere on the continent, COVID-19 cases and related fatalities rose exponentially, eventually reaching all African countries. Whether due to its younger populations, lower rates of obesity and other comorbidities, higher rates of immune-modulating parasitic diseases, relatively low population densities and urbanization, a disproportionate lack of data on confirmed cases and deaths, or some other feature unique to African populations, countries on the continent appear to have fared better during the initial wave of the pandemic than elsewhere in the world, with lower attack rates and many orders of magnitude fewer deaths (Kousi et al., 2020; Rice et al., 2021). While the as-yet-undetermined cause of this difference in confirmed COVID-19 cases and deaths could be common across the continent, there was nevertheless heterogeneity in the evolution of the epidemic between countries (Twahirwa Rwema et al., 2020). It is important to understand the factors that may contribute to the differences among these countries, as these insights can inform policy and priorities for future public health emergencies, particularly when comparison with the rest of the world is marred by problematic biases (Salyer et al., 2021).
Motivated by this objective to examine drivers of heterogeneous COVID-19 spread across African countries, we quantified how the first wave of the COVID-19 pandemic unfolded in the 47 member states comprising the World Health Organization (WHO) African region, using data scraped from official country health ministry announcements and published daily on the WHO COVID-19 dashboard (World Health Organization Dashboard, 2020). Following this, we performed a statistical analysis using policy-related characteristics during the pandemic as well as largely pre-pandemic socioeconomic and demographic aspects specific to each country.

Study design and settings
This is a cross-sectional analysis of the COVID-19 data reported as of 29 Nov 2020 among the 47 member states comprising the WHO African region. We analysed the data gathered by the WHO on the daily number of new cases and deaths published by country health ministries through official channels. These data eventually came to be systematically published on the WHO COVID-19 dashboard (World Health Organization Dashboard, 2020) and are thus freely accessible. Groups of confirmed cases marked as 'probable' because only rapid diagnostic (viral antigen) tests were available, representing only 70 cases from Comoros and one case from the Democratic Republic of Congo (DRC), are now also considered part of the official counts and were included as such in the present study. We included the cases with information about patient outcome status (alive, recovered, or dead), excluding those with missing values. Because of the lack of data reported from Tanzania, our analysis focused on the remaining 46 countries (Figure 1).

Variables
The response indicators: For each of the 46 countries, we included five response indicators (Fig. 2A). They describe not only the severity and burden of the outbreak but also the evolution of the pandemic within each country and across the WHO African region as a whole:
1) the per-capita cumulative number of cases (cases_per_million; cases per million inhabitants);
2) the maximum monthly attack rate (max_monthly_attackrate; new cases per million inhabitants counted over consecutive 4-week intervals);
3) the crude case fatality ratio (CFR; cumulative number of deaths / cumulative number of cases);
4) the speed of the epidemic within each country (speed_first50cases), measured as the inverse of the time interval between the first case and the 50th case; and
5) the speed at which the epidemic reached each country following its first confirmation in the region (start_speed), measured as the inverse of the time delay, denoted by start_delay, between the first reported case in the region (on 25 February 2020 in Algeria) and the first case in each country:

$$\mathrm{start\_speed} = \frac{1}{\mathrm{start\_delay} + 1}$$

In the above expression, the addition of 1 in the denominator ensures that we do not divide by zero. The within-country speed of the outbreak was a priori expected to correlate with population density. Our final response indicator, concerning the regional correlation, was motivated by Li et al. (2021), representing a delay in either the exposure, testing capacity, data reporting, or some combination thereof.
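As an illustration of the definitions above, the following minimal Python sketch computes the five untransformed indicators for one country from its daily series. The function name, the argument layout, the alignment of the 4-week blocks to the first day of the series, and the handling of countries that never reach 50 cases are illustrative assumptions; they are not taken from the study's own pipeline.

```python
from datetime import date, timedelta

import numpy as np

# First reported case in the WHO African region (Algeria), as stated in the text.
REGION_FIRST_CASE = date(2020, 2, 25)


def response_indicators(first_day, new_cases, new_deaths, population):
    """Return the five untransformed response indicators for one country.

    first_day  : datetime.date of the first entry in the daily series
    new_cases  : daily new confirmed cases, one value per consecutive day
    new_deaths : daily new deaths, aligned with new_cases
    population : number of inhabitants
    Assumes the series contains at least one confirmed case.
    """
    cases = np.asarray(new_cases, dtype=float)
    deaths = np.asarray(new_deaths, dtype=float)
    cum_cases = np.cumsum(cases)
    per_million = 1e6 / population

    # 1) Cumulative attack rate: cases per million inhabitants.
    cases_per_million = cum_cases[-1] * per_million

    # 2) Maximum attack rate over consecutive 4-week (28-day) blocks.
    #    Assumption: blocks start on the first day of the series.
    block_starts = np.arange(0, len(cases), 28)
    block_sums = np.add.reduceat(cases, block_starts)
    max_monthly_attackrate = block_sums.max() * per_million

    # 3) Crude case fatality ratio.
    cfr = deaths.sum() / cum_cases[-1]

    # 4) Within-country speed: inverse of the days from the 1st to the 50th case.
    #    Assumption: NaN if 50 cases are never reached or are reached on day one.
    idx_first = int(np.argmax(cum_cases >= 1))
    idx_50th = int(np.argmax(cum_cases >= 50))
    interval = idx_50th - idx_first
    reached_50 = cum_cases[-1] >= 50
    speed_first50cases = 1.0 / interval if reached_50 and interval > 0 else float("nan")

    # 5) Arrival speed: 1 / (start_delay + 1), with start_delay in days.
    first_case_day = first_day + timedelta(days=idx_first)
    start_delay = (first_case_day - REGION_FIRST_CASE).days
    start_speed = 1.0 / (start_delay + 1)

    return {
        "cases_per_million": cases_per_million,
        "max_monthly_attackrate": max_monthly_attackrate,
        "CFR": cfr,
        "speed_first50cases": speed_first50cases,
        "start_speed": start_speed,
    }
```

The text does not state how a zero-day interval between the first and 50th case was treated, so the sketch returns NaN in that situation rather than guessing.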
Predictor variables: To explain the variation in the response variables between countries, we collected 13 predictor variables (Fig. 2B) from public data repositories (see Table S3). (In the analysis involving countries having at least 10 deaths, the mean stringency was also imputed like this.) Moreover, we used the OWID (Our World in Data, 2021) data to compute the neighbors_attack_rate for countries for which we lacked data on the total number of positive cases.
Transformations: Following data imputation, we log-transformed the skewed variables Tourism_arrivals_percap, Tourism_dollars_percap, neighbors_attack_rate, GDP_percap, Fishing_volume_percap, and Pop_density so that they could be approximated by normal distributions. We used a square root transformation for Stringency_day1. (In the analysis with only the countries having at least 10 deaths, the mean stringency (Stringency_mean) was also transformed in this way.) We also transformed the response variables so that they would approximately follow a normal distribution. We log-transformed cases_per_million, max_monthly_attackrate, and start_speed, whereas speed_first50cases was square root transformed and CFR was arcsine transformed.

medRxiv preprint doi: https://doi.org/10.1101/2021.08.21.21262401; this version posted August 27, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
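The variable transformations described in the Transformations paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the natural logarithm is used because the text does not give the base, and "arcsine transformed" is interpreted as the common arcsine square-root transform for proportions. The helper names arcsine_sqrt and transform_variables are hypothetical.

```python
import numpy as np


def arcsine_sqrt(p):
    """Arcsine square-root transform for proportions in [0, 1] (assumed form)."""
    return np.arcsin(np.sqrt(p))


# Variable names follow the text; the mapping itself is an illustrative sketch.
TRANSFORMS = {
    # Predictors
    "Tourism_arrivals_percap": np.log,
    "Tourism_dollars_percap": np.log,
    "neighbors_attack_rate": np.log,
    "GDP_percap": np.log,
    "Fishing_volume_percap": np.log,
    "Pop_density": np.log,
    "Stringency_day1": np.sqrt,
    # Response indicators
    "cases_per_million": np.log,
    "max_monthly_attackrate": np.log,
    "start_speed": np.log,
    "speed_first50cases": np.sqrt,
    "CFR": arcsine_sqrt,
}


def transform_variables(table):
    """Apply the listed transform to each known column; leave others unchanged.

    table: dict mapping a variable name to a sequence of per-country values.
    """
    out = {}
    for name, values in table.items():
        values = np.asarray(values, dtype=float)
        func = TRANSFORMS.get(name)
        out[name] = func(values) if func is not None else values
    return out
```

Note that a plain logarithm maps zeros to negative infinity; the text does not say how zero values, if any, were handled, so the sketch makes no attempt to correct for them.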

Regression analysis:
We performed a linear regression to understand the dependence of the response indicators on the five composite explanatory variables (PCA dimensions 1-5, in both the main analysis and the robustness analysis). For each response indicator, we used the glmulti function from the glmulti package (Calcagno and de Mazancourt, 2010) to select the best-fitting model among those including 0-5 of the explanatory PCA dimensions. We have provided the full summary tables corresponding to the best-fitting regression models in the Supplementary Information.
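The idea of choosing the best-fitting model among all subsets of the five PCA dimensions can be illustrated with a short exhaustive search. The sketch below is written in Python with NumPy rather than with the R package used in the study, and it ranks models by AIC as an assumed criterion, since the text does not state which information criterion was used. The function names fit_ols and best_subset are hypothetical.

```python
from itertools import combinations

import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, AIC)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the error variance
    aic = n * np.log(rss / n) + 2 * k  # Gaussian AIC up to an additive constant
    return beta, aic


def best_subset(pcs, y):
    """Search all subsets of the columns of pcs (0 to all of them) by AIC.

    pcs: array of shape (n_countries, n_components) with the PCA scores
    y  : transformed response indicator, one value per country
    Returns (selected column indices, coefficients, AIC) of the best model.
    """
    pcs = np.asarray(pcs, dtype=float)
    y = np.asarray(y, dtype=float)
    n_components = pcs.shape[1]
    best = None
    for size in range(n_components + 1):
        for subset in combinations(range(n_components), size):
            cols = np.array(subset, dtype=int)
            beta, aic = fit_ols(pcs[:, cols], y)
            if best is None or aic < best[2]:
                best = (subset, beta, aic)
    return best
```

With five components this evaluates 2^5 = 32 candidate models per response indicator, which is small enough for an exhaustive search.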

Robustness to variation in data quality:
To check the robustness of our results to data quality issues, we treated countries with fewer than 10 reported deaths as of 29 Nov 2020 as suspected of under-detection or under-reporting of cases. We therefore repeated the above analyses after removing the four countries that had reported fewer than 10 deaths (i.e., Burundi (deaths = 1), Comoros (deaths = 7), Eritrea (deaths = 0), and Seychelles (deaths = 0)) to check for qualitative differences in the results.

Patient and public involvement
Neither patients nor the public were involved in the design, conduct, reporting, or dissemination plans of our research.

An overview of the response indicators included
The COVID-19 response variables (before transformation) showed substantial variation across the 46 countries (see Figures 2A and 4A).

The main analysis
The reduced list of predictors (principal components)
Using a PCA on the 13 predictors (Figure 2B-C), we reduced the number of dimensions to five, each with an eigenvalue > 1; together they accounted for 75.7% of the variance of the dataset. Figure 3 shows the five PC dimensions, the percentage of variance explained by them, their descriptions, and their correlations with the predictors.
The first dimension, PC1, accounted for 29.4% of the total variance. PC1 was higher in countries with higher income and visitors from tourism, higher GDP, and generally older populations (p < 0.001 for all). The second dimension (PC2) accounted for 14.9% of the total variance and was significantly positively correlated with urbanization and the proportion of males, and negatively with overall population density (p < 0.001 for all). PC3 accounted for 14% of the total variance and was negatively correlated with the cumulative attack rate in neighboring countries (p < 0.001). PC4 accounted for 9.6% of the total variance. It was positively correlated with epidemic preparedness and negatively correlated with initial government response stringency (p < 0.001 for both). PC5 accounted for 7.9% of the total variance. It varied positively with per-capita fishing volume and negatively with the proportion of males and population density, though these correlations were not significant (p > 0.05).
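The dimension-reduction step described above, retaining the components with eigenvalue > 1, can be sketched as follows. This is a minimal NumPy illustration that assumes the PCA was run on standardized predictors (equivalently, on their correlation matrix), which is what the eigenvalue > 1 rule presupposes; the function name pca_kaiser is hypothetical and this is not the study's own code.

```python
import numpy as np


def pca_kaiser(X):
    """PCA on standardized variables, keeping components with eigenvalue > 1.

    X: array of shape (n_countries, n_predictors), already transformed.
    Returns (kept eigenvalues, fraction of total variance per kept component,
    component scores of shape (n_countries, n_kept)).
    """
    X = np.asarray(X, dtype=float)
    # Center each predictor and scale it by its standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)

    # eigh returns eigenvalues in ascending order; sort them descending.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # The eigenvalues of a correlation matrix sum to the number of variables,
    # so eigenvalue / n_predictors is the share of total variance explained.
    explained = eigvals / eigvals.sum()
    keep = eigvals > 1.0
    scores = Z @ eigvecs[:, keep]
    return eigvals[keep], explained[keep], scores
```

With 13 standardized predictors, an eigenvalue of 1 corresponds to 1/13, or about 7.7%, of the total variance, which is consistent with the fifth retained component explaining 7.9%.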

The main predictors associated with the response indicators
Our regression suggested a negative relationship (p = 0.052) between CFR and PC1 (composed of high GDP, high tourism, and a higher proportion of older individuals). Both cases per million (cases_per_million) (p < 0.001) and monthly attack rate (max_monthly_attackrate) (p < 0.001) were positively correlated with PC1. On the other hand, monthly attack rate showed a non-significant negative relationship with PC3 (p = 0.086), implying that it was positively associated with the cumulative attack rate of neighbouring countries. The speed at which the epidemic reached each country (start_speed) showed a positive correlation with PC2 (high urbanization) (p = 0.016) and PC4 (higher preparedness and lower initial stringency) (p < 0.001). The within-country speed (speed_first50cases) showed a positive trend with PC3 (lower neighbour attack rate) (p = 0.06; Tables 1 and S1).
The geographic distributions of the response variables as well as the PC dimensions can be seen in Figure 4.

The analysis following the removal of countries with <10 deaths
The principal components
In this case, PC1 remained the same as in the main analysis, whereas PC2 was positively correlated with urbanization (p < 0.001) and the proportion of males (p < 0.001). Further, PC3 was negatively correlated with cases per million from neighbours and positively correlated with population density (p < 0.001). PC4 was negatively associated with the stringency index on the first day (p < 0.001). PC5 showed a negative correlation with fishing volume and preparedness and a positive association with the proportion of males, with none of the correlations being significant (p > 0.05) (Figure S1).

The main predictors associated with the response indicators
As before, CFR remained negatively associated with PC1 (tourism, GDP, and older age) (p = 0.02). The monthly and cumulative attack rates were positively related to PC1 (p < 0.001). The speed of first case arrival (start_speed) was associated positively with PC2 (p = 0.042) and PC4 (p < 0.001) and showed a non-significant negative trend with PC5 (p = 0.085). Therefore, countries in which the first infection happened quickly had less stringent restrictions at that time, and also had high urbanization and a high proportion of males. The within-country speed of the pandemic, the speed to the first 50 cases (speed_first50cases), was positively related to PC3 (high population density and low infections in neighbors) (p < 0.001; see Table S2).

Major findings and trends
This study considered globalisation and geographic, demographic, and socioeconomic factors as potential determinants of the spread and severity of COVID-19 in 46 WHO AFRO countries. Our two primary findings were the following: 1) the strong positive associations of tourism, GDP, and a high fraction of older people with the attack rates; 2) the negative relationship between tourism/GDP/older age and CFR. The influence of air travel, connectivity, and tourism in the spread [...] more asymptomatic infections (Abraha et al., 2021). In fact, COVID-19 severity being higher in older populations (Abraha et al., 2021) could be another reason behind detecting more cases in countries having a higher fraction of older people. African countries were also reported to have a lower number of COVID-19 infections among children than adults (Velásquez et al., 2021), in accordance with our finding. Moreover, previous research showed that countries with higher GDP per capita have higher life expectancy at birth, which leads to a higher percentage of older people and hence to the correlation between GDP and age that we observed (Miladinov, 2020). The lower CFR in richer countries could be attributed to better healthcare systems, along with more patients with milder symptoms possibly being detected owing to greater testing capacity (Skrip et al., 2020).
We found that urbanisation is positively associated with a faster occurrence of the first infection. It has been previously recorded that large, global cities reported positive COVID-19 cases at an earlier stage, due to their strong connectedness to other areas (OECD, 2020). Additionally, we found that the speed at which the first 50 infections occurred in a country is positively related to its population density. Urban, dense areas usually produce the prerequisites for crowding, a greater number of social interactions, and close contact, which increase the risk of spread of an infectious agent (IZA, 2020; Neiderud et al., 2015; Sigler et al., 2021).
We observed that countries having less stringent measures were the ones in which the COVID-19 outbreak happened quickly. As expected, early implementation of restriction and control measures probably resulted in delaying the onset of the outbreak. This result could help the authorities to implement the necessary measures to more effectively control the upcoming surge of cases and protect public health. Previous research showed that a common characteristic among countries with delayed onset was the implementation of effective border measures and various preparedness activities at an early stage (Kousi et al., 2020; Li et al., 2021; Ayouni et al., 2021). Conversely, countries that had a higher value of the preparedness index were among the first in which the outbreak began. Perhaps countries better prepared to control an epidemic also had enhanced testing capabilities and, as a result, were able to detect and identify cases earlier. An interesting finding was the negative association between the speed at which the first 50 cases occurred in a country and the attack rate in neighbouring countries. It is possible that an outbreak in one country would alert its neighbours to take early precautions to avoid the spread of the virus in their territory. These precautions could include stricter border control and reduced mobility of people, resulting in fewer early cases. However, given the geographic complexity of the continent and the knowledge gap on the drivers of geographical diffusion between areas, further research is needed to examine this hypothesis in detail (Sigler et al., 2021).

Strengths and limitations
Our study is one of the most comprehensive studies describing the first wave of the COVID-19 pandemic in Africa, where it is less studied. Using the 5 response variables and 13 predictors we analyzed, we believe that we obtained a thorough understanding of the burden, severity, spatial trends, and evolution of the pandemic in the WHO African region. A further strength of our study is the robustness of our estimates. The analysis conducted with countries having at least 10 deaths confirmed the roles of tourism, age distribution, and GDP in the attack rates and CFR. It also established that nations with strict governmental policies had a late onset of the COVID-19 outbreak and that those with a high attack rate in neighbouring countries managed to reduce the speed at which their first 50 infections occurred.
However, there is heterogeneity across the countries of this continent, which limits our ability to draw general conclusions. One important limitation of this study is the lack of standardization in testing policies (Velásquez et al., 2021) and in the notification of cases between countries. Additionally, individual patient data corresponding to each predictor variable, where available, would be useful in obtaining a clearer insight into the pandemic situation. However, such studies are limited in the context of Africa and have addressed the impacts of only a few individual-level factors. As in other regions, confirmed cases and deaths in Africa were also subject to under-reporting (Chitungo et al., 2020; Dyer, 2021).
We transformed many of the predictors as well as the response indicators so that they are well approximated by normal distributions. Hence, our results need to be interpreted carefully. Moreover, the number of tourists arriving as well as the stringency variables were missing for a few countries. We relied on imputation on such occasions. We also recognize that data were not reported consistently for all countries for some of the indicators. For instance, we noticed that the latest reported data on some of our variables did not refer to the same year for all countries and were collected before the pandemic. For a few of them, the absolute value could have changed during the pandemic, and the proportion by which it changed could vary across countries. There are other potential variables not accounted for in our study, such as air pollution, climatic variation, economic inequality (GINI index), and diet, that could affect the attack rates, severity, and speed corresponding to COVID-19 (Rice et al., 2021; Kim et al., 2021).

Conclusions
Important evidence can be extracted from the presented research to inform decision makers about the factors to consider when designing plans to control a future outbreak effectively and rapidly in the African context. Rich countries with high tourism activity experienced a higher number of cases, indicating links between air travel, connectivity, human mobility, and viral transmission, and also suggesting roles for better testing facilities and age structure. However, since most of these nations also had better healthcare, they managed to minimize the number of deaths, resulting in their low CFRs. Urbanised areas with high population density underwent a faster increase in the number of cases at the beginning of the outbreak. Also, countries with weaker control measures faced a COVID-19 outbreak sooner. These findings stress the need for appropriate non-pharmaceutical measures at an early stage, with emphasis on densely populated areas and popular tourist destinations. Lastly, it is very important for countries to ensure robust surveillance and testing capacities, so that they can base their control strategy on data that correctly depict the epidemiological situation of the country. Good-quality data will help to better evaluate and adjust the various implemented measures during the course of the outbreak.
Figure 2. The grey color corresponds to the missing entries. For both the heatmaps, blue color represents high values whereas red color denotes low values. Each indicator (in the sets of response and predictor variables) is scaled by the standard deviation and centered by subtracting the mean before plotting. C: Correlation matrix depicting the predictor variables included in the PCA. So as to preserve the epidemiological significance, we did not exclude any predictor based on its strong correlation with other ones.
Figure 4. The strong positive association of the attack rates (cases per million and maximum monthly attack rate) with PC1 is evident from the maps.

Table 1. Main regression analysis: best-fitting regression model, along with its coefficients, for each of the five response indicators. Significance codes: '***' for p < 0.001, '**' for p < 0.01, '*' for p < 0.05, and '.' for p < 0.1.