Abstract
Data collected in the Global COVID-19 Trends and Impact Surveys (UMD Global CTIS), and data on variants sequencing from GISAID, are used to evaluate the impact of the Omicron variant (in South Africa and other countries) on the prevalence of COVID-19 among unvaccinated and vaccinated population, in general and discriminating by the number of doses. In South Africa, we observe that the prevalence of COVID-19 in December (with strong presence of Omicron) among the unvaccinated population is comparable to the prevalence during the previous wave (in August-September), in which Delta was the variant with the largest presence. However, among vaccinated, the prevalence of COVID-19 in December is much higher than in the previous wave. In fact, a significant reduction of the vaccine efficacy is observed from August-September to December. For instance, the efficacy drops from 0.81 to 0.30 for those vaccinated with 2 doses, and from 0.51 to 0.09 for those vaccinated with one dose. The study is then extended to other countries in which Omicron has been detected, comparing the situation in October (before Omicron) with that of December. While the reduction measured is smaller than in South Africa, we still found, for instance, an average drop in vaccine efficacy from 0.53 to 0.45 among those vaccinated with two doses. Moreover, we found a significant negative (Pearson) correlation of around −0.6 between the measured prevalence of Omicron and the vaccine efficacy.
1 Introduction
The Omicron variant of SARS-CoV-2 has seen a marked increase since its initial classification in November 2021 [Oo21]. In South Africa it appears to have out-competed the Delta variant [Hod21] and has rapidly spread into Europe and other regions. Preliminary observations also indicate that it might spread faster and might have higher immune evasiveness than previous variants [KK21]. While vaccination still provides a level of protection against serious disease [RHRM+21], recent results [PvSG+21, NKL+21, KST+21, LMD+21] point towards a reduced level of protection against infection, especially from 15 weeks after the second dose [ASK+21], and it is likely that the number of breakthrough infections (i.e., infections among vaccinated people) will rise with the spread of Omicron. It is also possible that the rapid spread of Omicron is not only a consequence of high transmissibility but also of immune evasiveness [LMD+21]. Some of the preliminary models [SLD+22] showed that high transmissibility in combination with high immune evasiveness could lead to a concerning health system overload [LRSC+21].
Since the spring of 2020, the University of Maryland in collaboration with Facebook has collected extensive survey data on self-reported symptoms, infection, testing, behavior and, more recently, vaccination status (UMD Global CTIS) [FLS+20, The21b]. In mid-December 2021, researchers used data from this survey concerning the Gauteng province in South Africa to define different combinations of symptoms that are associated with COVID-19 infection, and combined those with self-reported vaccination status to compare vaccine efficacy changes from a Delta-dominant period to the current Omicron-dominant period [VRAB21]. Their findings showed a measurable drop of efficacy towards infection for those vaccinated with two doses.
In this study we use self-reported confirmation of COVID-19 infection, from a subset of the UMD Global CTIS survey responses, to derive an improved proxy for COVID-19 active cases (using a Random Forest classifier) that tracks more closely the evolution of confirmed cases. We use this improved proxy for analysing prevalence and vaccine efficacy changes in South Africa as a whole, and in the Gauteng province, among those unvaccinated, partially vaccinated, and fully vaccinated. We also compute results in other countries that are currently experiencing a rise of Omicron cases, which show a significant negative correlation between the prevalence of Omicron and the vaccine efficacy.
The rest of the paper has three sections. Section 2 describes the data used and the methodology applied. Section 3 presents the results obtained when applying the methodology to the data. Finally, Section 4 discusses the implications of these results.
2 Methods
2.1 Self-reported Survey Data
Since Spring 2020, the U. of Maryland (UMD) has been running a COVID-19-related survey [FLS+20, The21b] in most countries¹, in collaboration with Facebook [Fac20, KBB+20, ATMC+21]. This survey, called the University of Maryland Social Data Science Center Global COVID-19 Trends and Impact Survey in partnership with Facebook (UMD Global CTIS), collects more than 100,000 responses daily across the world. It asks the participants questions covering, among other aspects: symptoms, habits, testing, and vaccination status. All the participants in the CTIS have declared that they are at least 18 years of age.
In this work, we use the responses to the UMD Global CTIS, to which we have access by agreement with UMD and Facebook (see Appendix D). We first curate the data by removing abnormal responses, following the approach proposed by Alvarez et al. [ÁBC+21]: We remove responses that declare to have all symptoms or that declare unusual values (greater than 100) in the quantitative questions of the survey (e.g., days of symptom duration, number of symptomatic contacts, number of people staying at the same place, etc.).
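The curation rule above can be sketched in a few lines. This is only an illustration, not the authors' code: the response layout (a `symptoms` dictionary of booleans and a `quantities` dictionary of numeric answers) and the field name `days_sick` are hypothetical.

```python
# Minimal sketch of the curation step. The response layout is an assumption:
# "symptoms" maps symptom codes to booleans and "quantities" maps
# quantitative question names to numeric answers (or None if unanswered).
MAX_PLAUSIBLE_VALUE = 100  # threshold stated in the text


def is_abnormal(response):
    """Return True if the response should be dropped during curation."""
    symptoms = response["symptoms"]
    declares_all_symptoms = bool(symptoms) and all(symptoms.values())
    has_unusual_value = any(
        value is not None and value > MAX_PLAUSIBLE_VALUE
        for value in response["quantities"].values()
    )
    return declares_all_symptoms or has_unusual_value


def curate(responses):
    """Keep only the responses that are not flagged as abnormal."""
    return [r for r in responses if not is_abnormal(r)]


responses = [
    {"symptoms": {"B1_1": True, "B1_2": False}, "quantities": {"days_sick": 3}},
    {"symptoms": {"B1_1": True, "B1_2": True}, "quantities": {"days_sick": 2}},
    {"symptoms": {"B1_1": False, "B1_2": False}, "quantities": {"days_sick": 250}},
]
clean = curate(responses)
print(len(clean))
```

In this toy input the second response declares every listed symptom and the third reports an implausible duration, so only the first one survives.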
After curating the responses, the next task we face is determining whether they correspond to active cases of COVID-19. This is somewhat direct for the subset of responses that respond affirmatively to the survey question “B7: Have you been tested for COVID-19 in the past 14 days?” and then respond positively or negatively to the survey question “B8a: Did your most recent test find that you had COVID-19?” [The21a]. For this work, we assume that a participant responding affirmatively to both questions is an active case of COVID-19 (i.e., a positive case). Similarly, a participant responding affirmatively to Question B7 and negatively to Question B8a is assumed not to be infected with COVID-19 (i.e., a negative case). This set of classified responses constitutes a ground-truth set, for which infection status (positive or negative) is available.
Unfortunately, this ground-truth set cannot be used directly to estimate the prevalence of COVID-19 in the overall population, because the set is usually very small and is not produced via uniform random sampling: People who have reason to believe they may be infected are more likely to be tested, and therefore the ratio of positives among those tested in the last 14 days (i.e., the test positivity rate, abbreviated TPR) is higher than the actual prevalence.
In order to classify the responses as positive or negative, several criteria have been proposed in the literature. In particular, we consider the following symptom-based COVID-like illness classifiers (see Appendix A for the list of symptoms collected in the survey):
UMD CLI [FLS+20, ÁBC+21]: A response is considered to be positive if it declares fever (symptom B1_1), along with cough (symptom B1_2), or shortness of breath / difficulty breathing (symptom B1_3). Otherwise, it is negative.
Stringent CLI [VRAB21]: A response is positive if it declares anosmia (symptom B1_10), combined with fever (B1_1), muscle pain (B1_6), or cough (B1_2). Otherwise, it is negative.
Classic CLI [VRAB21]: A response is positive if it declares cough (B1_2), combined with fever (B1_1), muscle pain (B1_6), or anosmia (B1_10). Otherwise, it is negative.
Broad CLI [VRAB21]: A response is positive if it declares muscle pain (B1_6), combined with fever (B1_1), cough (B1_2), or anosmia (B1_10). Otherwise, it is negative.
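The four symptom-based criteria above can be written as simple predicates. The sketch below assumes one particular reading of the definitions: each criterion requires its anchor symptom plus at least one of the other listed symptoms. A response is represented, hypothetically, as a set of declared symptom codes.

```python
# Sketch of the four symptom-based classifiers, assuming the reading
# "anchor symptom AND at least one of the accompanying symptoms".
FEVER, COUGH, BREATHING = "B1_1", "B1_2", "B1_3"
MUSCLE_PAIN, ANOSMIA = "B1_6", "B1_10"


def _anchor_plus_any(symptoms, anchor, others):
    """True if the anchor symptom and at least one other symptom are declared."""
    return anchor in symptoms and any(s in symptoms for s in others)


def umd_cli(symptoms):
    return _anchor_plus_any(symptoms, FEVER, (COUGH, BREATHING))


def stringent_cli(symptoms):
    return _anchor_plus_any(symptoms, ANOSMIA, (FEVER, MUSCLE_PAIN, COUGH))


def classic_cli(symptoms):
    return _anchor_plus_any(symptoms, COUGH, (FEVER, MUSCLE_PAIN, ANOSMIA))


def broad_cli(symptoms):
    return _anchor_plus_any(symptoms, MUSCLE_PAIN, (FEVER, COUGH, ANOSMIA))


declared = {"B1_2", "B1_6"}  # cough and muscle pain
print(umd_cli(declared), stringent_cli(declared),
      classic_cli(declared), broad_cli(declared))
```

Under this reading, a respondent with only cough and muscle pain is positive for Classic CLI and Broad CLI but negative for UMD CLI and Stringent CLI.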
These methods for classifying cases as positive or negative have two main limitations. First, they do not take into account diagnostic uncertainty, e.g., the same set of symptoms might be associated with some other condition. Second, these criteria are not adaptive to possible changes in the symptoms experienced as conditions change, e.g., as vaccination rates increase or new virus variants emerge. Thus, in this work, we introduce a new machine-learning-based classifier (described in Section 2.2) where the responses of users in the ground-truth set are used to train a model, which is then used to determine the status of users outside that set (users who do not report test information). We use the random-forest technique to design this classifier and the corresponding results are labeled Random Forest in what follows.
We refer to the values obtained with each of these five classifiers (namely, Random Forest, UMD CLI, Stringent CLI, Classic CLI, and Broad CLI) as proxy estimates (or proxy for short). We compare each proxy estimate with the estimate of active cases obtained from the official number of cases as described by Alvarez et al. [ÁBC+21], where each new case is assumed to remain active for 10 days. These last estimates are called Confirmed. Both Confirmed and the estimates using the various proxies lead to time series with one estimated value per day.
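The Confirmed estimate can be illustrated as a trailing sum over the daily series of new official cases. The sketch below assumes that a case counts as active on its report day and on the following days, up to 10 days in total; the exact convention used in [ÁBC+21] may differ.

```python
# Sketch of the "Confirmed" active-case estimate: every new official case
# is assumed to stay active for `active_days` days, report day included.
def active_cases(daily_new_cases, active_days=10):
    """Return, for each day, the number of cases assumed to be still active."""
    active = []
    for day in range(len(daily_new_cases)):
        start = max(0, day - active_days + 1)
        active.append(sum(daily_new_cases[start:day + 1]))
    return active


# With a constant inflow of one case per day the estimate saturates at 10.
print(active_cases([1] * 12))
```

Dividing each daily value by the population of the country would turn this series into an estimate of the fraction of active cases.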
2.2 Machine Learning Classifier: Random Forest
Each response to the survey includes a large number of questions (obviously, not all participants answer all questions). For training and inference of the Random Forest classifier, we use only questions with answers holding discrete values. From these we remove questions B7 and B8, which are only used to create the ground-truth set, as well as related questions, such as “B0: As far as you know, have you ever had coronavirus (COVID-19)?” and “B15: Do any of the following reasons describe why you were tested for COVID-19 in the past 14 days?”. Finally, we do not use the questions related to vaccination, since we do not want them to influence the classification. The set of questions used can be found in Appendix B. The answers to this set of questions are “dummified” before they are used, i.e., a question with k possible answers is replaced by k binary attributes. The Random Forest model is generated with the randomForest function in R. No hyperparameter tuning is done, and the standard options of the function are used, with the exception of limiting the model to 100 trees to reduce the training time.
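The “dummification” step can be illustrated as follows. This is a sketch in plain Python rather than the R code used for the paper, and the answer codes in the example are hypothetical. A question with k observed answers becomes k binary attributes; an unanswered question yields zeros in all of its attributes.

```python
# Sketch of one-hot ("dummy") encoding of discrete survey answers.
def dummify(rows, questions):
    """Return (column names, encoded rows) for the given discrete questions."""
    levels = {
        q: sorted({row[q] for row in rows if row.get(q) is not None})
        for q in questions
    }
    columns = [f"{q}={level}" for q in questions for level in levels[q]]
    encoded = [
        [1 if row.get(q) == level else 0
         for q in questions for level in levels[q]]
        for row in rows
    ]
    return columns, encoded


rows = [
    {"D1": 1, "E3": 2},
    {"D1": 2, "E3": 1},
    {"D1": 1},  # E3 left unanswered
]
columns, encoded = dummify(rows, ["D1", "E3"])
print(columns)
print(encoded)
```

The resulting binary matrix, together with the ground-truth labels, is the kind of input a random forest implementation would be trained on.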
Observe that the questions in Appendix B include all symptoms, but also have many more questions, including behavioral or demographic aspects. Additionally, the Random Forest classifier can give different weights to different symptoms, while previously proposed symptom based criteria are based on determining only whether a symptom is present or not. Thus, overall the Random Forest classifier is much more versatile than the symptom-based criteria described in the previous section. Additionally, there are other aspects that make the Random Forest classifier(s) more adaptive:
Firstly, we create different models for different countries. It is expected that different countries will have local characteristics, thus training and using the classifier with data from one same country can capture them.
Secondly, we create not one but several models per country: one for each 3-month period. This allows the model to capture and adapt to aspects that change over time, like the level of vaccination, the surge of new variants, or the stringency measures imposed.
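One way to organise the training data for this scheme is to bucket the ground-truth responses by country and 3-month period, and then fit one model per bucket. The sketch below assumes that the periods are calendar quarters (as in the 2021-Q3 and 2021-Q4 labels used later) and that each response carries hypothetical `country` and `date` fields.

```python
# Sketch of grouping responses into one training set per (country, quarter).
from collections import defaultdict
from datetime import date


def quarter_key(country, day):
    """Return a key such as ("ZA", "2021-Q3") for a country and a date."""
    quarter = (day.month - 1) // 3 + 1
    return (country, f"{day.year}-Q{quarter}")


def group_for_training(responses):
    """Map each (country, quarter) key to the list of its responses."""
    groups = defaultdict(list)
    for response in responses:
        key = quarter_key(response["country"], response["date"])
        groups[key].append(response)
    return dict(groups)


responses = [
    {"country": "ZA", "date": date(2021, 8, 9)},
    {"country": "ZA", "date": date(2021, 12, 1)},
    {"country": "ZA", "date": date(2021, 12, 15)},
]
groups = group_for_training(responses)
print(sorted(groups))
```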
2.3 Evaluating the Classifiers
In order to verify whether the Random Forest classifier provides better proxy estimates than the symptoms-based classifiers, we selected a set of countries and tested the performance of each classifier in the last two quarters of 2021. To this end, we randomly divided the ground-truth set into a training and a testing set, with 70% and 30% of the responses of the ground-truth set in each subset, respectively. Table 1 shows the results for three countries that have detected Omicron in December for the periods of July-September 2021 (2021-Q3) and of October-December 2021 (2021-Q4). The classification performance metrics used are:
Table 1: Performance for three different countries in two different 3-month periods (2021-Q3: July-September 2021 and 2021-Q4: October-December 2021) of the different classifiers in the ground-truth set, when randomly divided into training (70%) and testing (30%) subsets.
Accuracy: Ratio of cases correctly classified over the size of the test set.
Sensitivity / recall: Ratio of cases correctly classified as positive over the number of positive cases.
Specificity: Ratio of cases correctly classified as negative over the number of negative cases.
F-score: Harmonic mean of precision and recall, where the precision is the ratio of cases correctly classified as positive over the number of all cases classified as positive.
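The four metrics listed above follow directly from the confusion matrix of a classifier on the test set. The sketch below computes them from two parallel lists of booleans; the function and key names are chosen for this example only.

```python
# Sketch of the performance metrics computed from a confusion matrix.
def classification_metrics(actual, predicted):
    """Return accuracy, sensitivity, specificity and F-score as a dict."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

    def ratio(num, den):
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(actual)),
        "sensitivity": recall,
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }


actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy example there are 2 true positives, 1 false negative, 1 false positive, and 4 true negatives, giving an accuracy of 0.75 and a specificity of 0.8.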
As can be seen in Table 1, Random Forest almost always shows the highest performance (marked in bold) among the classification methods used.
As another test, we then selected a set with the 20 countries that have the largest number of available responses in the UMD Global CTIS dataset along with South Africa. For each of these countries, the first two columns of Table 2 show the official Test Positivity Rates obtained via Our World In Data [RMRG+20, Our21] (OWID TPR) and the corresponding survey-based estimate from the UMD Global CTIS dataset (CTIS TPR). The remaining columns show the Pearson correlation coefficient between the time series of Confirmed active cases (computed based on data from Johns Hopkins University [Joh20] as described by Alvarez et al. [ÁBC+21]) and that of each of the candidate proxies in the period June 18th, 2021² to December 31st, 2021. All time series have one value per day, which is the average of the latest 14 days.
Table 2: Test-positivity rate (TPR) obtained from OWID and extracted from the UMD Global CTIS data for the 20 countries with largest survey data and South Africa. Values of at most 0.1 are shown in bold. The remaining columns show the Pearson correlation coefficient of each different proxy with the Confirmed time series. Correlation values of at least 0.9 are shown in bold. The time period used is Jun 18th, 2021 to Dec 31st, 2021. The estimates have been smoothed with a rolling average of 14 days.
We can make two observations from Table 2. First, Random Forest turns out to be the candidate proxy that exhibits the highest correlation values in most countries. Second, 17 out of the 21 countries exhibit low TPR (≤ 0.1) values in at least one of the first two columns (either official or survey-based TPR), and 11 out of the 21 exhibit low values in both columns, with 7 having values no higher than 0.05³. This suggests that such countries tend to keep the case count relatively under control and report data somewhat correctly. We can thus interpret the high correlation between the Random Forest proxy and the Confirmed time series as a sign that this proxy constitutes the most promising option among the five proxies considered.
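The comparison behind the correlation columns of Table 2 can be sketched as two steps: smooth each daily series with a 14-day trailing average, then compute the Pearson correlation between the smoothed proxy and the smoothed Confirmed series. The treatment of the first 13 days (averaging over the days available so far) is an assumption of this sketch.

```python
# Sketch of the smoothing and correlation steps.
from math import sqrt


def trailing_mean(series, window=14):
    """Average of the latest `window` values (fewer at the start of the series)."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


proxy = [0.010, 0.012, 0.015, 0.020, 0.018, 0.014]
confirmed = [0.001, 0.0012, 0.0016, 0.0021, 0.0019, 0.0013]
r = pearson(trailing_mean(proxy, window=3), trailing_mean(confirmed, window=3))
print(round(r, 3))
```

Because the Pearson coefficient is invariant to scale, a proxy can correlate strongly with Confirmed even when its absolute level is much higher.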
2.4 Prevalence and Efficacy Estimation
As mentioned, each classifier will be used to determine whether survey responses correspond to positive or negative cases. Hence, the prevalence of COVID-19 estimated by a given classifier is the ratio of the number of positive cases to the total number of responses. We then consider four subsets of responses:
Unvaccinated: Participants that respond negatively to the question “V1: Have you had a COVID-19 vaccination?”
Vaccinated: Participants that respond positively to Question V1.
Vaccinated with 1 dose: Participants that respond positively to Question V1 and declare having received 1 dose in Question “V2: How many COVID-19 vaccinations have you received?”
Vaccinated with 2 doses: Participants that respond positively to Question V1 and declare having received 2 doses in Question V2.
Unfortunately, from the questions in the UMD Global CTIS it is not possible to know whether those with one dose are fully vaccinated, i.e., they have received a one-dose vaccine, or they simply received only the first dose of a two-dose vaccination. Similarly, it is not possible to know whether the participant received a booster shot.
For each of these subsets, the prevalence of COVID-19 is computed as the fraction of responses classified as positive among the responses that report a given vaccination status. For each proxy we also estimate the vaccine efficacy (VE) against illness as in [VRAB21], based on the estimates of prevalence among unvaccinated (PU) and vaccinated (PV):
$$\mathrm{VE} = 1 - \frac{\mathrm{PV}}{\mathrm{PU}}$$
The confidence intervals of this metric are obtained using the Katz-log Method [AB15]. Since we have three subsets of vaccinated participants, we compute the vaccine efficacy for the subsets Vaccinated, Vaccinated with 1 dose, and Vaccinated with 2 doses.
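To make the computation concrete, the sketch below assumes the usual definition VE = 1 − PV/PU, so that PV/PU is a risk ratio, and applies the Katz log method to that ratio: the standard error of its logarithm is sqrt(1/a − 1/n_v + 1/c − 1/n_u), where a and c are the numbers of positive responses among the n_v vaccinated and n_u unvaccinated responses. The counts in the example are invented for illustration.

```python
# Sketch of vaccine efficacy with a Katz log confidence interval.
from math import exp, sqrt


def vaccine_efficacy(pos_v, n_v, pos_u, n_u, z=1.96):
    """Return (VE, lower bound, upper bound) for a 95% interval by default."""
    prevalence_v = pos_v / n_v
    prevalence_u = pos_u / n_u
    risk_ratio = prevalence_v / prevalence_u
    # Katz log method: standard error of log(risk ratio).
    se_log_rr = sqrt(1 / pos_v - 1 / n_v + 1 / pos_u - 1 / n_u)
    rr_low = risk_ratio * exp(-z * se_log_rr)
    rr_high = risk_ratio * exp(z * se_log_rr)
    # VE = 1 - RR, so the bounds swap places.
    return 1 - risk_ratio, 1 - rr_high, 1 - rr_low


# Invented counts: 50 positives among 1000 vaccinated responses and
# 100 positives among 1000 unvaccinated responses.
ve, ve_low, ve_high = vaccine_efficacy(50, 1000, 100, 1000)
print(round(ve, 2), round(ve_low, 2), round(ve_high, 2))
```

Note that the interval is asymmetric around the point estimate, because it is symmetric on the logarithmic scale of the risk ratio.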
2.5 Time Periods of Interest
2.5.1 South Africa
The main objective of this work is to evaluate the change in vaccine efficacy due to the Omicron variant. To this end, we evaluate the decrease in vaccine efficacy in South Africa from mid-June 2021 until the end of 2021. Moreover, to ensure that we have sufficient data for our estimates, we concentrate on three time periods in 2021, each lasting about a month. Two are dominated by the Delta variant: i) June 18 to July 18, 2021, which is the period considered in [VRAB21], and ii) August 9 to September 6, 2021. One is dominated by Omicron: iii) December 1st to 31st, 2021⁴ (see Table 3). In addition to considering South Africa as a whole, we also study the Gauteng province, which is among the most affected by Omicron in the country.
Table 3: Percentage of sequenced virus samples belonging to Delta and Omicron in South Africa from June 1st to December 31st of 2021. The third column presents the total number of samples reported on the corresponding date.
2.5.2 World
Beyond South Africa, we study the 50 countries for which the UMD Global CTIS has the largest amount of data. We compute for all of them the vaccine efficacy in two periods.
Period 1: The month of October (in which Omicron was still not present).
Period 2: The month of December (in which Omicron was present).
A computed efficacy value is only considered if it is non-negative, both prevalences PV and PU are at least 0.01, and the number of samples used to compute them is at least 1000. We only consider further the countries with at least one efficacy value in Period 2.
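The selection rule above can be expressed as a single predicate. The sketch assumes that the minimum of 1000 samples applies separately to the vaccinated and the unvaccinated group; the text does not state this explicitly.

```python
# Sketch of the filter applied to each computed efficacy value.
MIN_PREVALENCE = 0.01
MIN_SAMPLES = 1000


def efficacy_is_usable(ve, p_v, p_u, n_v, n_u):
    """True if the efficacy value meets all the selection criteria."""
    return (
        ve >= 0
        and p_v >= MIN_PREVALENCE
        and p_u >= MIN_PREVALENCE
        and n_v >= MIN_SAMPLES
        and n_u >= MIN_SAMPLES
    )


print(efficacy_is_usable(0.4, 0.03, 0.05, 2000, 1500))
print(efficacy_is_usable(0.4, 0.005, 0.05, 2000, 1500))
```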
We have observed that the information on prevalence of Omicron is available [Our21] with a significant delay. Hence, most countries do not report relevant presence of Omicron until the second half of December 2021. For that reason, we consider the prevalence of Omicron reported in Period 3: from December 15th, 2021 to January 7th, 2022⁵. Furthermore, among the countries mentioned above, in order to have a reasonable estimate of the prevalence of the Omicron variant, we consider only countries whose data is based on sequencing at least 30 virus samples. We say that these are the countries with presence of Omicron and use their estimated Omicron prevalence in Period 3 in some of our results.
For all countries with presence of Omicron, we compare the estimated vaccination efficacy using Random Forest among all three vaccination groups and for both periods. For this, we adopt simple statistical methods, such as correlation analysis.
3 Results
3.1 Prevalence and Vaccination Efficacy in South Africa
Figures 1a and 1b show the prevalence of COVID-19 in South Africa in the period June 18th to December 31st, 2021, with the different proxies. The direct approach of Figure 1a shows a gap from the estimate Confirmed derived from the official number of cases to the other proxies. This gap can be explained by a combination of under-detection in the official number of cases (in South Africa the test-positivity rate is above 15%, as seen in Table 2) and the presence of a background of symptoms that never goes to zero. Figure 1b shows that if each curve is independently normalized to the unit scale all proxies closely track the evolution of the official number of cases Confirmed.
Figure 1: Prevalence in South Africa obtained with the different proxies, smoothed with a rolling average of 14 days from June 18th to December 31st, 2021. In the left plot (a) we have the actual ratio (note that the y axis is in logarithmic scale). In the right plot (b) all curves are normalized so the smallest value is 0 and the largest value is 1.
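The per-curve normalization used for the right-hand plot is a min-max rescaling, applied to each series independently. A minimal sketch follows; the handling of a constant series is a choice made for this example.

```python
# Sketch of min-max normalization of one time series to the unit interval.
def min_max_normalize(series):
    """Rescale so that the smallest value is 0 and the largest is 1."""
    low, high = min(series), max(series)
    if high == low:
        return [0.0] * len(series)  # constant series: no spread to rescale
    return [(value - low) / (high - low) for value in series]


print(min_max_normalize([2, 4, 6]))
```

This rescaling removes the difference in absolute level between the proxies and Confirmed, so only the shape of each curve over time is compared.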
In Figures 2a, 2b, 2c and 2d we show the COVID-19 prevalence in South Africa among Vaccinated, Unvaccinated, Vaccinated with 1 dose and Vaccinated with 2 doses with the different proxies. We can observe that the UMD CLI and Stringent CLI proxies show a low infection prevalence in the period July-September and the month of December when compared with the Random Forest proxy. This is possibly because UMD CLI and Stringent CLI have a fixed combination of symptoms that did not capture well the new variants Delta and Omicron, while the Random Forest classifier is trained on a 3-month period and can adapt to these changes. On the other hand, Classic CLI and Broad CLI show a high prevalence in the period October-November, when the official data was showing that the number of cases was very low, possibly because of existing symptoms in the population not related to COVID-19.
Figure 2: Prevalence in South Africa among Vaccinated, Unvaccinated, Vaccinated with 1 dose, and Vaccinated with 2 doses, with different proxies.
Focusing on the Random Forest proxy and comparing prevalence among Vaccinated (2a) and Unvaccinated (2b), we observe that the unvaccinated population shows a similar magnitude across the two waves (August-September and December), whereas the Vaccinated group shows a much higher prevalence in the December wave. This hints at a decrease of vaccine efficacy towards infection with the introduction of Omicron, as we will show next.
Figure 3a shows the prevalence in South Africa estimated with Random Forest across the reported vaccination states. Here we confirm the observation that in the December wave there was a disproportionate increase of infections in the vaccinated groups (Vaccinated, Vaccinated with 1 dose and Vaccinated with 2 doses). We also observe that, as expected, subjects vaccinated with two doses show higher protection than those reporting only one dose (with Vaccinated somewhere in between since it combines both groups).
Figure 3: (a) Prevalence and (b) vaccination efficacy in South Africa among people with different levels of vaccination, estimated with Random Forest.
As for vaccination efficacy, Figure 3b shows the estimates for South Africa, again with Random Forest. While the data in October-November has lower quality due to the reduced number of cases in that country, we can clearly observe the reduction of vaccine efficacy, towards infection, when contrasting the August-September period to the December period when Omicron dominates. Table 4 quantifies the measurements of estimated efficacy for the three periods of interest and for the five classifiers. We also provide a similar analysis in Table 5 with data restricted to the Gauteng province.
Table 4: Vaccine efficacy in South Africa calculated for three time periods: June 18th to July 18th (Jun-Jul), August 9th to September 6th (Aug-Sep), and December 1st to 31st (Dec).
Table 5: Vaccine efficacy in the Gauteng province of South Africa calculated for three time periods: June 18th to July 18th (Jun-Jul), August 9th to September 6th (Aug-Sep), and December 1st to 31st (Dec).
Figure 4 shows an area plot, estimated from the UMD Global CTIS data, of the proportion of Vaccinated with 1 dose, Vaccinated with 2 doses, and Unvaccinated from June 18th until December 31st, 2021. As can be seen, the ratio of the population vaccinated is low at the beginning of this interval, especially with two doses. We then see a sharp increase in Vaccinated between July and October. We point out that at each time point of this plot the proportions are provided by a different set of survey respondents, and the plot still closely captures the increase in vaccination.
Figure 4: Evolution of the vaccination in South Africa as ratio of the population, estimated from the UMD Global CTIS data. A small fraction of responses that declared being vaccinated without reporting the number of doses are not presented for clarity. The values are from June 18th to December 31st, 2021, smoothed with a rolling average of 14 days.
3.2 Prevalence and Vaccination Efficacy in the World
Restricting the 50 countries with the largest amount of data in the CTIS to those with presence of Omicron and a calculated efficacy value, as defined in Section 2.5.2, we obtain a set of 24 countries. In Table 8 (in Appendix C) we show, for reference, the level of vaccination in these countries⁶. The next two tables, Tables 9 and 10, present the estimates of virus prevalence in the same countries in the periods of October and December, and also estimates of vaccination efficacy towards infection.
Both prevalence estimates and the derived efficacy estimates are obtained by the Random Forest classifier and shown with 95% confidence intervals. When data is insufficient to meet the defined selection criteria (see Section 2.5.2), it is omitted and replaced by “–”. Both tables are ordered alphabetically by country name and also share a column depicting the most recent data on Omicron prevalence among all virus samples. While Table 9 focuses on the data from individuals that declared their overall vaccination status (using groups Vaccinated, Unvaccinated), Table 10 makes a more detailed characterization by considering the number of doses declared (groups Vaccinated with 1 dose, Vaccinated with 2 doses, Unvaccinated). We also observe that there is less data on individuals with only one dose, since this is a transient state in the vaccination sequence. The full information on sample sizes can be consulted in Appendix C in Tables 11 and 12.
Figure 5a shows three pairs of box plots. Each pair allows comparing vaccine efficacy in October and December when considering data from the selected countries. Table 6 presents the average corresponding to each boxplot, with the 95% confidence interval. We observe that although results are inconclusive for Vaccinated with 1 dose, there is a clear decrease of overall efficacy when considering Vaccinated and Vaccinated with 2 doses.
Figure 5: Analysis of vaccine efficacy towards preventing infection: Sub-figure (a) shows distributions of efficacy in October and December, for the countries with presence of Omicron (as defined in Section 2.5.2); Sub-figures (b,c,d) show vaccination efficacy versus Omicron prevalence in the same set of countries, depending on vaccination status. For each country the 95% confidence intervals of the two values are shown as black lines. The blue line is the Loess curve fitting of the data.
Table 6: Prevalence of COVID-19 and vaccine efficacy (with 95% confidence interval) in the countries with presence of Omicron in the periods of October and December 2021.
The next three figures, Figures 5b, 5c and 5d, allow us to see a clear trend when plotting efficacy against the most recent relative level of Omicron presence in each selected country. For each case, we present a smoothed line, in blue, depicting a clear decreasing trend. Table 7 presents estimates for the correlation coefficient (using Pearson correlation) together with the corresponding p-value, which confirms its statistical significance for the usual α = 5%.
Table 7: Relationship between prevalence of Omicron and vaccine efficacy in the countries with presence of Omicron.
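The text does not state how the p-values of the correlation coefficients were obtained. A common choice, assumed here purely for illustration, is the t-test for a Pearson coefficient, whose statistic is t = r · sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The values r = −0.6 and n = 24 below are illustrative inputs, since not every country contributes a value to every vaccination group.

```python
# Sketch of the t statistic associated with a Pearson correlation coefficient.
from math import sqrt


def pearson_t_statistic(r, n):
    """t statistic for testing that the true correlation is zero."""
    return r * sqrt((n - 2) / (1 - r * r))


t_value = pearson_t_statistic(-0.6, 24)
print(round(t_value, 2))
```

For 22 degrees of freedom the two-sided critical value at α = 5% is about 2.07, so a statistic of this magnitude would be judged significant under this test.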
4 Discussion
After its surge in South Africa, the Omicron variant is increasing in prevalence in other countries. Although it is still unclear whether this variant is associated with milder disease [KBPC+21], several studies have raised concerns over the decrease of vaccine effectiveness against infection [PvSG+21, NKL+21, KST+21, LMD+21], and this can lead to a wider spread of the virus even in countries with a high vaccination uptake. While we have observed that Omicron reduces the efficacy of vaccines, new studies show that T cells may remain effective with this new variant [AQM22].
Daily participatory symptom surveillance, widely deployed in most countries over the last couple of years, has the potential to offer a new instrument for assessing both global and local trends in health status. While limited in assessing the ground truth, due to the limited control over the sample design and the need to preserve anonymity, we believe that the vast number of daily survey responses can compensate for some of these factors. In this study, we developed a method that adapts and calibrates the selection of symptoms, and other covariates from the survey, against the reported SARS-CoV-2 infection status across different time periods and locations. This was shown to provide a better proxy for assessing the trend in infections and to track the officially reported cases more closely, in particular in those countries that had strong surveillance and consistent test positivity rates.
Using this improved classifier we complemented earlier results [VRAB21] that used traditional fixed combinations of symptoms, and updated the analysis for South Africa showing the observed decrease in vaccine efficacy when contrasting a Delta-dominated period (August-September 2021) with the recent Omicron-dominated period (December 2021). We confirmed the presence of a measurable drop in vaccine efficacy from 0.62 (with 95% confidence interval [0.58, 0.65]) in the Delta period to 0.24 (95% CI [0.17, 0.30]) in the Omicron period in the whole country (0.62[0.54, 0.69] to 0.30[0.18, 0.40] in the Gauteng province). In addition, we confirmed that having two doses of vaccine confers better protection than one dose, both in Delta (0.81[0.78, 0.84] versus 0.51[0.46, 0.55]) and Omicron (0.30[0.23, 0.36] versus 0.09[0.00, 0.18]) dominated periods. However, we have no data on the status of respondents with regard to a possible booster dose.
By January 7th, 2022, there was a limited number of candidate countries exhibiting both a high prevalence of Omicron and a high level of sequencing data supporting it. Nevertheless, we extend the analysis to these countries and show the observed changes in efficacy when comparing the months of October (pre-Omicron) with December (with partial presence of Omicron). Although these results should be confirmed once Omicron becomes more dominant in many countries, we have observed a significant correlation of around −0.6 or stronger between vaccine efficacy (with either one or two doses) and the prevalence of Omicron. We also stress that this reduction of efficacy is towards infection; while it does have an impact on transmission, it does not imply a reduction of vaccine efficacy in protection against serious disease, hospitalization and death.
There are several assumptions that frame our analysis. We assume that UMD Global CTIS answers provide a sample of the population that is interchangeable between the Delta and Omicron dominated periods. Additionally, we did not take into account possible effects from waning immunity and vaccine booster shots; however, by considering several different countries we have a mix of different vaccination timings.
Data Availability
The data presented in this paper and some of the programs used to process it are openly accessible at https://github.com/GCGImdea/coronasurveys/tree/master/papers/omicron_efficacy_paper_medRxiv.
https://raw.githubusercontent.com/GCGImdea/coronasurveys/master/data/estimates-confirmed/PlotData/
https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/variants/covid-variants.csv
A List of Symptoms
In the UMD Global CTIS the following question is asked: “B1 In the last 24 hours, have you had any of the following?” [The21a]. The following is the list of possible answers (non-exclusive):
Fever (B1_1).
Cough (B1_2).
Difficulty breathing (B1_3).
Fatigue (B1_4).
Stuffy or runny nose (B1_5).
Aches or muscle pain (B1_6).
Sore throat (B1_7).
Chest pain (B1_8).
Nausea (B1_9).
Loss of smell or taste (B1_10).
Headache (B1_12).
Chills (B1_13).
B Questions Used for the Machine Learning Model
The following is the list of survey questions whose answers are used to create the Random Forest models, and to classify the responses with them: B1_1, B1_2, B1_3, B1_4, B1_5, B1_6, B1_7, B1_8, B1_9, B1_10, B1_11, B1_12, B1_13, B1_14, B1b_x1, B1b_x2, B1b_x3, B1b_x4, B1b_x5, B1b_x6, B1b_x7, B1b_x8, B1b_x9, B1b_x10, B1b_x11, B1b_x12, B1b_x13, B1b_x14, B3, B5, B6, B9, B10, B11, B12_1, B12_2, B12_3, B12_4, B12_5, B12_6, B13_1, B13_2, B13_3, B13_4, B13_5, B13_6, B13_7, B14_1, B14_2, B14_3, B14_4, B14_5, C0_1, C0_2, C0_3, C0_4, C0_5, C0_6, C1_m, C2, C3, C5, C6, C7, C8, C9, C9a, C12, C13_1, C13_2, C13_3, C13_4, C13_5, C13_6, C14, D1, D2, D3, D4, D5, D6_1, D6_2, D6_3, D7, D8, D9, D10, E2, E3, E4, E7, H1, H2, H3.
The questions removed are B0, B7, B8, B15, and all the questions related to vaccination (V-questions).
C Countries with Omicron Prevalence
Table 8 shows basic official vaccination data on December 31st, 2021, of these countries. Tables 9 and 10 show the COVID-19 prevalence and the vaccine efficacy in October and December in the countries with presence of Omicron as defined in Section 2.5.2.
Table 8: Information about vaccination on December 31st, 2021, in the countries with presence of Omicron (as defined in Section 2.5.2).
Table 9: Prevalence of Omicron in COVID-19 and vaccination efficacy in the countries with presence of Omicron (as defined in Section 2.5.2).
Table 10: Prevalence of Omicron and vaccination efficacy with one and two doses in the countries with presence of Omicron (as defined in Section 2.5.2). The prevalence of Omicron is replicated from Table 9 for easy reference.
Table 11: Number of survey responses used in each period from the countries with presence of Omicron (as defined in Section 2.5.2), for each level of vaccination.
Table 12: Number of survey responses classified as positive by Random Forest in each period from the countries with presence of Omicron (as defined in Section 2.5.2), for each level of vaccination.
D Ethical Declaration
The Ethics Board of IMDEA Networks Institute gave ethical approval for this work on 2021/07/05. IMDEA Networks has signed Data Use Agreements with Facebook, Carnegie Mellon University (CMU) and the University of Maryland (UMD) to access their data, specifically UMD project 1587016-3 entitled C-SPEC: Symptom Survey: COVID-19 and CMU project STUDY2020_00000162 entitled ILI Community-Surveillance Study.
Footnotes
* This work is partially supported by grant CoronaSurveys-CM, funded by IMDEA Networks and Comunidad de Madrid, Spain, and individual donations to the CoronaSurveys Project https://coronasurveys.org.
1 Except in the US, where the survey has been run by CMU [Del20, SRB+21].
3 The WHO considers countries to have the epidemic under control when their TPR is below 0.05 [W+20].
4 The information on variant presence is obtained from [Our21], which extracts it from [EBM17] via [Hod21].
5 Our World In Data [Our21] stopped sharing the variant data on January 10th, 2022, upon GISAID request.