medRxiv

Estimating the COVID-19 Prevalence in Spain with Indirect Reporting via Open Surveys

Augusto Garcia-Agundez, Oluwasegun Ojo, Harold Hernandez, Carlos Baquero, Davide Frey, Chryssis Georgiou, Mathieu Goessens, Rosa Lillo, Raquel Menezes, Nicolas Nicolaou, Antonio Ortega, Efstathios Stavrakis, Antonio Fernandez Anta
doi: https://doi.org/10.1101/2021.01.29.20248125
Author affiliations:
  • Augusto Garcia-Agundez: 1 Multimedia Communications Lab, etit, TU Darmstadt, Darmstadt, Germany. For correspondence: augusto.garcia@kom.tu-darmstadt.de
  • Oluwasegun Ojo: 2 IMDEA Networks Institute, Madrid, Spain
  • Harold Hernandez: 3 Department of Statistics, University Carlos III de Madrid, Madrid, Spain
  • Carlos Baquero: 4 Departmento de Informatica, University of Minho, Braga, Portugal
  • Davide Frey: 5 Inria Centre de Recherche Rennes Bretagne Atlantique, Rennes, France
  • Chryssis Georgiou: 6 Department of Computer Science, University of Cyprus, Nicosia, Cyprus
  • Mathieu Goessens: 7 IMT Atlantique, Nantes, France
  • Rosa Lillo: 3 Department of Statistics, University Carlos III de Madrid, Madrid, Spain
  • Raquel Menezes: 8 Departmento de Matematica, University of Minho, Braga, Portugal
  • Nicolas Nicolaou: 9 Algolysis Ltd, Limassol, Cyprus
  • Antonio Ortega: 10 Department of Electrical and Computer Engineering, USC Viterbi School of Engineering, Los Angeles, CA, USA
  • Efstathios Stavrakis: 9 Algolysis Ltd, Limassol, Cyprus
  • Antonio Fernandez Anta: 2 IMDEA Networks Institute, Madrid, Spain

ABSTRACT

During the initial phases of the COVID-19 pandemic, accurate tracking proved unfeasible. Initial estimation methods pointed towards case numbers that were much higher than officially reported. In the CoronaSurveys project, we have been addressing this issue using open online surveys with indirect reporting. We compare our estimates with the results of a serology study for Spain, obtaining a high correlation ($R^2 = 0.89$). In our view, these results strongly support the idea of using open surveys with indirect reporting as a method to broadly sense the progress of a pandemic.

1 INTRODUCTION

During the initial phases of the COVID-19 pandemic, progress tracking via massive serology testing has proven to be unfeasible. However, initial estimation methods suggested that the real numbers of COVID-19 cases were significantly higher than those officially reported (1). For instance, by April 30th, 2020, the number of confirmed fatalities due to COVID-19 in the US was 66,028, and the number of confirmed cases was 1,080,303. However, with that number of fatalities the number of cases must have been no less than 4,784,637, by simply using the Case-fatality Ratio (CFR) of 1.38% measured in Wuhan (2).
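The lower bound quoted above follows from dividing the confirmed deaths by the CFR. The following is a minimal sketch of that arithmetic; the function name `min_cases_from_deaths` is hypothetical and only the three input figures come from the text.

```python
def min_cases_from_deaths(deaths: int, cfr: float) -> float:
    """Cases implied by a death count under a fixed case-fatality ratio.

    Hypothetical helper for illustration: cases = deaths / CFR.
    """
    if not 0 < cfr <= 1:
        raise ValueError("cfr must be a fraction in (0, 1]")
    return deaths / cfr


# Figures quoted in the text for the US on April 30th, 2020.
us_deaths = 66_028
us_confirmed_cases = 1_080_303
wuhan_cfr = 0.0138  # 1.38%, reference (2)

implied_cases = min_cases_from_deaths(us_deaths, wuhan_cfr)
underreporting_ratio = implied_cases / us_confirmed_cases

print(int(implied_cases))               # implied minimum number of cases
print(round(underreporting_ratio, 2))   # implied cases per confirmed case
```

Under these inputs the implied number of cases is roughly 4.4 times the confirmed count, which is the discrepancy the paragraph points to.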

In the case of Spain, the discrepancy seems to be even higher. Preliminary studies point towards only one in 53 cases being reported during the first days of the pandemic (3). Although the recent availability of massive testing has reduced this discrepancy, demographic statistics still indicate a degree of underreporting to this day, which can be seen, among other indicators, in mortality numbers. All-cause mortality statistics in Spain point to two periods in 2020 with a significant excess of deaths over the predicted values: March and April (44,599 excess deaths) and August to December (26,186 excess deaths) (4). These numbers contrast with the officially reported number of deaths due to COVID-19, which stands at 50,837 (5). This discrepancy is corroborated in publications from official government authorities, which indicate an ongoing estimated underreporting of 20% to 40% (6).

In the CoronaSurveys project (7), we aim to track the progress of the pandemic using online, open, anonymous surveys with indirect reporting. Recent articles have also suggested the use of surveys to monitor the pandemic, both for Spain (8, 9) and globally (10). However, to our knowledge, all surveys conducted in Spain have employed direct reporting only, asking participants about themselves. CoronaSurveys instead implements the network scale-up method of indirect reporting, which allows us to collect data on a wide fraction of the population with a small number of responses and in a very short time frame (11). In this article, we compare the accuracy of CoronaSurveys with a gold standard: serology testing data collected by the Spanish government in the ENE-COVID study (12).

2 METHODS

The survey deployed in the CoronaSurveys project, which can be answered via browser or mobile app, includes two questions:

  1. How many people do you know in your area for which you know their health condition? The answer to this question by participant $i$ is the Reach $r_i$.

  2. How many of those were diagnosed with or have symptoms of COVID-19? The answer to this question by participant $i$ is the Cumulative Number of Cases $c_i$.

In the CoronaSurveys project we have focused on simplicity and brevity to maximize interest and retain users who would consistently provide data every few days. For that reason the total number of questions in the survey has been kept small at all times. Our approach yielded good initial results with about 200 responses per week. The survey has been promoted via social networks, via direct contacts and, more recently, with paid advertising. To ensure total anonymity, the surveys are hosted on a private instance of LimeSurvey (13). Data is aggregated daily, and in this process the responses are shuffled so that no single entry can be traced back to its user. All the data is published in a public GitHub repository. The study design was reviewed and approved by the ethics committee of the IMDEA Networks Institute. The survey includes an informed consent statement.

Once the data is collected, we remove outlier responses. A response is considered an outlier if (1) $r_i$ lies more than 1.5 times the interquartile range above the upper quartile (which for the data in this paper means $r_i > 175$) or if (2) $c_i/r_i$ is greater than 1/3 (to exclude participants with exceptionally high contact with cases). For this paper we only consider responses in which participants provide information for their region. The data is then aggregated by region over all participants to obtain the estimator of COVID-19 prevalence $(\sum_i c_i)/(\sum_i r_i)$ (11).
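The outlier rules and the ratio estimator described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the quartile convention (`method="inclusive"`) is an assumption since the text does not specify one, responses with zero reach are dropped to avoid division by zero, and the sample responses are made up.

```python
from statistics import quantiles


def upper_fence(reaches):
    """Upper quartile plus 1.5 times the interquartile range.

    Assumption: quartiles use the 'inclusive' convention.
    """
    q1, _, q3 = quantiles(reaches, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def filter_outliers(responses, max_ratio=1 / 3):
    """Keep (reach, cases) pairs that pass both outlier rules."""
    fence = upper_fence([reach for reach, _ in responses])
    return [
        (reach, cases)
        for reach, cases in responses
        # Rule (1): reach not above the fence; zero reach is also dropped
        # (an assumption of this sketch). Rule (2): cases/reach <= 1/3.
        if 0 < reach <= fence and cases / reach <= max_ratio
    ]


def prevalence(responses):
    """Ratio estimator: sum of cases divided by sum of reach."""
    total_reach = sum(reach for reach, _ in responses)
    if total_reach == 0:
        raise ValueError("no reach to aggregate")
    return sum(cases for _, cases in responses) / total_reach


# Made-up responses for one region, as (reach, cases) pairs.
sample = [(50, 2), (80, 4), (100, 5), (120, 3), (60, 30), (1000, 10)]
kept = filter_outliers(sample)
estimate = prevalence(kept)
print(kept)
print(estimate)
```

In this synthetic sample, the response with reach 1000 fails rule (1) and the response with 30 cases out of 60 contacts fails rule (2). Note that the paper computes its threshold ($r_i > 175$) over its full dataset, whereas this sketch derives the fence from whatever list it is given.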

3 RESULTS

To evaluate how accurately this method senses the cumulative number of COVID-19 cases, we compare our estimates with the results of the serology study of Pollán et al. (12) for Spain. We exclude Ceuta and Melilla due to lack of data on our part. Conducted between April 27 and May 11, 2020, the serology study provides data for n = 61,075 participants (0.1787% ± 0.0984% of the regional population, and 0.1299% of the national population). We consider as positive cases those that tested positive in either the point-of-care or the immunoassay IgG test (Supplementary Table 6 in Pollán et al. (12), column "Either test positive").

For our estimates, we consider the (up to) 100 most recent survey responses per region on April 20. The date is chosen because the mean period between illness onset and a 95% confidence of IgG antibodies presence is 14 days (14). This results in n = 999 responses (59 ± 35 per region) across Spanish regions, with a cumulative reach of $\sum_i r_i = 67{,}199$ (0.1827% ± 0.0701% of the regional population, and 0.1434% of the national population).
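The selection step described above (up to 100 most recent responses per region as of a cutoff date) could look roughly like the sketch below. The record layout `(region, day, reach, cases)`, the function name `latest_per_region` and the sample rows are all assumptions made for illustration; they do not reflect the actual layout of the published dataset.

```python
from collections import defaultdict
from datetime import date


def latest_per_region(responses, cutoff, limit=100):
    """Return, per region, up to `limit` most recent responses dated <= cutoff.

    `responses` is an iterable of (region, day, reach, cases) tuples
    (hypothetical layout).
    """
    by_region = defaultdict(list)
    for region, day, reach, cases in responses:
        if day <= cutoff:
            by_region[region].append((day, reach, cases))
    return {
        region: sorted(rows, key=lambda row: row[0], reverse=True)[:limit]
        for region, rows in by_region.items()
    }


# Made-up rows; a small limit keeps the example readable.
rows = [
    ("Madrid", date(2020, 4, 10), 40, 2),
    ("Madrid", date(2020, 4, 15), 60, 3),
    ("Madrid", date(2020, 4, 22), 50, 1),  # after the cutoff, ignored
    ("Navarra", date(2020, 4, 18), 30, 1),
]
selected = latest_per_region(rows, cutoff=date(2020, 4, 20), limit=2)
print(selected)
```

Capping the number of responses per region limits how much a heavily sampled region can dominate through older responses, at the cost of discarding data where many responses are available.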

4 DISCUSSION

The Bland-Altman plot in Figure 1A shows a high correlation between the CoronaSurveys estimates and the gold standard. A direct comparison of crude percentages, depicted in Figure 1B, also yields excellent results ($R^2 = 0.8994$). The linear regression equation indicates that CoronaSurveys consistently underestimates the number of cases by approximately 46%, possibly due to asymptomatic cases. This ratio is consistent with the estimates of the Covid19Impact study of Oliver et al. (9), which used more than 140,000 direct survey responses collected on March 28-30. It is also consistent with the data on asymptomatic cases reported by Pollán et al. (12), who found that around a third of the seropositive participants were asymptomatic. Table 1 presents a detailed comparison of the estimates per region obtained in the different studies.
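For readers unfamiliar with the two comparisons used here, the sketch below computes the coefficient of determination of a simple linear regression (which equals the squared Pearson correlation) and the Bland-Altman bias with 95% limits of agreement. The function names are hypothetical and the input values are synthetic; they are not the per-region figures from Table 1.

```python
from statistics import mean, stdev


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def bland_altman(x, y):
    """Mean difference (bias) and 95% limits of agreement between x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Synthetic prevalence percentages per region (NOT data from the paper).
survey = [1.0, 2.0, 3.0, 4.0]
serology = [2.0, 4.0, 6.0, 8.0]

r2 = r_squared(survey, serology)
bias, lower, upper = bland_altman(survey, serology)
print(r2)
print(bias, lower, upper)
```

The synthetic example illustrates why both views are useful: the two series are perfectly correlated, yet the survey values are systematically lower, which only the Bland-Altman bias reveals.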

Table 1.

Percentage (and 95% confidence interval) of infected population per region according to the ENE-COVID serology study (12), CoronaSurveys and Covid19Impact (9) (symptom-only model).

Figure 1.

Comparison between the serology test and CoronaSurveys, Bland-Altman (A) and direct correlation (B)

Figure 2A shows how the number of replies per region affects the resulting value of $R^2$. This analysis indicates that 50 responses per region can already offer a reasonable estimation of cases. Including more replies may increase accuracy further, but the numbers remain reasonably stable. Naturally, it is important that replies are well distributed across all regions. Figure 2B depicts the effect on $R^2$ of shifting the cutoff date by up to one week in either direction. Theoretically, a bell curve centered on April 20th should be expected, as estimating too early would imply that too few cases are reported, and estimating too late would include more cases. We indeed observe an impact on accuracy, and the left half of the bell curve is more visible. The change in accuracy is mostly due to responses collected on April 16th. The right half of the bell curve is missing because of the low number of daily responses after April 16th, which implies that the daily estimates are computed from sets of responses with large intersections.

Figure 2.

Convergence of correlation with number of replies (A) and day of the month (B)

Interestingly, a similarly high number of responses was collected on April 14th, with nearly no impact on accuracy. We believe this is due to the distribution of the responses. As depicted in Figure 3, additional responses from regions where many are already available will barely have an impact on the global result. As the great majority of contributions for April 14th were for Madrid, where we already had many responses available, the 77 new responses on April 14th barely had any impact.

Figure 3.

Distribution of new survey responses on April 14 (A) and April 16 (B)

Our study has a number of limitations. Firstly, as presented in Table 1, our number of responses in some regions was limited (e.g., 9 responses in La Rioja or 16 in Navarra and Cantabria). Our own analysis suggests this is not enough to offer reliable data for these three regions. Additionally, our criteria for eliminating outliers are heuristic and may change in the future as we collect more data.

Despite these limitations, the estimates obtained in CoronaSurveys show a high correlation with serology tests. Moreover, since the underestimation in our estimates is homogeneous across all regions, and consistent with the one-third fraction of asymptomatic cases reported by Pollán et al. (12), these estimates can be "corrected" to provide an accurate cumulative number of cases for each region. We will further evaluate the robustness of our model as Pollán et al. publish the results of their three additional serology studies.

In summary, we believe these results strongly support using open surveys with indirect reporting as a method to broadly sense the progress of a pandemic.

CONFLICT OF INTEREST STATEMENT

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

The analysis presented in this article was conducted by Augusto Garcia-Agundez and Antonio Fernandez Anta with support and feedback from all remaining co-authors. The data acquisition and processing techniques were developed by all co-authors.

FUNDING

At the time of writing this article, CoronaSurveys has received no public funding. Social networks surveys have been partially funded via donations through our website. CoronaSurveys received an award from the UMD/CMU COVID-19 Symptom Data Challenge.

DATA AVAILABILITY STATEMENT

The datasets generated and analyzed for this study can be found in the CoronaSurveys GitHub repository at https://github.com/GCGImdea/coronasurveys.

ACKNOWLEDGMENTS

We would like to thank all CoronaSurveys researchers and collaborators for their contribution to this project: https://coronasurveys.org/team/.

REFERENCES

  1. Maxmen A. How much is coronavirus spreading under the radar. Nature 10 (2020).
  2. Verity R, Okell LC, Dorigatti I, Winskill P, Whittaker C, Imai N, et al. Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Diseases 20 (2020) 669–677.
  3. Krantz SG, Rao ASS. Level of underreporting including underdiagnosis before the first peak of COVID-19 in various countries: Preliminary retrospective results based on wavelets and deterministic modeling. Infection Control & Hospital Epidemiology (2020) 1–3.
  4. [Dataset] Centro Nacional de Epidemiologia, Instituto de Salud Carlos III. Informe MoMo. Situación a 30 de diciembre de 2020. https://www.isciii.es/QueHacemos/Servicios/VigilanciaSaludPublicaRENAVE/EnfermedadesTransmisibles/MoMo/Paginas/Informes-MoMo-2020.aspx (2020).
  5. [Dataset] Ministerio de Sanidad, Gobierno de España. Actualización n° 282. Enfermedad por el coronavirus (COVID-19). https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov/docuActualizacion282COVID19.pdf (2020).
  6. Moros MJS, Monge S, Rodríguez BS, San Miguel LG, Soria FS. COVID-19 in Spain: view from the eye of the storm. The Lancet Public Health (2020).
  7. Ojo O, García-Agundez A, Girault B, Hernández H, Cabana E, García-García A, et al. Coronasurveys: Using surveys with indirect reporting to estimate the incidence and evolution of epidemics. KDD Workshop Humanitarian Mapping, San Diego, California USA, August 24, 2020. ArXiv preprint:2005.12783 (2020).
  8. Linares M, Garitano I, Santos L, Ramos JM. Estimando el numero de casos de COVID-19 a tiempo real utilizando un formulario web a través de las redes sociales: Proyecto COVID19-TRENDS. Semergen (2020).
  9. Oliver N, Barber X, Roomp K, Roomp K. Assessing the impact of the COVID-19 pandemic in Spain: Large-scale, online, self-reported population survey. Journal of Medical Internet Research 22 (2020) e21319.
  10. [Dataset] Facebook Data for Good. COVID-19 symptom survey – request for data access. https://dataforgood.fb.com/docs/covid-19-symptom-survey-request-for-data-access/ (2020). Accessed: 2021-01-24.
  11. Bernard HR, Hallett T, Iovita A, Johnsen EC, Lyerla R, McCarty C, et al. Counting hard-to-count populations: the network scale-up method for public health. Sex. Transm. Infect. 86 (2010) ii11–ii15.
  12. Pollán M, Pérez-Gómez B, Pastor-Barriuso R, Oteo J, Hernán MA, Pérez-Olmeda M, et al. Prevalence of SARS-CoV-2 in Spain (ENE-COVID): a nationwide, population-based seroepidemiological study. Lancet 396 (2020) 535–544.
  13. LimeSurvey Project Team / Carsten Schmitz. LimeSurvey: An Open Source survey tool. LimeSurvey Project, Hamburg, Germany (2012).
  14. Pallett SJ, Rayment M, Patel A, Fitzgerald-Smith SA, Denny SJ, Charani E, et al. Point-of-care serological assays for delayed SARS-CoV-2 case identification among health-care workers in the UK: a prospective multicentre cohort study. Lancet Respir. Med. 8 (2020) 885–894.
Posted February 01, 2021.