medRxiv

Are COVID-19 data reliable? The case of the European Union

Pavlos Kolias
doi: https://doi.org/10.1101/2021.12.24.21268373
Section of Statistics and Operational Research, Department of Mathematics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
For correspondence: pakolias@math.auth.gr

Abstract

Previous studies have used Benford’s distribution to assess whether COVID-19 cases and deaths are misreported. Data inaccuracies provide false information to the media, undermine the global response, and hinder the preventive measures taken by countries worldwide. In this study, we analyze daily new cases and deaths from all the countries of the European Union and estimate their conformance to Benford’s distribution. For each country, two statistical tests and two measures of deviation are calculated to determine whether the reported statistics comply with the expected distribution. Four country-level developmental indexes are also included: GDP per capita, health expenditures, the Universal Health Coverage index, and the full vaccination rate. Regression analysis is used to show whether the deviation from Benford’s distribution is associated with these indexes. The findings indicate that only three countries were in line with the expected distribution: Bulgaria, Croatia, and Romania. For daily cases, Denmark, Greece, and Ireland showed the greatest deviation from Benford’s distribution, and for deaths, Malta, Cyprus, Greece, Italy, and Luxembourg had the highest deviation. Furthermore, the vaccination rate was found to be positively associated with deviation from Benford’s distribution. These results suggest that, overall, official data provided by authorities do not conform to Benford’s law, yet this approach acts only as a preliminary tool for data verification. More extensive studies should be conducted, with a more thorough investigation of the countries that showed the greatest deviation.

Introduction

The COVID-19 pandemic has affected the lives of millions of people worldwide. Due to the rapid spread of the virus (Hafeez et al., 2020), nearly every country employed measures against it, such as national lockdowns and restrictions on typical activities. The pandemic showed that statistical and machine learning models can potentially predict the number of new cases or deaths for a given country (Cássaro & Pires, 2020; Niazkar & Niazkar, 2020; Neto et al., 2020). An accurate forecast of the infection curve can support government measures to suppress the growth rate. However, accurately predicting or modelling the spread of COVID-19 requires reliable and valid data from the authorities. The pandemic raised issues about data collection and handling. Media reports have questioned whether the statistics provided by countries are trustworthy (Kilani, 2021). Several studies have questioned the accuracy of government data and have linked data manipulation with transparency and democracy indexes (Adsera, Boix & Payne, 2003; Magee & Doces, 2015; Rozenas & Stukal, 2019).

Previous studies in different fields have applied Benford’s distribution (or law) to detect fraudulent and manipulated data. Specifically, for COVID-19, it was found that deaths were underreported in the USA (Campolieti, 2021), while in China no manipulation was found (Koch & Okamura, 2020). A study for Japan also showed deviation from Benford’s distribution (Lee, Han & Jeong, 2020). Furthermore, it was found that countries with higher values of the developmental index are less likely to deviate from Benford’s law (Balashov, Yan & Zhu, 2021). This study applies Benford’s law to detect deviations of the first digits of the announced cases and deaths from the expected frequencies in the European Union (EU). We further investigate whether the deviation observed for each country is associated with four developmental indexes: GDP per capita, health expenditures (% of GDP), the Universal Health Coverage Index, and the full vaccination rate.

Methods

Sample

The public COVID-19 data of the European Union regarding daily cases and deaths were exported from the European Centre for Disease Prevention and Control (ECDC) and consisted of observations between the 2nd of March and the 20th of December 2021 (N = 8820). ECDC’s Epidemic Intelligence team collects and refines daily data on new cases and deaths associated with COVID-19, based on reports from health authorities worldwide. Apart from COVID-19 data, we included the gross domestic product per capita (GDPc), the healthcare expenditures of countries as a percentage of GDP (HGDP), and the Universal Health Coverage Index (UHC) from the World Bank (https://data.worldbank.org/). Finally, we included the full COVID-19 vaccination rate as of the 16th of December 2021, obtained from ECDC.

Benford’s distribution

Benford’s law (or the first-digit law) is a probability distribution for the first digit in a set of numbers. It was formally proposed in 1938 by the physicist Frank Benford, following earlier work by the mathematician Simon Newcomb. Benford claimed that in natural and unrestricted data sets, the probability of each digit d appearing as the first digit is given by the formula:

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9$$

Based on Benford’s distribution, the probabilities for each digit d as the first digit are presented in Table 1.

Table 1. Probabilities of the digit in the first position

First digit d:  1      2      3      4      5      6      7      8      9
P(d):           0.301  0.176  0.125  0.097  0.079  0.067  0.058  0.051  0.046
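
The probabilities in Table 1 follow directly from the formula for Benford’s distribution. The following is a minimal Python sketch of how they, and the first-digit counts of a series of daily values, could be computed. The function names are hypothetical, and the handling of zero or negative daily values (zeros skipped, sign ignored) is an assumption, not something stated in the study.

```python
import math


def benford_probabilities():
    """Return {d: P(d)} for first digits d = 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the leading digit of an integer count, or None for zero.

    Assumption: the sign is ignored and zero counts are skipped,
    because zero has no leading non-zero digit.
    """
    value = abs(int(value))
    if value == 0:
        return None
    while value >= 10:
        value //= 10
    return value


def first_digit_counts(values):
    """Count how often each digit 1..9 appears as the first digit."""
    counts = {d: 0 for d in range(1, 10)}
    for v in values:
        d = first_digit(v)
        if d is not None:
            counts[d] += 1
    return counts
```

For example, `first_digit_counts(daily_cases)` applied to one country’s series (here `daily_cases` is a placeholder for a list of integers) gives the observed frequencies that are compared against `benford_probabilities()`.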

The most common application of the law is in economics, where it has been considered as a tool for checking the validity of tax data and detecting fraud (Nigrini, 1996; Durtschi, Hillison & Pacini, 2004; Tam Cho & Gaines, 2007). More recent studies have used Benford’s law to investigate whether COVID-19 data provided by countries are accurate (Kilani, 2021; Silva & Figueiredo Filho, 2021; Campolieti, 2021; Koch & Okamura, 2020) and whether the deviation from Benford’s distribution is associated with developmental indexes (Balashov, Yan & Zhu, 2021).

Goodness-of-fit

First, in order to investigate the extent to which the observed cases and deaths conform to the frequencies expected under Benford’s law, two goodness-of-fit tests were applied: the chi-squared (χ2) goodness-of-fit test and the Kolmogorov-Smirnov (K-S) test. The chi-squared test statistic is given by:

$$\chi^2 = \sum_{i=1}^{9} \frac{(O_i - E_i)^2}{E_i}$$

where the index i is the digit, and Oi and Ei are the observed and expected frequencies of the i-th digit, respectively. The degrees of freedom for this test are equal to 8, and the critical value is $\chi^2_{0.05,\,8} = 15.507$ for the significance level set at α = 0.05; thus, any value of the statistic greater than the critical value would imply significant deviation from the expected distribution. However, in large samples, the interpretation of significance should be avoided, as the test has enough power to detect even small deviations from the expected distribution (Lin, Lucas & Shmueli, 2013). To accompany the results of the chi-squared test, Cramér’s V was calculated along with a 95% bootstrap CI as an estimate of the effect size. The Kolmogorov-Smirnov D statistic is commonly used for comparing empirical with theoretical continuous distributions, but it can also be used with integers. The statistic is given by:

$$D = \max_{1 \le i \le 9} \left| F_O(i) - F_E(i) \right|$$

where FO and FE are the observed and expected cumulative distribution functions of the first digit. Both the chi-squared and D statistics are greatly affected by sample size; therefore, we included two measures that are not affected by large sample sizes, namely the Euclidean distance (ED) in the nine-dimensional space (Tam Cho & Gaines, 2007), given by:

$$ED = \sqrt{\sum_{i=1}^{9} (PO_i - PE_i)^2}$$

and the Mean Absolute Distance (MAD), given by:

$$MAD = \frac{1}{9} \sum_{i=1}^{9} \left| PO_i - PE_i \right|$$

where POi and PEi are the observed and expected proportions of the first digit, respectively.
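
The statistics described above can be illustrated with a short Python sketch that takes the nine observed first-digit counts for one country. This is an illustration under stated assumptions rather than the author’s code: the function name and dictionary keys are hypothetical, and Cramér’s V is computed in its usual goodness-of-fit form, V = sqrt(χ2 / (n(k − 1))) with k = 9 categories, which the text does not spell out.

```python
import math

# Expected Benford proportions PE_i for first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def deviation_measures(counts):
    """Compute chi-squared, K-S D, ED, MAD and Cramér's V.

    counts: sequence of nine observed frequencies O_i for digits 1..9.
    """
    n = sum(counts)
    observed = [c / n for c in counts]        # PO_i
    expected_counts = [n * p for p in BENFORD]  # E_i

    chi2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected_counts))

    # K-S D: largest absolute gap between the cumulative distributions.
    cum_obs = cum_exp = 0.0
    ks_d = 0.0
    for po, pe in zip(observed, BENFORD):
        cum_obs += po
        cum_exp += pe
        ks_d = max(ks_d, abs(cum_obs - cum_exp))

    ed = math.sqrt(sum((po - pe) ** 2 for po, pe in zip(observed, BENFORD)))
    mad = sum(abs(po - pe) for po, pe in zip(observed, BENFORD)) / 9

    # Assumed goodness-of-fit form of Cramér's V with k - 1 = 8.
    cramers_v = math.sqrt(chi2 / (n * 8))

    return {"chi2": chi2, "ks_d": ks_d, "ed": ed, "mad": mad,
            "cramers_v": cramers_v}
```

A chi-squared value above 15.507 would indicate significant deviation at the 0.05 level, whereas ED and MAD depend only on the proportions and so do not grow with the sample size.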

Regression analysis

The two measures of deviation (ED and MAD) were used as the dependent variables in two regression models. The independent variables were the gross domestic product per capita (GDPc), the healthcare expenditures of countries as a percentage of GDP (HGDP), the Universal Health Coverage Index (UHC), and the full COVID-19 vaccination rate (Vac), in order to examine whether the distance observed from Benford’s distribution could be associated with those predictors. Instead of relying on OLS estimates for the parameters of the model, bootstrap estimates were calculated, because the small sample of countries (N = 27) makes them more robust. With the bootstrap, we drew 10000 samples with replacement, each of the same size as the original sample, and each time estimated the OLS coefficients, thereby creating the sampling distribution of each coefficient along with 95% bootstrap CIs (Davison & Hinkley, 1997).
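
The bootstrap procedure described above can be sketched as follows. This is a minimal illustration using NumPy, not the author’s implementation: the function name is hypothetical, rows (countries) are resampled as whole cases, and percentile intervals are assumed because the text does not say which type of bootstrap CI was used.

```python
import numpy as np


def bootstrap_ols(X, y, n_boot=10000, alpha=0.05, seed=0):
    """Case-resampling bootstrap for OLS coefficients.

    X: (n, p) array of predictors without an intercept column.
    y: (n,) array with the response (for example ED or MAD).
    Returns (mean, lower, upper), each of length p + 1, intercept first.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # add the intercept column
    rng = np.random.default_rng(seed)

    coefs = np.empty((n_boot, design.shape[1]))
    for b in range(n_boot):
        # Draw n row indices with replacement (same size as the original).
        idx = rng.integers(0, n, size=n)
        coefs[b] = np.linalg.lstsq(design[idx], y[idx], rcond=None)[0]

    # Percentile interval (an assumption about the CI type).
    lower = np.percentile(coefs, 100 * alpha / 2, axis=0)
    upper = np.percentile(coefs, 100 * (1 - alpha / 2), axis=0)
    return coefs.mean(axis=0), lower, upper
```

With 27 countries and four predictors, `X` would have shape (27, 4). A coefficient whose 95% interval excludes zero would be read as significantly associated with the deviation measure.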

Results

The results of the goodness-of-fit tests along with the two measures of deviation are presented in Table 2. For almost all countries, except for Bulgaria, Croatia, and Romania, significant deviations were found for both cases and deaths. For daily cases, Denmark, Ireland, and Greece were associated with the highest chi-squared statistics, and this was also confirmed by the two distance measures (Figures 1 and 2). Regarding deaths, Cyprus, Italy, and Greece had the highest chi-squared statistics and distance measures. The K-S D statistic in most cases agreed with the chi-squared test.

Table 2. Goodness-of-fit statistics and distance measures across countries for new cases and deaths associated with COVID-19

Figure 1. Mean Absolute Distance across countries for a) cases and b) deaths.

Figure 2. Euclidean Distance across countries for a) cases and b) deaths.

The bootstrap estimates and 95% bootstrap CIs of the regression analysis for the two measures of deviation are presented in Table 3. To avoid very small coefficients, GDP per capita was log-transformed and the other three predictors were divided by 100. Regarding new cases, no predictor was found to significantly affect either MAD or ED. Vaccination rate was positively associated with deviation from Benford’s distribution in new cases (0.076, 95% CI [0.020, 0.144]) and deaths (3.415, 95% CI [1.175, 6.286]), indicating that countries with a higher full vaccination percentage tend to deviate more from Benford’s law.

Table 3. Bootstrap estimates and 95% CIs for Mean Absolute Deviation and Euclidean Distance from Benford’s distribution

Discussion

This study aimed to examine the validity of COVID-19 data from the EU using Benford’s law. Data on daily new cases and deaths were collected from ECDC for the period from the 2nd of March 2021 to the 20th of December 2021. In addition, four country-level indexes were collected: GDP per capita, health expenditure as a percentage of GDP, the Universal Health Coverage index, and the full vaccination rate. Two goodness-of-fit tests were applied, the chi-squared test and the Kolmogorov-Smirnov test, and two measures of deviation were estimated, the Euclidean distance and the Mean Absolute Distance. Bulgaria, Croatia, and Romania did not deviate from Benford’s law for either new cases or deaths. Regarding daily cases, Cyprus, Germany, Lithuania, and Slovakia were in line with Benford’s distribution, while Denmark, Greece, and Ireland showed the greatest distance from Benford’s distribution. Regarding deaths, France, Ireland, Latvia, the Netherlands, Slovenia, and Spain matched Benford’s law, while Malta, Cyprus, Greece, Italy, and Luxembourg had the highest distance from Benford’s law. The results from the regression analysis suggested that the full vaccination rate was positively associated with non-conformity with Benford’s law, with countries with the highest vaccination percentage exhibiting greater deviation.

The results of this study imply that the deviation from Benford’s law is not associated with a country’s economy, although such an association was suggested by earlier findings (Hollyer, Rosendorff & Vreeland, 2011). However, the effect would possibly be more apparent if developing countries were included alongside developed ones (Judge & Schechter, 2009). Deviations from Benford’s distribution are a preliminary step in obtaining evidence of data manipulation; for the specific economies that showed the greatest deviations, further studies are suggested to validate the data reported by authorities. Additional parameters can be included, such as lockdown restrictions, preventive measures, and regional statistics and indicators.

Data Availability

All data produced are available online at: https://data.worldbank.org/

Funding

This study did not receive any funding.

Declaration of Conflicting Interests

The author declares that there is no conflict of interest.

References

  1. Adsera, A., Boix, C., & Payne, M. (2003). Are you being served? Political accountability and quality of government. The Journal of Law, Economics, and Organization, 19(2), 445–490.
  2. Balashov, V. S., Yan, Y., & Zhu, X. (2021). Using the Newcomb–Benford law to study the association between a country’s COVID-19 reporting accuracy and its development. Scientific Reports, 11(1), 1–11.
  3. Campolieti, M. (2021). COVID-19 deaths in the USA: Benford’s law and under-reporting. Journal of Public Health (Oxford, England).
  4. Cássaro, F. A., & Pires, L. F. (2020). Can we predict the occurrence of COVID-19 cases? Considerations using a simple model of growth. Science of the Total Environment, 728, 138834.
  5. Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge University Press.
  6. Durtschi, C., Hillison, W., & Pacini, C. (2004). The effective use of Benford’s law to assist in detecting fraud in accounting data. Journal of Forensic Accounting, 5(1), 17–34.
  7. Hafeez, A., Ahmad, S., Siddqui, S. A., Ahmad, M., & Mishra, S. (2020). A review of COVID-19 (Coronavirus Disease-2019) diagnosis, treatments and prevention. EJMO, 4(2), 116–125.
  8. Hollyer, J. R., Rosendorff, B. P., & Vreeland, J. R. (2011). Democracy and transparency. The Journal of Politics, 73(4), 1191–1205.
  9. Judge, G., & Schechter, L. (2009). Detecting problems in survey data using Benford’s Law. Journal of Human Resources, 44(1), 1–24.
  10. Kilani, A. (2021). Authoritarian regimes’ propensity to manipulate Covid-19 data: a statistical analysis using Benford’s Law. Commonwealth & Comparative Politics, 59(3), 319–333.
  11. Koch, C., & Okamura, K. (2020). Benford’s law and COVID-19 reporting. Economics Letters, 196, 109573.
  12. Lee, K. B., Han, S., & Jeong, Y. (2020). COVID-19, flattening the curve, and Benford’s law. Physica A: Statistical Mechanics and its Applications, 559, 125090.
  13. Lin, M., Lucas Jr, H. C., & Shmueli, G. (2013). Research commentary - too big to fail: large samples and the p-value problem. Information Systems Research, 24(4), 906–917.
  14. Magee, C. S., & Doces, J. A. (2015). Reconsidering regime type and growth: lies, dictatorships, and statistics. International Studies Quarterly, 59(2), 223–237.
  15. Niazkar, H. R., & Niazkar, M. (2020). Application of artificial neural networks to predict the COVID-19 outbreak. Global Health Research and Policy, 5(1), 1–11.
  16. Nigrini, M. J. (1996). A taxpayer compliance application of Benford’s law. The Journal of the American Taxation Association, 18(1), 72.
  17. Neto, O. P., Reis, J. C., Brizzi, A. C. B., Zambrano, G. J., de Souza, J. M., Pedroso, W., … & Zângaro, R. A. (2020). Compartmentalized mathematical model to predict future number of active cases and deaths of COVID-19. Research on Biomedical Engineering, 1–14.
  18. Rozenas, A., & Stukal, D. (2019). How autocrats manipulate economic news: Evidence from Russia’s state-controlled television. The Journal of Politics, 81(3), 982–996.
  19. Silva, L., & Figueiredo Filho, D. (2021). Using Benford’s law to assess the quality of COVID-19 register data in Brazil. Journal of Public Health, 43(1), 107–110.
  20. Tam Cho, W. K., & Gaines, B. J. (2007). Breaking the (Benford) law: Statistical fraud detection in campaign finance. The American Statistician, 61(3), 218–223.
Posted December 25, 2021.