## Abstract

Effectively evaluating, controlling and predicting the course of the COVID-19 pandemic requires knowledge of the true number of infections in the population. This number, however, generally differs substantially from the number of confirmed cases due to a large fraction of asymptomatic infections as well as geographically and temporally variable testing effort and strategies. Here I use age-stratified death count statistics, published age-dependent infection fatality risks and stochastic modeling to estimate the true prevalence and growth of COVID-19 infections among adults (age *≥* 20 years) in 161 countries, from early 2020 until November 1, 2020. My predictions are largely consistent with data from multiple previous nationwide seroprevalence surveys. As of November 1, 2020, the nationwide cumulative COVID-19 prevalence (past and current infections relative to the population size) is estimated at 31% (95%-CI 22–50) for Peru, 27% (17–41) for Mexico, 22% (14–34) for Brazil, 12% (7.2–20) for the US, 11% (6.4–18) for the United Kingdom, 8.2% (5.2–15) for France, 7.4% (4.9–13) for Sweden, 4.2% (2.5–6.8) for Canada, 1.8% (1.2–3) for Germany and 0.12% (0.074–0.26) for Japan. These time-resolved estimates expand the possibilities to evaluate the factors influencing the pandemic’s progression and to assess vaccination needs around the world. Periodically updated estimates are available at: www.loucalab.com/archive/COVID19prevalence

## Introduction

Accurate estimates of the true prevalence of COVID-19 in a population are needed for evaluating (and optimizing) disease control policies and testing strategies, determining seasonal effects, predicting future disease spread, assessing the risk of foreign travel and determining vaccination needs [1]. Further, parameter-rich and potentially underdetermined epidemiological models [2, 3] generally benefit from independently obtained estimates of disease prevalence. Due to the existence of a large fraction of asymptomatic cases, as well as variation in reporting, testing effort and testing strategies (e.g., random vs symptom-triggered), confirmed case counts cannot be directly converted to infection counts and a comparison of confirmed case counts between countries is generally of limited informative value [4]. While large-scale seroprevalence surveys (e.g., using antibody tests) can yield information on the disease’s true prevalence in a population, such surveys involve substantial financial and logistical challenges and only yield prevalence estimates at a specific time point.

In contrast to case reports, COVID-19-related death counts are generally regarded as less sensitive to testing effort and strategy [5, 6], and fortunately most countries have established nationwide continuous reporting mechanisms for death counts. Hence, in principle, knowing the infection fatality risk (IFR, the probability of death following infection) should permit a conversion of death counts to infection counts [5, 6]. The IFR of COVID-19, however, depends strongly on the patient’s age, and hence the effective IFR of the entire population depends on the population’s age structure as well as the disease’s age distribution [7]. Indeed, it was shown that the age-dependency of the IFR, the age-dependency of COVID-19 prevalence, and the age structure of the population are largely sufficient to explain variation in the effective IFR between countries [8]. This suggests that age-stratified death counts can (and must) be used with age-dependent IFR estimates in order to obtain an accurate estimate of infection counts. This approach has been successfully used to estimate COVID-19 prevalence over time in Europe until May 4, 2020 [6].
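To make the dependence of the effective IFR on age structure concrete, the following minimal sketch computes a population-wide IFR as the infection-weighted mean of age-specific IFRs. All numbers, the three coarse age classes and the function name `effective_ifr` are hypothetical and chosen only for illustration; they are not values from this study.

```python
import numpy as np

def effective_ifr(pop_by_age, ifr_by_age, rel_risk_by_age):
    """Population-wide IFR as the infection-weighted mean of age-specific IFRs."""
    pop = np.asarray(pop_by_age, dtype=float)
    ifr = np.asarray(ifr_by_age, dtype=float)
    risk = np.asarray(rel_risk_by_age, dtype=float)
    infections = pop * risk  # proportional to the number of infections per age class
    return float(np.sum(infections * ifr) / np.sum(infections))

# Hypothetical inputs (illustration only): three coarse age classes.
ifr = [0.0005, 0.005, 0.05]      # young, middle-aged, old
young_country = [70, 25, 5]      # population shares in percent
old_country = [40, 35, 25]
uniform_risk = [1.0, 1.0, 1.0]   # same infection risk in every age class

ifr_young = effective_ifr(young_country, ifr, uniform_risk)
ifr_old = effective_ifr(old_country, ifr, uniform_risk)
print(ifr_young, ifr_old)
```

With identical age-specific IFRs and infection risks, the older hypothetical population has an effective IFR several times that of the younger one, which is why a single IFR cannot be applied to total death counts across countries.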

Unfortunately, the ongoing pandemic necessitates continuously updated prevalence estimates. Moreover, age-stratified and time-resolved death statistics are not readily available for many countries with insufficiently comprehensive reporting, thus preventing a direct adoption of the above approach [6, 9]. In cases where only total death counts are available (e.g., as disseminated by the World Health Organization) one needs to somehow independently determine the likely age distribution of infections in order to convert total death counts to infection counts. Here I address this challenge by leveraging information on the age distribution of COVID-19 infections from multiple countries with available age-stratified death reports, to estimate the likely age-distribution of COVID-19 in other countries, while accounting for each country’s age structure. Based on these calibrations, I estimate the prevalence of COVID-19 (cumulative number of infections, weekly new infections and exponential growth rate) over time in 161 countries up until November 1, 2020, among adults aged 20 years or more. My predictions are largely consistent with data from multiple previously published nationwide seroprevalence surveys.

### Calibrating the age distribution of COVID-19 prevalence

In order to calculate infection counts solely from total (i.e., non-age-stratified) death counts, while accounting for the age-dependency of the IFR and each country’s population age structure, independent estimates of the ratios of infection risks between age groups (i.e., the risk of infection in any one age group relative to any other age group) are needed. To determine the general distribution of age-specific infection risk ratios, I analyzed weekly age-stratified COVID-19-related death reports from 20 countries around the world using a probabilistic model of Poisson-distributed time-delayed death counts (see Methods for details). Briefly, for any given country *c*, any given week *w*, and any given age group *g*, I assumed that the number of new infections during that week (*I*_{c,w,g}) is approximately equal to *α*_{c,g}*I*_{c,w,r}*N*_{c,g}*/N*_{c,r}, where *r* represents some fixed reference age group, *N*_{c,g} is the population size of age group *g*, and *α*_{c,g} is the relative risk of an individual in age group *g* being infected compared to that of an individual in age group *r*. The expected number of deaths in each age group 4 weeks later (roughly the average time lag between infection and death [10]), denoted *D*_{c,w+4,g}, was assumed to be *I*_{c,w,g}*R*_{g}, where *R*_{g} is the IFR for that age group. Age-specific IFRs were calculated beforehand by taking the average over multiple IFR estimates reported in the literature [8, 9, 11–14]. This model thus accounts for the age-structure of each country, the age-distribution of the disease in each country and the age-dependency of the IFR. A critical assumption of the model is that, in any given country, nationwide age-specific infection risks co-vary linearly between age groups over time, i.e., an increase of disease prevalence in one age group coincides with a proportional increase of prevalence in any other age group. 
This assumption is motivated by the observation that nationwide death rates generally covary strongly linearly between age groups (Fig. 1A and Supplemental Fig. S1); the adequacy of this model is also confirmed in retrospect (see below). For each country, I fitted the infection risk ratios *α*_{c,g} (for all *g* ≠ *r*) as well as the weekly infections in the reference age-group *I*_{c,w,r} (one per week) to the age-stratified weekly death counts using a maximum-likelihood approach and assuming that weekly death counts follow a Poisson distribution. This stochastic model explained the data generally well, with observed weekly death counts almost always falling within the 95% confidence interval of the model’s predictions (Supplemental Fig. S2). This supports the initial assumption that infection risks co-vary approximately linearly between age groups over time and suggests that country-specific but time-independent infection risk ratios are largely sufficient for describing the age-distribution of COVID-19 infections in a country and over time. For any given age group *g*, the fitted infection risk ratios *α*_{c,g} differed between countries but were generally within the same order of magnitude (Fig. 1B). On the basis of this observation, and as explained in the next section, it thus seems possible to approximately estimate the number of infections in any other country based on total death counts, the population’s age structure and the *ensemble* of infection risk ratios *α*_{c,g} fitted above.

### Estimating infection counts over time

Based on the ensemble of fitted infection risk ratios and total (non-age-stratified) COVID-19-related death count reports disseminated by the WHO, I estimated the true weekly infection counts over time in each of 161 countries (details in Methods). Briefly, for any given country *c*, week *w* and any given set of relative infection risks *α*_{1}, *α*_{2}, ..., the total number of deaths 4 weeks later was assumed to be Poisson-distributed with expectation equal to:

$$
\mathbb{E}\{D_{c,w+4}\} = I_{c,w,r} \sum_{g} \alpha_{g} R_{g} \frac{N_{c,g}}{N_{c,r}}, \tag{1}
$$

consistent with the previously described model, where as before *R*_{g} is the IFR for age group *g*, *N*_{c,g} is the population size of age group *g* and *I*_{c,w,r} is the (a priori unknown) number of new infections in the reference age group *r* during week *w*. For the sum in Eq. (1), I considered only age groups of 20 years or older (in 5-year intervals), because estimates of the infection risk ratios *α*_{g} were unreliable for younger ages (due to low death counts) and because deaths among individuals younger than 20 years were numerically negligible compared to the total number of deaths reported. For each week, the unknown *I*_{c,w,r} was estimated via maximum-likelihood based on the total deaths reported 4 weeks later. The total number of new infections among individuals aged *≥* 20 years during that week was estimated as *I*_{c,w} = *I*_{c,w,r}*∑*_{g} *α*_{g}*N*_{c,g}*/N*_{c,r}. Cumulative (i.e., past and current) infection counts were calculated as incremental sums of the weekly infection count estimates. The pandemic’s exponential growth rate over time was subsequently calculated from the estimated weekly infection counts based on a Poisson distribution model and using a sliding-window approach.

Depending on the particular choice of infection risk ratios, this yielded different estimates for the weekly nationwide infection counts, the cumulative infection counts and the exponential growth rates over time. Uncertainty in the true infection risk ratios in any particular country was accounted for by randomly sampling from the full distribution of fitted infection risk ratios multiple times, and calculating confidence intervals of the predictions based on the obtained distribution of estimates. Estimated weekly and cumulative infection fractions (i.e., relative to population size) and exponential growth rates over time are shown for a selection of countries in Fig. 2 and Supplemental Figs. S3 and S4. A comprehensive and periodically updated report of estimates for all 161 countries is being made available at www.loucalab.com/archive/COVID19prevalence. Global color-maps of the latest estimates for all countries are shown in Fig. 3.

To assess the accuracy of the above approach, I compared the estimated cumulative infection fractions to previously published nationwide antibody-based seroprevalence surveys across 12 countries (Supplemental Table S2). Only surveys attempting to estimate nationwide seroprevalence in the general population (in particular, either using geographically or demographically stratified sampling or adjusting for sample demographics) were included. Agreement between model estimates and seroprevalence estimates was generally good, with seroprevalence point-estimates for 8 out of 9 countries (all except Brazil) being included in the model’s 95%-confidence intervals (Figs. 2K,L,N,O and Supplemental Fig. S3). Apart from potentially erroneous model predictions (discussed below), deviations from seroprevalence-based estimates may also be due to the fact that antibody concentrations in infected individuals (especially asymptomatic ones) can drop over time, rendering many of them seronegative [15–17]. Thus, previously infected individuals may not all be recognized as such. Further, sensitivity and specificity estimates for antibody tests performed in the laboratory or claimed by manufacturers need not always apply in a community setting [17], thus introducing biases in seroprevalence estimates despite adjustments for sensitivity and specificity.

### Case counts alone can yield wrong impressions

Estimates of the true COVID-19 prevalence in a population can yield insight into the pandemic’s growth dynamics that may not have been possible from reported case counts alone. Indeed, according to the present estimates, in most countries case counts initially severely underestimated the number of true infections and often did not properly reflect the progression of the pandemic, although in many countries more recent case reports do capture a much larger fraction of infections and more closely reflect the pandemic’s dynamics (Figs. 2A–E and Supplemental Fig. S4). For example, in the US, France, Sweden, Belgium, Spain, United Kingdom and many other European countries reported cases only reflected a small fraction of infections occurring in Spring 2020, while the majority of infections occurring in Summer and Fall 2020 have been successfully detected. Nevertheless, in multiple countries even recent case counts do not correctly reflect the actual dynamics of the pandemic, sometimes even suggesting an opposite trend in its growth. For example, recent case counts in Turkey, Iran, Egypt and Afghanistan severely underestimate the disease’s rapid ongoing growth (Supplemental Fig. S5). Future investigations, enabled by the infection rate estimates presented here, might be able to identify the main factors (e.g., political, financial, organizational) driving the discrepancies between infections and detected cases and suggest concrete steps to eliminate them or correct for them.

As infection counts do not depend on testing effort and strategies, they are arguably more suitable for comparing the pandemic’s progression between countries. Future investigations, enabled by the estimates presented here, might be able to identify concrete political, environmental and socioeconomic factors influencing the pandemic’s growth. For example, my results indicate that as of November 1, 2020 Sweden, often criticized for its reluctance to impose strong restrictions on its citizens, was experiencing a slower increase of infections (relative to its population size) than many other European countries such as France, Spain, Italy and the United Kingdom (Supplemental Fig. S4). A similar observation can be made for the US, also frequently pointed out as a particularly severely affected country: as of November 1, 2020 the weekly fraction of newly infected individuals (relative to population size) appears to be substantially lower in the US than in many European countries (Fig. 2F and Supplemental Fig. S4), while the cumulative fraction of infected individuals in the US is comparable to some European countries (e.g., Belgium, United Kingdom, Spain) and much lower than many South American countries (Supplemental Fig. S6). These observations highlight the importance of considering actual infection counts (and of course death counts) relative to population size when evaluating policy differences between countries.

### Caveats

The predictions presented here are subject to some important caveats. First, incomplete, erroneous or age-biased reporting of COVID-19-related deaths will have a direct impact on the estimated infection counts. This caveat is particularly important for countries with less developed medical or reporting infrastructure, as well as for countries where reports may be censored or modified for political reasons. Comparisons of results between countries should thus be done with care. Second, the age-specific infection risk ratios (*α*_{c,g}) were calibrated based on available age-stratified death statistics from a limited number of countries, and may not apply to all other countries (for example due to strong cultural differences). Uncertainty associated with this extrapolation is partly accounted for by considering infection risk ratios calibrated to multiple alternative countries (see Methods). Third, age-specific IFRs were obtained from studies in only a few countries (mostly western) and often based on a small subset of closely monitored cases (e.g., from the Diamond Princess cruise ship). These IFR estimates may not be accurate for all countries, especially countries with a very different medical infrastructure, different sex ratios in the population or a different prevalence of pre-existing health conditions (e.g., diabetes), all of which can affect the IFR. That said, estimated trends over time within any given country, in particular exponential growth rates (e.g., Figs. 2P–T), are unlikely to be substantially affected by such biases. To nevertheless examine the robustness of my estimates against variations in the IFR, I repeated the above analyses by considering for each age group an ensemble of IFRs, i.e., randomly sampling from the set of previously reported IFRs [8, 9, 11–14] rather than considering their mean. 
Median model predictions remained nearly unchanged, although, unsurprisingly, the uncertainty (i.e., the width of the confidence intervals) of the estimates increased (examples in Supplemental Fig. S7).

## Conclusion

I have presented estimates of the true nationwide prevalence and growth rate of COVID-19 infections over time in 161 countries around the world, based on official COVID-19-related death reports, age-specific infection fatality risks and each country’s population age structure. My estimates are largely consistent with data from nationwide general-population seroprevalence surveys. My findings suggest that while in many countries the detection of infections has greatly improved, there are also examples where even recent reported case counts do not properly reflect the pandemic’s dynamics. In particular, comparisons between countries based on infection counts can yield very different conclusions than comparisons merely based on confirmed case counts. My estimates thus enable more precise assessments of the disease’s progression, evaluation and improvement of public interventions and testing strategies, and estimation of nationwide vaccination needs.

## Methods

### Age-specific infection fatality risks

Age-specific infection fatality risks (IFRs) were calculated based on the following literature: Table 1 in [12], Supplementary Appendix Q in [8], Table S2 in [13], Table 2 in [11], Table S4 in [9], and Eq. (1) in [14]. For each considered age group, the average IFR across all of the aforementioned published IFRs was used, after linearly interpolating where necessary (Supplemental Table S1).
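The averaging step can be sketched as follows. This is an illustrative example only: the two IFR curves are hypothetical placeholders rather than the published values, and the helper name `mean_ifr_on_grid` is introduced here. Note that `np.interp` holds the end values constant outside a study's age range, which is an assumption of this sketch.

```python
import numpy as np

def mean_ifr_on_grid(studies, target_ages):
    """Linearly interpolate each study's IFR curve onto target_ages, then average."""
    target_ages = np.asarray(target_ages, dtype=float)
    curves = [np.interp(target_ages, ages, ifrs) for ages, ifrs in studies]
    return np.mean(curves, axis=0)

# Hypothetical IFR curves as (age midpoints, IFR); not taken from the cited studies.
studies = [
    (np.array([25.0, 45.0, 65.0, 85.0]), np.array([0.0002, 0.002, 0.02, 0.10])),
    (np.array([30.0, 50.0, 70.0]), np.array([0.0004, 0.004, 0.04])),
]
target_ages = np.array([30.0, 50.0, 70.0])
ifr_mean = mean_ifr_on_grid(studies, target_ages)
print(ifr_mean)
```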

### Calibrating age-specific infection risk ratios

Age-specific population sizes for each country (status 2019) were downloaded from the United Nations website (https://population.un.org/wpp/Download/Standard/CSV) on October 23, 2020 [18]. Time series of nationwide cumulative COVID-19-related death counts grouped by 5-year age intervals were downloaded on November 30, 2020 from COVerAGE-DB (https://osf.io/7tnfh), which is a database that gathers and curates official death count statistics from multiple official sources [19]. For each country included in COVerAGE-DB, and separately for each age-group, I ensured that cumulative death counts are non-decreasing over time by linearly re-interpolating death counts at problematic time points. The resulting time series were then linearly interpolated onto a regular weekly time grid, i.e., in which adjacent time points are 7 days apart (no extrapolation was performed, i.e., only dates covered by the original time series were included). The weekly number of new deaths in each age group was calculated as the difference of cumulative deaths between consecutive time points on the weekly grid. To ensure a high accuracy in the estimated infection risk ratios, in the following analysis I only considered countries for which COVerAGE-DB covered at least 10 weeks with at least 100 reported deaths each. The following 20 countries were thus considered: Argentina, Bangladesh, Belgium, Brazil, Chile, Colombia, Germany, Ecuador, France, United Kingdom, Indonesia, India, Italy, Mexico, Netherlands, Peru, Philippines, Sweden, Turkey and United States.
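A minimal sketch of this preprocessing is given below. It is not the code used for this study, and the rule for what counts as a "problematic" time point is an assumption made here: any cumulative count that exceeds a later count is discarded and replaced by linear interpolation between the remaining points. The function names and the example series are hypothetical.

```python
import numpy as np

def make_non_decreasing(days, cumulative):
    """Re-interpolate points that exceed any later value, so the series never decreases."""
    days = np.asarray(days, dtype=float)
    cum = np.asarray(cumulative, dtype=float)
    # Smallest cumulative count reported at or after each time point.
    future_min = np.minimum.accumulate(cum[::-1])[::-1]
    ok = cum <= future_min  # points consistent with all later counts
    return np.interp(days, days[ok], cum[ok])

def weekly_new_deaths(days, cumulative):
    """Interpolate cumulative counts onto a 7-day grid and take differences."""
    days = np.asarray(days, dtype=float)
    fixed = make_non_decreasing(days, cumulative)
    # The grid stays within the dates covered by the data (no extrapolation).
    grid = np.arange(days[0], days[-1] + 1e-9, 7.0)
    weekly_cum = np.interp(grid, days, fixed)
    return grid, np.diff(weekly_cum)

# Hypothetical cumulative series with one inconsistent report on day 7.
days = [0, 3, 7, 10, 14, 21]
cum = [0, 4, 12, 9, 14, 28]
grid, new_deaths = weekly_new_deaths(days, cum)
print(grid, new_deaths)
```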

For each considered country *c*, I chose as “reference” age group *r* the age group that had the highest cumulative number of deaths. For each other age group *g*, I estimated the infection risk ratio *α*_{c,g}, i.e., the probability of an individual in group *g* being infected relative to the probability of an individual in group *r* being infected, using a probabilistic model according to which the number of deaths in group *g* during week *w* (denoted *D*_{c,w,g}) was Poisson distributed with expectation:

$$
\mathbb{E}\{D_{c,w,g}\} = \alpha_{c,g}\, D_{c,w,r}\, \frac{N_{c,g} R_{g}}{N_{c,r} R_{r}}. \tag{2}
$$

Here, *N*_{c,g} is the population size of age group *g* in country *c* and *R*_{g} is the IFR for age group *g*. Under this model, the maximum-likelihood estimate for *α*_{c,g}, given the weekly death count time series, is:

$$
\hat{\alpha}_{c,g} = \frac{N_{c,r} R_{r}}{N_{c,g} R_{g}} \cdot \frac{\sum_{w} D_{c,w,g}}{\sum_{w} D_{c,w,r}}. \tag{3}
$$

To avoid errors due to sampling noise, only weeks with at least 100 reported deaths were considered in the sums in Eq. (3). I note that *α*_{c,g} might alternatively be estimated as the slope of the linear regression:

$$
\frac{D_{c,w,g}}{N_{c,g} R_{g}} \sim \alpha_{c,g}\, \frac{D_{c,w,r}}{N_{c,r} R_{r}}. \tag{4}
$$

Estimates obtained via linear regression were nearly identical to those obtained using the aforementioned Poissonian model, and were thus not considered further.
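The two estimators can be compared on synthetic data with a short sketch. It assumes that deaths in group *g* are expected to be proportional to deaths in the reference group, scaled by the ratio of *N* times *R* between the two groups, and it fits the regression through the origin, which is an assumption of this example. Function names and all numbers are hypothetical.

```python
import numpy as np

def risk_ratio_mle(D_g, D_r, N_g, N_r, R_g, R_r):
    """Poisson maximum-likelihood estimate of the infection risk ratio of g relative to r."""
    D_g = np.asarray(D_g, dtype=float)
    D_r = np.asarray(D_r, dtype=float)
    return float((N_r * R_r) / (N_g * R_g) * D_g.sum() / D_r.sum())

def risk_ratio_regression(D_g, D_r, N_g, N_r, R_g, R_r):
    """Slope of a regression through the origin of D_g/(N_g R_g) on D_r/(N_r R_r)."""
    y = np.asarray(D_g, dtype=float) / (N_g * R_g)
    x = np.asarray(D_r, dtype=float) / (N_r * R_r)
    return float(np.sum(x * y) / np.sum(x * x))

# Hypothetical inputs: weekly deaths generated with a true risk ratio of 0.5.
N_g, N_r = 2e6, 1e6      # population sizes
R_g, R_r = 0.01, 0.04    # age-specific IFRs
D_r = np.array([200.0, 400.0, 300.0])
D_g = 0.5 * D_r * (N_g * R_g) / (N_r * R_r)

alpha_mle = risk_ratio_mle(D_g, D_r, N_g, N_r, R_g, R_r)
alpha_reg = risk_ratio_regression(D_g, D_r, N_g, N_r, R_g, R_r)
print(alpha_mle, alpha_reg)
```

On noise-free data both estimators recover the same ratio; with real counts they differ slightly because the regression weights weeks differently from the Poisson likelihood.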

For purposes of evaluating the model’s adequacy (explained below), I also estimated the weekly number of infections in the reference age group, *I*_{c,w,r}, via maximum-likelihood based on a probabilistic model in which *D*_{c,w,g} was Poisson-distributed with expectation:

$$
\mathbb{E}\{D_{c,w,g}\} = \alpha_{c,g}\, I_{c,w-4,r}\, R_{g}\, \frac{N_{c,g}}{N_{c,r}}. \tag{5}
$$

Under this model, the maximum-likelihood estimate for *I*_{c,w−4,r} is given by:

$$
\hat{I}_{c,w-4,r} = \frac{\sum_{g} D_{c,w,g}}{\sum_{g} \alpha_{c,g} R_{g} N_{c,g}/N_{c,r}}, \tag{6}
$$

where the sums include the reference group (for which *α*_{c,r} = 1). To evaluate the adequacy of the above stochastic model in explaining the original death count data, I simulated multiple hypothetical weekly death counts for each age group and compared the distribution of simulated death counts to the true death counts. Specifically, for each country *c*, week *w* and age group *g*, I drew 100 random death counts from a Poisson distribution with expectation:

$$
\hat{\alpha}_{c,g}\, \hat{I}_{c,w-4,r}\, R_{g}\, \frac{N_{c,g}}{N_{c,r}}. \tag{7}
$$

Median simulated death counts and 50% and 95% confidence intervals, along with the original death counts, are shown for a representative selection of countries and age groups in Supplemental Fig. S2. As can be seen in that figure, the model’s simulated time series are largely consistent with the original data.
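The simulation-based check can be sketched as below: draw 100 Poisson counts per week around the fitted expectation, summarize them by percentiles, and see which observed counts fall inside the 95% band. The expected and observed values are hypothetical placeholders, and the function name is introduced here for illustration.

```python
import numpy as np

def simulate_death_bands(expected_deaths, n_draws=100, seed=0):
    """Draw Poisson death counts per week and summarize them by percentiles."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(expected_deaths, dtype=float)
    draws = rng.poisson(lam, size=(n_draws, lam.size))
    q = np.percentile(draws, [2.5, 25, 50, 75, 97.5], axis=0)
    return {"lo95": q[0], "lo50": q[1], "median": q[2], "hi50": q[3], "hi95": q[4]}

# Hypothetical fitted expectations and reported deaths for four weeks.
expected = np.array([120.0, 250.0, 400.0, 310.0])
observed = np.array([115, 262, 391, 305])

bands = simulate_death_bands(expected)
inside95 = (observed >= bands["lo95"]) & (observed <= bands["hi95"])
coverage = inside95.mean()  # fraction of weeks inside the 95% band
print(bands["median"], coverage)
```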

### Estimating infection counts from total death counts

Time series of total (non-age-stratified) nationwide cumulative reported death and case counts were downloaded from the website of the World Health Organization (https://covid19.who.int/table) on November 30, 2020. Cumulative death and case counts were made non-decreasing and interpolated onto a weekly time grid as described above. Only countries that reported at least one death per week for at least 10 weeks were included in the analysis below. For each country *c*, week *w* and any particular choice of age-specific infection risk ratios *α*_{1}, *α*_{2}, ..., the number of infections was estimated as follows. Let *r* denote some fixed reference age group with respect to which infection risk ratios are defined, i.e., such that *α*_{r} = 1 (here, ages 70–74 were used as reference). Let *I*_{c,w,r} be the (a priori unknown) number of new infections occurring during that week in the reference age group. The number of deaths occurring 4 weeks later in any age group *g*, *D*_{c,w+4,g}, was assumed to be Poisson-distributed with expectation equal to:

$$
\mathbb{E}\{D_{c,w+4,g}\} = I_{c,w,r}\, \alpha_{g} R_{g} \frac{N_{c,g}}{N_{c,r}}. \tag{8}
$$

The total number of weekly deaths, *D*_{c,w+4}, is thus Poisson-distributed with expectation:

$$
\mathbb{E}\{D_{c,w+4}\} = I_{c,w,r} \sum_{g} \alpha_{g} R_{g} \frac{N_{c,g}}{N_{c,r}}. \tag{9}
$$

As explained in the main text, only age groups *≥* 20 years were included because infection risk ratios could not be reliably estimated for younger ages and because the contribution of younger ages to total death counts can be considered numerically negligible. Under the above model, the maximum-likelihood estimate for *I*_{c,w,r} is given by:

$$
\hat{I}_{c,w,r} = \frac{D_{c,w+4}}{\sum_{g} \alpha_{g} R_{g} N_{c,g}/N_{c,r}}. \tag{10}
$$

The total number of weekly infections, *I*_{c,w}, can thus be estimated as:

$$
\hat{I}_{c,w} = \hat{I}_{c,w,r} \sum_{g} \alpha_{g} \frac{N_{c,g}}{N_{c,r}}. \tag{11}
$$

The cumulative number of total infections up until any given week can be estimated by summing the weekly infection counts.
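The conversion from total weekly deaths to weekly and cumulative infections can be sketched as follows. The sketch assumes that the expected deaths per infection in the reference group equal the sum over age groups of *α*_{g}*R*_{g}*N*_{c,g}/*N*_{c,r}, so that dividing the deaths reported 4 weeks later by this sum gives the reference-group infections. The function name, the two age groups and all numbers are hypothetical.

```python
import numpy as np

def weekly_infections(total_deaths, alpha, ifr, pop, ref, lag_weeks=4):
    """Estimate weekly and cumulative infections from total weekly death counts."""
    D = np.asarray(total_deaths, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    ifr = np.asarray(ifr, dtype=float)
    pop = np.asarray(pop, dtype=float)
    # Infections in each age group per infection in the reference group.
    weights = alpha * pop / pop[ref]
    deaths_per_ref_infection = np.sum(weights * ifr)
    # Deaths in week w + lag are attributed to infections in week w.
    I_ref = D[lag_weeks:] / deaths_per_ref_infection
    I_total = I_ref * np.sum(weights)
    return I_total, np.cumsum(I_total)

# Hypothetical two-group example; group 1 is the reference group.
alpha = [0.5, 1.0]
ifr = [0.001, 0.04]
pop = [4e6, 1e6]
deaths = [0, 0, 0, 0, 42, 84]  # weekly total deaths, weeks 0..5

I_total, I_cumulative = weekly_infections(deaths, alpha, ifr, pop, ref=1)
print(I_total, I_cumulative)
```

Dividing the cumulative counts by the adult population size gives the cumulative infection fraction reported in the main text.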

Exponential growth rates over time were estimated from the weekly infection counts using a sliding-window approach, as follows. In every sliding window (spanning 4 consecutive weeks), an exponential function of the form *I*(*t*) = *Ae*^{tλ} was fitted, where *t* denotes time in days and *A* and *λ* are unknown parameters (in particular, *λ* is the exponential growth rate in that window). The parameters *A* and *λ* were fitted via maximum likelihood, assuming that the total number of weekly infections, *I*_{c,w}, was Poisson distributed with expectation *Ae*^{λt_{w}}, where *t*_{w} is the time (in days) of week *w*. Under this model, the log-likelihood of the data (more precisely, of the previously estimated weekly infection counts) is:

$$
\ln L = \sum_{w} \left[ I_{c,w} \ln\left(A e^{\lambda t_{w}}\right) - A e^{\lambda t_{w}} - \ln\left(I_{c,w}!\right) \right], \tag{12}
$$

where *w* iterates over all weeks in the specific sliding window. The maximum-likelihood estimates of *A* and *λ* are obtained by solving *∂* ln *L/∂λ* = 0 and *∂* ln *L/∂A* = 0, which quickly leads to the condition:

$$
\sum_{w} I_{c,w} t_{w} \cdot \sum_{w} e^{\lambda t_{w}} = \sum_{w} I_{c,w} \cdot \sum_{w} t_{w} e^{\lambda t_{w}}. \tag{13}
$$

Equation (13) was solved numerically to obtain the maximum-likelihood estimate of *λ*; the corresponding estimate of *A* then follows from *∂* ln *L/∂A* = 0 as *A* = *∑*_{w} *I*_{c,w} / *∑*_{w} *e*^{λt_{w}}.
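One way to solve the maximum-likelihood condition numerically is bisection, sketched below. The condition is equivalent to requiring that the mean of *t* under weights *e*^{λt} equals the infection-weighted mean of *t*; since the former increases monotonically with *λ*, the root is unique. The choice of bisection, the search bounds and the function name are assumptions of this sketch, not a description of the solver used for this study.

```python
import numpy as np

def fit_growth_rate(t_days, infections, lo=-2.0, hi=2.0, tol=1e-10):
    """Poisson maximum-likelihood fit of I(t) = A * exp(lambda * t) within one window."""
    t = np.asarray(t_days, dtype=float)
    I = np.asarray(infections, dtype=float)
    t = t - t.mean()                      # shifting time rescales A but not lambda
    target = np.sum(I * t) / np.sum(I)    # infection-weighted mean time

    def weighted_mean_time(lam):
        w = np.exp(lam * t)
        return np.sum(t * w) / np.sum(w)

    # weighted_mean_time increases with lam, so bisection brackets the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_mean_time(mid) < target:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    A = np.sum(I) / np.sum(np.exp(lam * t))  # A refers to the centered time axis
    return lam, A

# Hypothetical 4-week window with exact exponential growth of 0.05 per day.
t_days = np.array([0.0, 7.0, 14.0, 21.0])
infections = 1000.0 * np.exp(0.05 * t_days)
lam_hat, A_hat = fit_growth_rate(t_days, infections)
print(lam_hat, A_hat)
```

The sketch assumes at least one nonzero infection count in the window and a growth rate within the search bounds (here ±2 per day).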

To assess estimation uncertainties stemming from sampling stochasticity and uncertainties in the infection risk ratios, I repeated the above estimations 100 times using alternative infection risk ratios (for each age group drawn randomly from the set of infection risk ratios previously fitted to various countries) and replacing in Eq. (10) the death counts *D*_{c,w+4} with values drawn from a Poisson distribution with mean *D*_{c,w+4}. Hence, rather than point-estimates, all predictions are reported in the form of medians and confidence intervals. Only infection risk ratios for which the corresponding linear curve (Eq. 4) achieved a coefficient of determination (*R*^{2}) greater than 0.5 were used (shown in Fig. 1), to avoid less accurately estimated infection risk ratios (typically obtained from countries with low death rates). Tables of all estimates for all considered countries up until November 1, 2020 are provided as Supplemental File 1; periodically updated estimates and visual summaries can be found at: www.loucalab.com/archive/COVID19prevalence
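The resampling scheme can be sketched for a single week as follows: in each of 100 repetitions, one risk ratio per age group is drawn from the set of fitted values, the death count is replaced by a Poisson draw around the reported value, and the infection estimate is recomputed. The ensemble of ratios, the two age groups and the function name are hypothetical, and drawing each age group independently is an assumption of this sketch.

```python
import numpy as np

def infection_estimate_ci(deaths, alpha_ensemble, ifr, pop, ref, n_rep=100, seed=0):
    """Median and 95% interval of the weekly infection estimate for one week."""
    rng = np.random.default_rng(seed)
    alpha_ensemble = np.asarray(alpha_ensemble, dtype=float)
    ifr = np.asarray(ifr, dtype=float)
    pop = np.asarray(pop, dtype=float)
    n_fits, n_groups = alpha_ensemble.shape
    estimates = np.empty(n_rep)
    for k in range(n_rep):
        # Pick, for each age group, a ratio fitted to a randomly chosen country.
        rows = rng.integers(0, n_fits, size=n_groups)
        alpha = alpha_ensemble[rows, np.arange(n_groups)]
        alpha[ref] = 1.0                     # reference group by definition
        weights = alpha * pop / pop[ref]
        D = rng.poisson(deaths)              # sampling noise in the death count
        estimates[k] = D / np.sum(weights * ifr) * np.sum(weights)
    return np.percentile(estimates, [2.5, 50, 97.5])

# Hypothetical ensemble: rows are calibration countries, columns are age groups.
alpha_ensemble = [[0.4, 1.0], [0.6, 1.0], [0.5, 1.0]]
ifr = [0.001, 0.04]
pop = [4e6, 1e6]

lo, med, hi = infection_estimate_ci(420, alpha_ensemble, ifr, pop, ref=1)
print(lo, med, hi)
```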

### Data availability

All data used in this manuscript are publicly available at the locations described in the Methods section.

### Competing interests

The author declares no conflict of interest.

## Code availability

All software used in this paper has been described in the Methods and is freely available online.

## Materials & Correspondence

Correspondence and requests for materials should be addressed to S.L.

## Acknowledgements

S.L. was supported by a US National Science Foundation RAPID grant #2028986.