Using excess deaths and testing statistics to improve estimates of COVID-19 mortalities

Factors such as non-uniform definitions of mortality, uncertainty in disease prevalence, and biased sampling complicate the quantification of fatality during an epidemic. Regardless of the employed fatality measure, the infected population and the number of infection-caused deaths need to be consistently estimated for comparing mortality across regions. We combine historical and current mortality data, a statistical testing model, and an SIR epidemic model, to improve estimation of mortality. We find that the average excess death across the entire US is 13% higher than the number of reported COVID-19 deaths. In some areas, such as New York City, the number of weekly deaths is about eight times higher than in previous years. Other countries such as Peru, Ecuador, Mexico, and Spain exhibit excess deaths significantly higher than their reported COVID-19 deaths. Conversely, we find negligible or negative excess deaths for part and all of 2020 for Denmark, Germany, and Norway.


Introduction
The novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), first identified in Wuhan, China in December 2019, quickly spread across the globe, leading to the declaration of a pandemic on March 11, 2020 [1]. The emerging disease was termed COVID-19. As of this writing in January 2021, more than 86 million people have been infected, and more than 1.8 million deaths from COVID-19 have been confirmed in more than 218 countries [2]. About 61 million people have recovered globally.
Properly estimating the severity of any infectious disease is crucial for identifying near-future scenarios, and designing intervention strategies. This is especially true for SARS-CoV-2 given the relative ease with which it spreads, due to long incubation periods, asymptomatic carriers, and stealth transmissions [3]. Most measures of severity are derived from the number of deaths, the number of confirmed and unconfirmed infections, and the number of secondary cases generated by a single primary infection, to name a few. Measuring these quantities, determining how they evolve in a population, and how they are to be compared across groups, and over time, is challenging due to many confounding variables and uncertainties.
For example, quantifying COVID-19 deaths across jurisdictions must take into account the existence of different protocols in assigning cause of death, cataloging comorbidities [4], and lag time reporting [5]. Inconsistencies also arise in the way deaths are recorded, especially when COVID-19 is not the direct cause of death but rather a co-factor leading to complications such as pneumonia and other respiratory ailments [6]. In Italy, the clinician's best judgment is called upon to classify the cause of death of an untested person who manifests COVID-19 symptoms. In some cases, such persons are given postmortem tests and, if results are positive, added to the statistics. Criteria vary from region to region [7]. In Germany, postmortem testing is not routinely employed, possibly explaining the large difference in mortality between the two countries. In the US, current guidelines state that if typical symptoms are observed, the patient's death can be registered as due to COVID-19 even without a positive test [8]. Certain jurisdictions list the dates on which deaths actually occurred, while others list the dates on which they were reported, leading to potential lag times. Some countries tally COVID-19 related deaths only if they occur in hospital settings, while others also include those that occur in private and/or nursing homes.

* Electronic address: lucasb@ucla.edu; † Electronic address: dorsogna@csun.edu; ‡ Electronic address: tomchou@ucla.edu

In addition to the difficulty in obtaining accurate and uniform fatality counts, estimating the prevalence of the disease is also a challenging task. Large-scale testing of a population in which a fraction of individuals is infected relies on unbiased sampling, reliable tests, and accurate recording of results.
One of the main sources of systematic bias arises from the tested subpopulation: due to shortages in testing resources, or in response to public health guidelines, COVID-19 tests have more often been conducted on symptomatic persons, the elderly, frontline workers and/or those returning from hot-spots. Such non-random testing overestimates the infected fraction of the population.
Different types of tests also probe different infected subpopulations. Tests based on reverse-transcription polymerase chain reaction (RT-PCR), whereby viral genetic material is detected primarily in the upper respiratory tract and amplified, probe individuals who are actively infected. Serological tests (such as enzyme-linked immunosorbent assay, ELISA) detect antiviral antibodies and thus measure individuals who have been infected, including those who have recovered.
Finally, different types of tests exhibit significantly different "Type I" (false positive) and "Type II" (false negative) error rates. The accuracy of RT-PCR tests depends on viral load which may be too low to be detected in individuals at the early stages of the infection, and may also depend on which sampling site in the body is chosen. Within serological testing, the kinetics of antibody response are still largely unknown and it is not possible to determine if and for how long a person may be immune from reinfection. Instrumentation errors and sample contamination may also result in a considerable number of false positives and/or false negatives. These errors confound the inference of the infected fraction. Specifically, at low prevalence, Type I false positive errors can significantly bias the estimation of the IFR.
Other quantities that are useful in tracking the dynamics of a pandemic include the number of recovered individuals, tested, or untested. These quantities may not be easily inferred from data and need to be estimated from fitting mathematical models such as SIR-type ODEs [9], age-structured PDEs [10], or network/contact models [11][12][13].
Administration of tests and estimation of all quantities above can vary widely across jurisdictions, making it difficult to properly compare numbers across them. In this paper, we incorporate excess death data, testing statistics, and mathematical modeling to self-consistently compute and compare mortality across different jurisdictions. In particular, we will use excess mortality statistics [14][15][16] to infer the number of COVID-19-induced deaths across different regions. We then present a statistical testing model to estimate jurisdiction-specific infected fractions and mortalities, their uncertainty, and their dependence on testing bias and errors. Our statistical analyses and source codes are available at [17].

Mortality measures
Many different fatality rate measures have been defined to quantify epidemic outbreaks [18]. One of the most common is the case fatality ratio (CFR), defined as the ratio between the number of confirmed "infection-caused" deaths $D_c$ in a specified time window and the number of infections $N_c$ confirmed within the same time window, $\mathrm{CFR} = D_c/N_c$ [19]. Depending on how deaths $D_c$ are counted and how infected individuals $N_c$ are defined, the operational CFR may vary. It may even exceed one, unless all deaths are tested and included in $N_c$.
Another frequently used measure is the infection fatality ratio (IFR), defined as the true number of "infection-caused" deaths $D = D_c + D_u$ divided by the actual number of cumulative infections to date, $N_c + N_u$. Here, $D_u$ is the number of unreported infection-caused deaths within a specified period, and $N_u$ denotes the untested or unreported infections during the same period. Thus, $\mathrm{IFR} = D/(N_c + N_u)$.
One major issue of both CFR and IFR is that they do not account for the time delay between infection and resolution. Both measures may be quite inaccurate early in an outbreak, when the number of cases grows faster than the number of deaths and recoveries [10]. An alternative measure that avoids case-resolution delays is the confirmed resolved mortality $M = D_c/(D_c + R_c)$ [10], where $R_c$ is the cumulative number of confirmed recovered cases evaluated in the same specified time window over which $D_c$ is counted. One may also define the true resolved mortality via $\mathcal{M} = D/(D + R)$, the proportion of the actual number of deaths relative to the total number of deaths and recovered individuals during a specified time period. If we decompose $R = R_c + R_u$, where $R_c$ are the confirmed and $R_u$ the unreported recovered cases, then $\mathcal{M} = (D_c + D_u)/(D_c + D_u + R_c + R_u)$. The total confirmed population is defined as $N_c = D_c + R_c + I_c$, where $I_c$ is the number of living confirmed infecteds. Applying these definitions to any specified time period (typically from the "start" of an epidemic to the date with the most recent case numbers), we observe that $\mathrm{CFR} \le M$ and $\mathrm{IFR} \le \mathcal{M}$. Long after the epidemic has passed, when the number of currently infected individuals $I$ approaches zero, the two fatality ratios and mortality measures converge if the component quantities are defined and measured consistently: $\lim_{t\to\infty} \mathrm{CFR}(t) = \lim_{t\to\infty} M(t)$ and $\lim_{t\to\infty} \mathrm{IFR}(t) = \lim_{t\to\infty} \mathcal{M}(t)$ [10].
The mathematical definitions of the four basic mortality measures $Z = \mathrm{CFR}, \mathrm{IFR}, M, \mathcal{M}$ defined above are given in Table I and fall into two categories, confirmed and total. Confirmed measures (CFR and $M$) rely only on positive test counts, while total measures (IFR and $\mathcal{M}$) rely on projections to estimate the number of infected persons in the total population $N$. Of the measures listed in Table I, the fatality ratio CFR and the confirmed resolved mortality $M$ do not require estimates of unreported infections, recoveries, and deaths and can be directly derived from the available confirmed counts $D_c$, $N_c$, and $R_c$ [20]. Estimation of the IFR and the true resolved mortality $\mathcal{M}$ requires additional knowledge of the unconfirmed quantities $D_u$, $N_u$, and $R_u$. We describe the possible ways to estimate these quantities, along with the associated sources of bias and uncertainty, below.
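The four definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas; the class name `Counts`, the function names, and all numerical values are hypothetical and are not taken from the paper's source code or data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Cumulative counts over one time window (all values hypothetical)."""
    D_c: float  # confirmed infection-caused deaths
    D_u: float  # unreported infection-caused deaths
    N_c: float  # confirmed infections
    N_u: float  # untested or unreported infections
    R_c: float  # confirmed recoveries
    R_u: float  # unreported recoveries


def cfr(c: Counts) -> float:
    """Case fatality ratio, CFR = D_c / N_c."""
    return c.D_c / c.N_c


def ifr(c: Counts) -> float:
    """Infection fatality ratio, IFR = (D_c + D_u) / (N_c + N_u)."""
    return (c.D_c + c.D_u) / (c.N_c + c.N_u)


def confirmed_resolved_mortality(c: Counts) -> float:
    """Confirmed resolved mortality, M = D_c / (D_c + R_c)."""
    return c.D_c / (c.D_c + c.R_c)


def true_resolved_mortality(c: Counts) -> float:
    """True resolved mortality, D / (D + R) with D = D_c + D_u, R = R_c + R_u."""
    deaths = c.D_c + c.D_u
    recovered = c.R_c + c.R_u
    return deaths / (deaths + recovered)


# Made-up counts chosen so that N_c >= D_c + R_c and N_c + N_u >= D + R.
example = Counts(D_c=50, D_u=10, N_c=1000, N_u=4000, R_c=700, R_u=3500)
print("confirmed:", cfr(example), confirmed_resolved_mortality(example))
print("total:    ", ifr(example), true_resolved_mortality(example))
```

With these made-up counts the confirmed pair satisfies CFR ≤ M and the total pair satisfies IFR ≤ true resolved mortality, as expected whenever some infections are still unresolved.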

Excess deaths data
An unbiased way to estimate $D = D_c + D_u$, the cumulative number of deaths, is to compare total deaths within a time window in the current year to those in the same time window of previous years, before the pandemic. If the epidemic is widespread and has appreciable fatality, one may reasonably expect that the excess deaths can be attributed to the pandemic [21][22][23][24][25]. Within each affected region, these "excess" deaths $D_e$ relative to "historical" deaths are independent of testing limitations and do not suffer from highly variable definitions of virus-induced death. Thus, within the context of the COVID-19 pandemic, $D_e$ is a more inclusive measure of virus-induced deaths than $D_c$ and can be used to estimate the total number of deaths, $D_e \approx D_c + D_u$. Moreover, using data from multiple past years, one can also estimate the uncertainty in $D_e$.

In practice, deaths are typically tallied daily, weekly [21,28], or sometimes aggregated monthly [27,29], with historical records dating back $J$ years, so that for every period $i$ there are a total of $J + 1$ death values. We denote by $d^{(j)}(i)$ the total number of deaths recorded in period $i$ of the $j$th previous year, where $0 \le j \le J$ and $j = 0$ indicates the current year. In this notation, deaths in the current year are tallied as $\sum_i d^{(0)}(i)$, where the summation runs over several periods of interest within the pandemic. Note that we can decompose $d^{(0)}(i)$ to include the contributions $d_c^{(0)}(i)$ and $d_u^{(0)}(i)$ from the confirmed and unconfirmed deaths during each period $i$, respectively. To quantify the total cumulative excess deaths, we derive the excess deaths $d_e^{(j)}(i) = d^{(0)}(i) - d^{(j)}(i)$ per period relative to the $j$th previous year. The mean excess deaths $\bar{d}_e(i)$ in period $i$, averaged over the $J$ past years, and the associated variance $\sigma_e^2(i)$ are

$$\bar{d}_e(i) = \frac{1}{J}\sum_{j=1}^{J} d_e^{(j)}(i), \qquad \sigma_e^2(i) = \frac{1}{J-1}\sum_{j=1}^{J}\left[d_e^{(j)}(i) - \bar{d}_e(i)\right]^2. \tag{1}$$

The corresponding quantities accumulated over $k$ weeks define the mean and variance of the cumulative excess deaths, $\bar{D}_e(k)$ and $\Sigma_e^2(k)$:

$$\bar{D}_e(k) = \frac{1}{J}\sum_{j=1}^{J}\sum_{i=1}^{k} d_e^{(j)}(i), \qquad \Sigma_e^2(k) = \frac{1}{J-1}\sum_{j=1}^{J}\left[\sum_{i=1}^{k} d_e^{(j)}(i) - \bar{D}_e(k)\right]^2, \tag{2}$$

where deaths are accumulated from the first to the $k$th week of the pandemic. The variances in Eqs. (1) and (2) arise from the variability in the baseline number of deaths from the same time period in the $J$ previous years. We gathered excess death statistics from over 23 countries and all US states.
Some of the data derive from open-source online repositories as listed by official statistical bureaus and health ministries [21][22][23][24][25]29]; other data are elaborated and tabulated in Ref. [27]. In some countries, excess death statistics are available only for a limited number of states or jurisdictions (e.g., Brazil). The US death statistics that we use in this study are based on weekly death data from 2015 to 2019 [29]. For all other countries, the data collection periods are summarized in Ref. [27]. Fig. A1(a-b) shows historical death data for NYC and Germany, while Fig. A1(c-d) plots the confirmed and excess deaths and their confidence levels computed from Eqs. (1) and (2). We assume that the cumulative summation is performed from the start of 2020 to the current week $k = K$, so that $\bar{D}_e(K) \equiv \bar{D}_e$ indicates excess deaths at the time of writing. Significant numbers of excess deaths are clearly evident for NYC, while Germany thus far has not experienced significant excess deaths.
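The excess-death bookkeeping described in this section can be sketched as follows. This is a minimal illustration, assuming the sample variance is normalized by $J - 1$ (the convention of `statistics.variance`) and at least two baseline years; the function names and the three-week toy data are hypothetical.

```python
from statistics import mean, variance


def weekly_excess(current, history):
    """Mean and sample variance of weekly excess deaths.

    current: deaths per week in the current year, d^(0)(i)
    history: list of J lists, deaths per week in each previous year, d^(j)(i)
    """
    means, variances = [], []
    for i, d0 in enumerate(current):
        excess = [d0 - past[i] for past in history]  # d_e^(j)(i), j = 1..J
        means.append(mean(excess))
        variances.append(variance(excess))  # 1/(J-1) normalization, needs J >= 2
    return means, variances


def cumulative_excess(current, history, k):
    """Mean and sample variance of excess deaths accumulated over weeks 1..k."""
    totals = [sum(current[:k]) - sum(past[:k]) for past in history]
    return mean(totals), variance(totals)


# Toy data: three weeks of the current year against J = 3 baseline years.
current_year = [120, 150, 200]
baseline_years = [
    [100, 100, 100],
    [110, 95, 100],
    [90, 105, 105],
]

weekly_means, weekly_vars = weekly_excess(current_year, baseline_years)
cum_mean, cum_var = cumulative_excess(current_year, baseline_years, k=3)
print(weekly_means, weekly_vars)
print(cum_mean, cum_var)
```

Note that the cumulative variance is computed from the per-year cumulative totals rather than by summing weekly variances, so week-to-week correlations within a baseline year are retained.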
To evaluate CFR and $M$, data on only $D_c$, $N_c$, and $R_c$ are required, which are tabulated by many jurisdictions. To estimate the numerators of IFR and $\mathcal{M}$, we approximate $D_c + D_u \approx \bar{D}_e$ using Eq. (2). For the denominators, estimates of the unconfirmed infected $N_u$ and unconfirmed recovered populations $R_u$ are required. In the next two sections we propose methods to estimate $N_u$ using a statistical testing model and $R_u$ using a compartmental population model.

medRxiv preprint doi: https://doi.org/10.1101/2021.01.10.21249524; this version posted January 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Figure caption: ... [26,27]. Grey solid lines and shaded regions represent the historical numbers of deaths and corresponding confidence intervals defined in Eq. (1). Blue solid lines indicate weekly deaths, and weekly deaths that lie outside the confidence intervals are indicated by solid red lines. The red shaded regions represent statistically significant mean cumulative excess deaths $\bar{D}_e$. The reported weekly confirmed deaths $d_c^{(0)}(i)$ (dashed black curves), reported cumulative confirmed deaths $D_c(k)$ (dashed dark red curves), weekly excess deaths $\bar{d}_e(i)$ (solid grey curves), and cumulative excess deaths $\bar{D}_e(k)$ (solid red curves) are plotted in units of per 100,000 in (c) and (d) for NYC and Germany, respectively. The excess deaths and the associated 95% confidence intervals given by the error bars are constructed from historical death data in (a-b) and defined in Eqs. (1) and (2). In NYC there is clearly a significant number of excess deaths that can be safely attributed to COVID-19, while to date in Germany, there have been no significant excess deaths. Excess death data from other jurisdictions are shown in the Supplementary Information and typically show excess deaths greater than reported confirmed deaths (with Germany an exception, as shown in (d)).

Statistical testing model with bias and testing errors
The total number of confirmed and unconfirmed infected individuals $N_c + N_u$ appears in the denominator of the IFR. To better estimate the infected population, we present a statistical model for testing in the presence of bias in administration and testing errors. Although the $N_c + N_u$ used to estimate the IFR includes those who have died, depending on the type of test, it may or may not include those who have recovered. If $S$, $I$, $R$, $D$ are the numbers of susceptible, currently infected, recovered, and deceased individuals, the total population is $N = S + I + R + D$ and the infected fraction can be defined as $f = (N_c + N_u)/N = (I + R + D)/N$ for tests that include recovered and deceased individuals (e.g., antibody tests), or $f = (N_c + N_u)/N = (I + D)/N$ for tests that only count currently infected individuals (e.g., RT-PCR tests). If we assume that the total population $N$ can be inferred from census estimates, the problem of identifying the number of unconfirmed infected persons $N_u$ is mapped onto the problem of identifying the true fraction $f$ of the population that has been infected.
Typically, $f$ is determined by testing a representative sample and measuring the proportion of infected persons within the sample. Besides the statistics of sampling, two main sources of systematic errors arise: the non-random selection of individuals to be tested and errors intrinsic to the tests themselves. Biased sampling arises when testing policies focus on symptomatic or at-risk individuals, leading to over-representation of infected individuals. Figure 2 shows a schematic of a hypothetical initial total population of $N = 54$ individuals in a specified jurisdiction. Without loss of generality we assume there are no unconfirmed deaths, $D_u = 0$, and that all confirmed deaths are equivalent to excess deaths, so that $D_e = D_c = 5$ in the jurisdiction represented by Fig. 2. Apart from the number of deceased, we also show the number of infected and uninfected subpopulations and label them as true positives, false positives, and false negatives. The true number of infected individuals is $N_c + N_u = 16$, which yields the true $f = 16/54 \approx 0.296$ and an $\mathrm{IFR} = 5/16 \approx 0.312$ within the jurisdiction.
Also shown in Fig. 2 are two examples of sampling. Biased sampling and testing is depicted by the blue contour, in which 6 of the 15 sampled individuals are alive and infected, 2 are deceased, and the remaining 7 are healthy. For simplicity, we start by assuming no testing errors. The measured infected fraction of this sample, $8/15 \approx 0.533 > f \approx 0.296$, is biased since it includes a higher proportion of infected persons, both alive and deceased, than that of the entire jurisdiction. Using this biased measured infected fraction of 8/15 yields $\mathrm{IFR} = 5/(0.533 \cdot 54) \approx 0.174$, which significantly underestimates the true $\mathrm{IFR} \approx 0.312$. A relatively unbiased sample, shown by the green contour, yields an infected fraction of $4/14 \approx 0.286$ and an apparent $\mathrm{IFR} \approx 0.324$, which are much closer to the true fraction $f$ and IFR. In both samples discussed above we neglected testing errors such as the false positives indicated in Fig. 2. Tests that fail to identify false positives as negatives would yield a larger $N_c$ for the blue sample, resulting in an apparent infected fraction of $9/15 = 0.6$ and an even smaller apparent $\mathrm{IFR} \approx 0.154$. By contrast, false positive testing errors on the green sample would yield an apparent infected fraction of $5/14 \approx 0.357$ and an $\mathrm{IFR} \approx 0.259$.
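The arithmetic of the Fig. 2 example can be checked with a short script. The population size, death count, and sample compositions are the ones quoted in the text; the function name `apparent_ifr` is illustrative. For the green sample with false positives, the script uses five positives among the 14 sampled individuals, which is the count consistent with the quoted apparent IFR of 0.259.

```python
N = 54              # total population in the schematic
DEATHS = 5          # D_e = D_c = 5, no unconfirmed deaths
TRUE_INFECTED = 16  # N_c + N_u


def apparent_ifr(positives, sample_size, deaths=DEATHS, population=N):
    """Apparent IFR when the sample's positive fraction is scaled to N."""
    measured_fraction = positives / sample_size
    return deaths / (measured_fraction * population)


true_f = TRUE_INFECTED / N
true_ifr = DEATHS / TRUE_INFECTED
print(f"true: f = {true_f:.3f}, IFR = {true_ifr:.4f}")

# (positives, sample size) for each sampling scenario described in the text
samples = {
    "biased (blue), no test errors": (8, 15),
    "unbiased (green), no test errors": (4, 14),
    "biased (blue), false positive counted": (9, 15),
    "unbiased (green), false positive counted": (5, 14),
}
for label, (pos, size) in samples.items():
    print(f"{label}: fraction = {pos / size:.3f}, "
          f"apparent IFR = {apparent_ifr(pos, size):.3f}")
```

Over-representation of infected individuals inflates the measured fraction and therefore deflates the apparent IFR, which is the direction of the bias discussed above.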
Given that test administration can be biased, we propose a parametric form for the apparent or measured infected fraction,

$$f_b = f_b(f, b) \equiv \frac{f e^{b}}{f\,(e^{b} - 1) + 1}, \tag{3}$$

to connect the apparent (biased sampling) infected fraction $f_b$ with the true underlying infected fraction $f$. The bias parameter $-\infty < b < \infty$ describes how an infected or uninfected individual might be preferentially selected for testing, with $b < 0$ (and $f_b < f$) indicating under-testing of infected individuals, and $b > 0$ (and $f_b > f$) representing over-testing of infecteds. A truly random, unbiased sampling arises only when $b = 0$, where $f_b = f$. Given $Q$ (possibly biased) tests to date, testing errors, and the ground-truth infected fraction $f$, we derive in the SI the likelihood of observing a positive fraction $\tilde{f}_b = \tilde{Q}^{+}/Q$ (where $\tilde{Q}^{+}$ is the number of recorded positive tests):

$$P(\tilde{f}_b \mid \theta) \approx \frac{1}{\sqrt{2\pi}\,\sigma_T}\exp\left[-\frac{(\tilde{f}_b - \mu)^2}{2\sigma_T^2}\right], \tag{4}$$

in which

$$\mu \equiv f_b\,(1 - \mathrm{FNR}) + (1 - f_b)\,\mathrm{FPR}, \qquad \sigma_T^2 \equiv \frac{\mu(1 - \mu)}{Q}. \tag{5}$$

Here, $\mu$ is the expected value of the measured and biased fraction $\tilde{f}_b$ and $\sigma_T^2$ is its variance. Note that the parameters $\theta = \{Q, f, b, \mathrm{FPR}, \mathrm{FNR}\}$ may be time-dependent and change from sample to sample. Along with the likelihood function $P(\tilde{f}_b \mid f, \theta)$, one can also propose a prior distribution $P(\theta \mid \alpha)$ with hyperparameters $\alpha$, and apply Bayesian methods to infer $\theta$ (see SI).
To evaluate IFR, we must now estimate $f$ given $\tilde{f}_b = \tilde{Q}^{+}/Q$ and possible values for FPR, FNR, and/or $b$, or the hyperparameters $\alpha$ defining their uncertainty. The simplest maximum likelihood estimate of $f$ can be found by maximizing $P(\tilde{f}_b \mid \theta)$ with respect to $f$, given a measured value $\tilde{f}_b$ and all other parameter values $\theta$ specified:

$$\hat{f} \approx \frac{\tilde{f}_b - \mathrm{FPR}}{e^{b}\,(1 - \mathrm{FNR} - \tilde{f}_b) + \tilde{f}_b - \mathrm{FPR}}. \tag{6}$$

Note that although FNRs are typically larger than FPRs, small values of $f$ and $\tilde{f}_b$ imply that $\hat{f}$ and $\mu$ are more sensitive to the FPR, as indicated by Eqs. (5) and (6). If time series data for $\tilde{f}_b = \tilde{Q}^{+}/Q$ are available, one can evaluate the corrected testing fractions in Eq. (6) for each time interval. Assuming that serological tests can identify infected individuals long after symptom onset, the latest value of $\hat{f}$ would suffice to estimate corresponding mortality metrics such as the IFR. For RT-PCR testing, one generally needs to track how $\tilde{f}_b$ evolves in time. A rough estimate would be to use the mean of $\tilde{f}_b$ over the whole pandemic period to provide a lower bound of the estimated prevalence $\hat{f}$.
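The correction step can be sketched in Python. The sketch assumes the bias form $f_b = f e^{b}/[f(e^{b} - 1) + 1]$ and the expected positive fraction $\mu = f_b(1 - \mathrm{FNR}) + (1 - f_b)\mathrm{FPR}$; the estimate $\hat{f}$ then follows by setting $\mu = \tilde{f}_b$ and inverting both relations. Function names and numerical values are illustrative, not taken from the authors' code.

```python
import math


def biased_fraction(f, b):
    """Apparent infected fraction f_b under sampling bias b (b = 0: unbiased)."""
    return f * math.exp(b) / (f * (math.exp(b) - 1.0) + 1.0)


def expected_positive_fraction(f, b, fpr, fnr):
    """Expected measured positive fraction mu, including test errors."""
    fb = biased_fraction(f, b)
    return fb * (1.0 - fnr) + (1.0 - fb) * fpr


def corrected_fraction(fb_obs, b, fpr, fnr):
    """Estimate of the true infected fraction from a measured positive fraction.

    Obtained by solving expected_positive_fraction(f, b, fpr, fnr) = fb_obs
    for f. The result is negative when fb_obs < fpr, i.e. when the measured
    positives can be explained by false positives alone.
    """
    numerator = fb_obs - fpr
    denominator = math.exp(b) * (1.0 - fnr - fb_obs) + fb_obs - fpr
    return numerator / denominator


# Round trip with made-up parameters: start from a "true" f, generate the
# expected measurement, and recover f from it.
f_true, b, fpr, fnr = 0.10, 0.5, 0.05, 0.2
mu = expected_positive_fraction(f_true, b, fpr, fnr)
f_hat = corrected_fraction(mu, b, fpr, fnr)
print(f"measured fraction {mu:.4f} -> corrected fraction {f_hat:.4f}")
```

The round trip recovers the starting fraction, which is a quick consistency check that the inversion matches the forward model; in practice $b$, FPR, and FNR are uncertain and must be supplied or inferred.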
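The measured $\tilde{f}_b$ yields only the apparent $\mathrm{IFR} = \bar{D}_e/(\tilde{f}_b N)$, but Eq. (6) can then be used to evaluate the corrected $\mathrm{IFR} \approx \bar{D}_e/(\hat{f} N)$, which will be a better estimate of the true IFR. For example, under moderate bias $|b| \ll 1$ and assuming $\mathrm{FNR}, \mathrm{FPR}, \tilde{f}_b \ll 1$, Eq. (6) relates the apparent and corrected IFRs to leading order through $\mathrm{IFR}_{\rm corrected} \approx (1 + b)\,\frac{\tilde{f}_b}{\tilde{f}_b - \mathrm{FPR}}\,\mathrm{IFR}_{\rm apparent}$. Another commonly used representation of the IFR is $\mathrm{IFR} = p\,(D_c + D_u)/N_c$, where $p = N_c/(N_c + N_u)$ is defined as the fraction of infected individuals that are confirmed [30,31]. In this alternative representation, the $p$ factor implicitly contains the effects of biased testing. Our approach allows the true infected fraction $f$ to be directly estimated from $\tilde{Q}^{+}$ and $N$.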
While the estimate $\hat{f}$ depends strongly on $b$ and FPR, and weakly on FNR, the uncertainty in $f$ will depend on the uncertainty in the values of $b$, FPR, and FNR. A Bayesian framework is presented in the SI, but under a Gaussian approximation for all distributions, the uncertainty in the testing parameters can be propagated to the squared coefficient of variation $\sigma_f^2/\hat{f}^2$ of the estimated infected fraction $\hat{f}$, as explicitly computed in the SI. Moreover, the uncertainties in the mortality indices $Z$, decomposed into the uncertainties of their individual components, are listed in Table II.

Using compartmental models to estimate resolved mortalities
Since the number of unreported recovered individuals $R_u$ required to calculate $\mathcal{M}$ is not directly related to excess deaths nor to positive-tested populations, we use an SIR-type compartmental model to relate $R_u$ to other inferable quantities [9]. Both unconfirmed recovered individuals and unconfirmed deaths are related to unconfirmed infected individuals $I_u$, who recover at rate $\gamma_u$ and die at rate $\mu_u$. The equations for the cumulative numbers of unconfirmed recovered individuals and unconfirmed deaths,

$$\frac{\mathrm{d}R_u(t)}{\mathrm{d}t} = \gamma_u(t)\, I_u(t), \qquad \frac{\mathrm{d}D_u(t)}{\mathrm{d}t} = \mu_u(t)\, I_u(t), \tag{7}$$

can be directly integrated to find $R_u(t) = \int_0^t \gamma_u(t') I_u(t')\,\mathrm{d}t'$ and $D_u(t) = \int_0^t \mu_u(t') I_u(t')\,\mathrm{d}t'$. The rates $\gamma_u$ and $\mu_u$ may differ from those averaged over the entire population since testing may be biased towards subpopulations with different values of $\gamma_u$ and $\mu_u$. If one assumes $\gamma_u$ and $\mu_u$ are approximately constant over the period of interest, we find $R_u/D_u \approx \gamma_u/\mu_u \equiv \gamma$. We now use $D_u = \bar{D}_e - D_c$, where both $\bar{D}_e$ and $D_c$ are given by data, to estimate $R_u \approx \gamma(\bar{D}_e - D_c)$ and write $\mathcal{M}$ as

$$\mathcal{M} = \frac{\bar{D}_e}{\bar{D}_e + R_c + \gamma(\bar{D}_e - D_c)}. \tag{8}$$

Thus, a simple SIR model transforms the problem of determining the number of unreported death and recovered cases in $\mathcal{M}$ to the problem of identifying the recovery and death rates in the untested population. Alternatively, we can make use of the fact that both the IFR and the resolved mortality $\mathcal{M}$ should have comparable values and match $\mathcal{M}$ to $\mathrm{IFR} \approx 0.1$-$1.5\%$ [31][32][33] by setting $\gamma \equiv \gamma_u/\mu_u \approx 100$-$1000$ (see SI for further information). Note that inaccuracies in confirming deaths may give rise to $D_c > \bar{D}_e$. Since, by definition, infection-caused excess deaths must be at least as large as the confirmed deaths, we set $\bar{D}_e - D_c = 0$ whenever the data happen to indicate that $\bar{D}_e$ is less than $D_c$.
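The resulting estimate of the true resolved mortality can be sketched as a small function. It assumes constant rates, so that $R_u \approx \gamma D_u$ with $\gamma = \gamma_u/\mu_u$, and clips $D_u = \bar{D}_e - D_c$ at zero as described above. Taking the numerator as $D_c + D_u$ (which equals $\bar{D}_e$ whenever $\bar{D}_e \ge D_c$) is an assumption of this sketch, as are the function name and the input values.

```python
def true_resolved_mortality(D_e, D_c, R_c, gamma):
    """True resolved mortality from excess deaths and confirmed counts.

    D_e:   mean cumulative excess deaths
    D_c:   confirmed deaths
    R_c:   confirmed recoveries
    gamma: ratio of recovery to death rate among unconfirmed infecteds
    """
    D_u = max(D_e - D_c, 0.0)  # unconfirmed deaths, clipped at zero
    R_u = gamma * D_u          # unconfirmed recoveries implied by the SIR model
    deaths = D_c + D_u         # equals D_e whenever D_e >= D_c
    return deaths / (deaths + R_c + R_u)


# Hypothetical jurisdiction with more excess deaths than confirmed deaths.
print(true_resolved_mortality(D_e=1200, D_c=1000, R_c=50000, gamma=100))

# Hypothetical jurisdiction where confirmed deaths exceed excess deaths:
# D_u is set to zero and the result reduces to D_c / (D_c + R_c).
print(true_resolved_mortality(D_e=900, D_c=1000, R_c=50000, gamma=100))
```

Because $R_u$ scales with $\gamma$, even a modest gap between excess and confirmed deaths adds a large number of inferred recoveries to the denominator when $\gamma$ is of order 100 to 1000.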

Results
Here, we present much of the available worldwide fatality data, construct the excess death statistics, and compute mortalities and compare them across jurisdictions. We show that standard mortality measures significantly underestimate the death toll of COVID-19 for most regions (see Figs. A1 and A2). We also use the data to estimate uncertainties in the mortality measures and relate them to uncertainties of the underlying components and model parameters.

Excess and confirmed deaths
We find that in New York City, for example, the number of confirmed COVID-19 deaths between March 10, 2020 and December 10, 2020 is 19,694 [34], significantly lower than the 27,938 (95% CI 26,516-29,360) reported excess mortality cases [21]. From March 25, 2020 until December 10, 2020, Spain counts 65,673 (99% confidence interval [CI] 37,061-91,816) excess deaths [22], a number that is substantially larger than the officially reported 47,019 COVID-19 deaths [35]. The large difference between excess deaths and reported COVID-19 deaths in Spain and New York City is also observed in Lombardia, one of the most affected regions in Italy. From February 23, 2020 until April 4, 2020, Lombardia reported 8,656 COVID-19 deaths [35] but 13,003 (95% CI 12,335-13,673) excess deaths [25]. Starting April 5, 2020, mortality data in Lombardia stopped being reported in a weekly format. In England/Wales, the number of excess deaths from the onset of the COVID-19 outbreak on March 1, 2020 until November 27, 2020 is 70,563 (95% CI 52,250-88,877), whereas the number of reported COVID-19 deaths in the same time interval is 66,197 [26]. In Switzerland, the number of excess deaths from March 1, 2020 until November 29, 2020 is 5,664 (95% CI 4,281-7,047) [24], slightly larger than the corresponding 4,932 reported COVID-19 deaths [35].
To illustrate the significant differences between excess deaths and reported COVID-19 deaths in various jurisdictions, we plot the excess deaths against confirmed deaths for various countries and US states as of December 10, 2020 in Fig. 3. We observe in Fig. 3(a) that the number of excess deaths in countries like Mexico, Russia, Spain, Peru, and Ecuador is significantly larger than the corresponding number of confirmed COVID-19 deaths. In particular, in Russia, Ecuador, and Spain the number of excess deaths is about three times larger than the number of reported COVID-19 deaths. As described in the Methods section, for certain countries (e.g., Brazil) excess death data are not available for all states [27]. For the majority of US states the number of excess deaths is also larger than the number of reported COVID-19 deaths, as shown in Fig. 3(b). We performed a least-squares fit to calculate the proportionality factor $m$ arising in $\bar{D}_e = m D_c$ and found $m \approx 1.132$ (95% CI 1.096-1.168). That is, across all US states, the number of excess deaths is about 13% larger than the number of confirmed COVID-19 deaths.
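A proportionality factor of this kind can be obtained from a least-squares fit of a line through the origin. The sketch below assumes an unweighted fit with no intercept, for which the minimizer has the closed form $m = \sum_s D_{c,s}\bar{D}_{e,s} / \sum_s D_{c,s}^2$ over jurisdictions $s$; whether the authors weighted the states differently is not stated. The three data points are made up and do not correspond to any US state.

```python
def proportionality_factor(confirmed, excess):
    """Least-squares slope m of excess = m * confirmed (no intercept)."""
    sxy = sum(x * y for x, y in zip(confirmed, excess))
    sxx = sum(x * x for x in confirmed)
    return sxy / sxx


# Made-up confirmed and excess death counts for three jurisdictions.
confirmed_deaths = [100, 200, 300]
excess_deaths = [120, 210, 330]

m = proportionality_factor(confirmed_deaths, excess_deaths)
print(f"m = {m:.4f}")
```

Because the fit minimizes squared residuals in absolute counts, jurisdictions with many deaths dominate the estimate of $m$.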

Estimation of mortality measures and their uncertainties
We now use excess death data and the statistical and modeling procedures to estimate mortality measures $Z = \mathrm{IFR}, \mathrm{CFR}, M, \mathcal{M}$ across different jurisdictions, including all US states and more than two dozen countries.¹ Accurate estimates of the confirmed infected $N_c$ and confirmed dead $D_c$ are needed to evaluate the CFR. Values for the parameters $Q$, FPR, FNR, and $b$ are needed to estimate $N_c + N_u = f N$ in the denominator of the IFR, while $\bar{D}_e$ is needed to estimate the number of infection-caused deaths $D_c + D_u$ that appear in the numerator of the IFR and $\mathcal{M}$. Finally, since we evaluate the resolved mortality $\mathcal{M}$ through Eq. (8), estimates of $\bar{D}_e$, $D_c$, $R_c$, $\gamma$, and FPR, FNR (to correct for testing inaccuracies in $D_c$ and $R_c$) are necessary. Whenever uncertainties are available or inferable from data, we also include them in our analyses.
Estimates of excess deaths and infected populations themselves suffer from uncertainty encoded in the variances $\Sigma_e^2$ and $\sigma_f^2$. These uncertainties depend on uncertainties arising from finite sampling sizes, uncertainty in the bias $b$, and uncertainty in test sensitivity and specificity, which are denoted $\sigma_b^2$, $\sigma_{\rm I}^2$, and $\sigma_{\rm II}^2$, respectively. We use $\Sigma^2$ to denote population variances and $\sigma^2$ to denote parameter variances; covariances with respect to any two variables $X, Y$ are denoted as $\Sigma_{X,Y}$. Variances in the confirmed populations are denoted $\Sigma_{N_c}^2$, $\Sigma_{R_c}^2$, and $\Sigma_{D_c}^2$ and also depend on uncertainties in the testing parameters, $\sigma_{\rm I}^2$ and $\sigma_{\rm II}^2$. The most general approach would be to define a probability distribution or likelihood for observing some value of the mortality index in $[Z, Z + \mathrm{d}Z]$. As outlined in the SI, these probabilities can depend on the mean and variances of the components of the mortalities, which in turn may depend on hyperparameters that determine these means and variances. Here, we simply assume uncertainties that are propagated to the mortality indices through variances in the model parameters and hyperparameters [37]. The squared coefficients of variation of the mortalities are found by linearizing them about the mean values of the underlying components and are listed in Table II.

¹ We provide an online dashboard that shows the real-time evolution of CFR and $M$ at https://submit.epidemicdatathon.com/#/dashboard
To illustrate the influence of different biases $b$ on the IFR, we use $\hat{f}$ from Eq. (6) in the corrected $\mathrm{IFR} \approx \bar{D}_e/(\hat{f} N)$. We model RT-PCR-certified COVID-19 deaths [38] by setting $\mathrm{FPR} = 0.05$ [39] and $\mathrm{FNR} = 0.2$ [40,41]. The observed, possibly biased, fraction of positive tests $\tilde{f}_b = \tilde{Q}^{+}/Q$ can be directly obtained from corresponding empirical data. As of November 1, 2020, the average of $\tilde{f}_b$ over all tests and across all US states is about 9.3% [42]. The corresponding number of excess deaths is $\bar{D}_e = 294{,}700$ [27] and the US population is about $N \approx 330$ million [43]. To study the influence of variations in $\tilde{f}_b$, in addition to $\tilde{f}_b = 0.093$, we also use a slightly larger $\tilde{f}_b = 0.15$ in our analysis. In Fig. 4 we show the apparent and corrected IFRs for two values of $\tilde{f}_b$ [Fig. 4(a)] and the coefficient of variation $\mathrm{CV}_{\rm IFR}$ [Fig. 4(b)] as a function of the bias $b$ and as made explicit in Table I. For unbiased testing [$b = 0$ in Fig. 4(a)], the corrected IFR in the US is 1.9% assuming $\tilde{f}_b = 0.093$ and 0.8% assuming $\tilde{f}_b = 0.15$. If $b > 0$, there is a testing bias towards the infected population; hence, the apparent $\mathrm{IFR} = \bar{D}_e/(\tilde{f}_b N)$ is smaller than the corrected IFR, as can be seen by comparing the solid (corrected IFR) and the dashed (apparent IFR) lines in Fig. 4(a). For testing biased towards the uninfected population ($b < 0$), the corrected IFR may be smaller than the apparent IFR. To illustrate how uncertainty in FPR, FNR, and $b$ affects uncertainty in the IFR, we evaluate $\mathrm{CV}_{\rm IFR}$ as given in Table II.
The first term in the uncertainty $\sigma_f^2/\hat{f}^2$ given in Eq. (A6) is proportional to $1/Q$ and can be assumed to be negligibly small, given the large number $Q$ of tests administered. The other terms in Eq. (A6) are evaluated by assuming $\sigma_b = 0.2$, $\sigma_{\rm I} = 0.02$, and $\sigma_{\rm II} = 0.05$ and by keeping $\mathrm{FPR} = 0.05$ and $\mathrm{FNR} = 0.2$. Finally, we infer $\Sigma_e$ from empirical data, neglect correlations between $D_e$ and $N$, and assume that the variation in $N$ is negligible, so that $\Sigma_{e,N} = \Sigma_N \approx 0$. Fig. 4(b) plots $\mathrm{CV}_{\rm IFR}$ and $\mathrm{CV}_{D_e}$ in the US as a function of the underlying bias $b$. The coefficient of variation $\mathrm{CV}_{D_e}$ is about 1%, much smaller than $\mathrm{CV}_{\rm IFR}$, and independent of $b$. For the values of $b$ shown in Fig. 4(b), $\mathrm{CV}_{\rm IFR}$ is between 47-64% for $\tilde{f}_b = 0.093$ and between 20-27% for $\tilde{f}_b = 0.15$.
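The propagation of parameter uncertainty to the infected-fraction estimate can be sketched numerically. The sketch is a first-order (linearized) propagation that assumes independent uncertainties in $b$, FPR, and FNR, drops the $1/Q$ sampling term, and uses the corrected-fraction expression $\hat{f} = (\tilde{f}_b - \mathrm{FPR})/[e^{b}(1 - \mathrm{FNR} - \tilde{f}_b) + \tilde{f}_b - \mathrm{FPR}]$; it is not claimed to reproduce Eq. (A6) term by term. Derivatives are taken by central finite differences, and all function names are illustrative.

```python
import math


def corrected_fraction(fb_obs, b, fpr, fnr):
    """Corrected infected fraction for a measured positive fraction fb_obs."""
    return (fb_obs - fpr) / (math.exp(b) * (1.0 - fnr - fb_obs) + fb_obs - fpr)


def cv_infected_fraction(fb_obs, b, fpr, fnr,
                         sigma_b, sigma_fpr, sigma_fnr, h=1e-6):
    """Linearized coefficient of variation of the corrected fraction."""
    f_hat = corrected_fraction(fb_obs, b, fpr, fnr)
    # Central finite-difference sensitivities to each uncertain parameter.
    d_b = (corrected_fraction(fb_obs, b + h, fpr, fnr)
           - corrected_fraction(fb_obs, b - h, fpr, fnr)) / (2 * h)
    d_fpr = (corrected_fraction(fb_obs, b, fpr + h, fnr)
             - corrected_fraction(fb_obs, b, fpr - h, fnr)) / (2 * h)
    d_fnr = (corrected_fraction(fb_obs, b, fpr, fnr + h)
             - corrected_fraction(fb_obs, b, fpr, fnr - h)) / (2 * h)
    var = ((d_b * sigma_b) ** 2
           + (d_fpr * sigma_fpr) ** 2
           + (d_fnr * sigma_fnr) ** 2)
    return math.sqrt(var) / abs(f_hat)


# Parameter values quoted in the text, evaluated at zero bias.
for fb in (0.093, 0.15):
    cv = cv_infected_fraction(fb, b=0.0, fpr=0.05, fnr=0.2,
                              sigma_b=0.2, sigma_fpr=0.02, sigma_fnr=0.05)
    print(f"measured fraction {fb}: CV of corrected fraction = {cv:.3f}")
```

At $b = 0$ this gives coefficients of variation of roughly 0.48 and 0.25 for the two measured fractions, within the ranges quoted above; since the excess-death term contributes only about 1%, the uncertainty in the infected fraction dominates $\mathrm{CV}_{\rm IFR}$.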
medRxiv preprint doi: https://doi.org/10.1101/2021.01.10.21249524; this version posted January 12, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

TAB. II (caption): [...] [37]. We use $\Sigma_N^2$, $\Sigma_{N_c}^2$, $\Sigma_{R_c}^2$, and $\Sigma_{D_c}^2$ to denote the uncertainties in the total population, confirmed cases, recoveries, and deaths, respectively. The variance of the number of excess deaths is $\Sigma_e^2$, which features in the IFR and $\mathcal{M}$. The uncertainty in the infected fraction $\sigma_f^2$ that contributes to the uncertainty in the IFR depends on uncertainties in testing bias and testing errors as shown in Eq. (A6). The term $\Sigma_{D_c,N_c}$ represents the covariance between $D_c$ and $N_c$, and similarly for all other covariances $\Sigma_{e,N}$, $\Sigma_{D_c,R_c}$, $\Sigma_{R_c,R_u}$, $\Sigma_{R_c,\gamma}$. Since variations in $D_e$ arise from fluctuations in past-year baselines and not from current intrinsic uncertainty, we can neglect correlations between variations in $D_e$ and uncertainty in $R_c$, $R_u$. In the last two rows, representing $\mathcal{M}$ expressed in two different ways, $\Gamma \equiv \bar{D}_e + R_c + R_u$ and $\bar{D}_e + R_c + \gamma(\bar{D}_e - D_c)$, respectively. Moreover, when using the SIR model to replace $D_u$ and $R_u$ with $\bar{D}_e - D_c \geq 0$, there is no uncertainty associated with $D_u$ and $R_u$ in a deterministic model. Thus, covariances cannot be defined except through the uncertainty in the parameter $\gamma = \gamma_u/\mu_u$.

Next, we compare the mortality measures $Z$ = IFR, CFR, $M$, $\mathcal{M}$ and the relative excess deaths $r$ listed in Tab. I across numerous jurisdictions. To determine the CFR, we use the COVID-19 data of Refs. [20,36]. For the apparent IFR, we use the representation IFR $= p\bar{D}_e/N_c$ discussed above. Although $p$ may depend on the stage of the pandemic, typical estimates range from 4% [44] to 10% [31]. We set $p = 0.1$ over the lifetime of the pandemic. We can also use the apparent IFR $= \bar{D}_e/(f N)$; however, estimating the corrected IFR requires evaluating the bias $b$. In Fig. 5(a), we show the values of the relative excess deaths $r$, the CFR, the apparent IFR, the confirmed resolved mortality $M$, and the true resolved mortality $\mathcal{M}$ for different (unlabeled) regions. In all cases we set $p = 0.1$ and $\gamma = 100$. As illustrated in Fig. 5(b), some mortality measures suggest that COVID-induced fatalities are lower in certain countries than in others, whereas other measures indicate the opposite. For example, the total resolved mortality $\mathcal{M}$ for Brazil is larger than for Russia and Mexico, most likely due to the relatively low number of reported excess deaths, as can be seen from Fig. 3(a). On the other hand, Brazil's values of CFR, IFR, and $M$ are substantially smaller than those of Mexico [see Fig. 5]. The distributions of all measures $Z$ and relative excess deaths $r$ across jurisdictions are shown in Fig. 5(c-g) and encode the global uncertainty of these indices. We also calculate the corresponding mean values across jurisdictions, and use the empirical cumulative distribution functions to determine confidence intervals. The mean values across all jurisdictions are $r = 0.08$ (95% CI 0.0025-0.7800), CFR = 0.020 (95% CI 0.0000-0.0565), IFR = 0.0024 (95% CI 0.0000-0.0150), $M = 0.038$ (95% CI 0.0000-0.236), and $\mathcal{M} = 0.027$ (95% CI 0.000-0.193). For calculating $M$ and $\mathcal{M}$, we excluded countries with incomplete recovery data. The distributions plotted in Fig. 5(c-g) can be used to inform our analyses of uncertainty or heterogeneity as summarized in Tab. II. For example, the overall variance $\Sigma_Z^2$ can be determined by fitting the corresponding empirical $Z$ distribution shown in Fig. 5(c-g).
[...] or on $\sigma_{\rm I}^2$ using $(1 - \hat{f})^2 \sigma_{\rm I}^2 < (\tilde{f}_b - \mathrm{FPR})^2\, \mathrm{CV}_{\rm IFR}^2$.

Finally, to provide more insight into the correlations between different mortality measures, we plot $M$ against CFR and $\mathcal{M}$ against IFR in Fig. 6. For most regions, we observe similar values of $M$ and CFR in Fig. 6(a). Although we expect $M \to$ CFR and $\mathcal{M} \to$ IFR towards the end of an epidemic, in some regions such as the UK, Sweden, the Netherlands, and Serbia, $M$ exceeds the CFR because recovered cases are unreported or incompletely reported. About 50% of the regions that we show in Fig. 6(b) have an IFR that is approximately equal to $\mathcal{M}$. Again, for regions such as Sweden and the Netherlands, $\mathcal{M}$ is substantially larger than the IFR because of incomplete reporting of recovered cases.

Relevance
In the first few weeks of the initial COVID-19 outbreak in March and April 2020 in the US, the reported death numbers captured only about two thirds of the total excess deaths [15]. This mismatch may have arisen from reporting delays, attribution of COVID-19 related deaths to other respiratory illnesses, and secondary pandemic mortality resulting from delays in necessary treatment and reduced access to health care [15]. We also observe that the number of excess deaths in the fall months of 2020 has been significantly higher than the corresponding reported COVID-19 deaths in many US states and countries. The weekly numbers of deaths in regions with a high COVID-19 prevalence were up to 8 times higher than in previous years. Among the countries analyzed in this study, the five with the largest numbers of excess deaths since the beginning of the COVID-19 outbreak (all numbers per 100,000) are Peru (256), Ecuador (199), Mexico (151), Spain (136), and Belgium (120). The five countries with the lowest numbers of excess deaths since the beginning of the COVID-19 outbreak are Denmark (2), Norway (6), Germany (8), Austria (31), and Switzerland (33) [27] (see footnote 2). If one includes the months before the outbreak, the numbers of excess deaths per 100,000 in 2020 in Germany, Denmark, and Norway are -3209, -707, and -34, respectively.

In the early stages of the COVID-19 pandemic, testing capabilities were often insufficient to resolve rapidly increasing case and death numbers. This is still the case in some parts of the world, in particular in many developing countries [45]. Standard mortality measures such as the IFR and CFR thus suffer from a time-lag problem.

Footnote 2: Note that Switzerland experienced a rapid growth in excess deaths in recent weeks. More recent estimates of the number of excess deaths per 100,000 suggest a value of 64 [26], which is similar to the corresponding excess death value observed in Sweden.

Strengths and limitations
The proposed use of excess deaths in standard mortality measures may provide more accurate estimates of infection-caused deaths, while errors in the estimates of the fraction of infected individuals in a population from testing can be corrected by estimating the testing bias and the testing specificity and sensitivity. One could sharpen estimates of the true COVID-19 deaths by systematically analyzing the statistics of deaths from all reported causes using a standard protocol such as ICD-10 [46]. For example, the mean number of traffic deaths per month in Spain between 2011 and 2016 was about 174 persons [47], so any pandemic-related changes to traffic volumes would have little impact considering the much larger number of COVID-19 deaths.
Different mortality measures are sensitive to different sources of uncertainty. Under the assumption that all excess deaths are caused by a given infectious disease (e.g., COVID-19), the underlying error in the determined number of excess deaths can be estimated using historical death statistics from the same jurisdiction. Uncertainties in mortality measures can also be decomposed into the uncertainties of their component quantities, including the positive-tested fraction $f$, which depends on uncertainties in the testing parameters.

Figure caption (continued): [...] In jurisdictions for which the data indicate $\bar{D}_e < D_c$, we set $\gamma(\bar{D}_e - D_c) = 0$ in the denominator of $\mathcal{M}$, which prevents it from becoming negative as long as $\bar{D}_e \geq 0$. All data were updated on December 10, 2020 [20,27,29,36].
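The clamping rule for jurisdictions with $\bar{D}_e < D_c$ can be sketched in a few lines. This is an illustrative sketch only: it assumes the confirmed resolved mortality is $M = D_c/(D_c + R_c)$ and the true resolved mortality is $\mathcal{M} = \bar{D}_e/\Gamma$ with $\Gamma = \bar{D}_e + R_c + \gamma(\bar{D}_e - D_c)$ (the definitions live in Tab. I, which is not reproduced here), and the input numbers are hypothetical.

```python
def confirmed_resolved_mortality(confirmed_deaths, confirmed_recovered):
    """M = D_c / (D_c + R_c) (assumed form of the Tab. I definition)."""
    resolved = confirmed_deaths + confirmed_recovered
    if resolved <= 0:
        raise ValueError("no resolved cases")
    return confirmed_deaths / resolved


def true_resolved_mortality(excess_deaths, confirmed_deaths,
                            confirmed_recovered, gamma=100.0):
    """M_true = D_e / (D_e + R_c + gamma * max(D_e - D_c, 0)).

    The unconfirmed recovered population R_u is modeled as
    gamma * (D_e - D_c) and set to zero whenever D_e < D_c.
    """
    unconfirmed_recovered = gamma * max(excess_deaths - confirmed_deaths, 0.0)
    denominator = excess_deaths + confirmed_recovered + unconfirmed_recovered
    if denominator <= 0:
        raise ValueError("non-positive denominator")
    return excess_deaths / denominator


# Hypothetical jurisdiction: 1,000 confirmed deaths, 20,000 confirmed recoveries.
m_confirmed = confirmed_resolved_mortality(1_000, 20_000)
m_true_excess_above = true_resolved_mortality(1_200, 1_000, 20_000)  # D_e > D_c
m_true_excess_below = true_resolved_mortality(900, 1_000, 20_000)    # D_e < D_c

print(m_confirmed, m_true_excess_above, m_true_excess_below)
```

In the second call the clamp removes the $\gamma$ term entirely, so the denominator reduces to $\bar{D}_e + R_c$ and the result stays non-negative for $\bar{D}_e \geq 0$.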
As for all epidemic forecasting and surveillance, our methodology depends on the quality of excess death and COVID-19 case data and on knowledge of testing parameters. For many countries, the lack of binding international reporting guidelines, testing limitations, and possible data tampering [48] complicate the application of our framework. A striking example of variability is the large discrepancy between excess deaths $D_e$ and confirmed deaths $D_c$ across many jurisdictions, which renders mortality measures that rely on $D_c$ suspect. More research is necessary to disentangle the excess deaths that are directly caused by SARS-CoV-2 infections from those that result from postponed medical treatment [15], increased suicide rates [49], and other indirect factors contributing to an increase in excess mortality. Even if the numbers of excess deaths were accurately reported and known to be caused by a given disease, inferring the corresponding number of unreported cases (e.g., asymptomatic infections), which appears in the definitions of the IFR and $\mathcal{M}$ (see Tab. I), is challenging and only possible if additional models and assumptions are introduced.
Another complication may arise if the number of excess deaths is not significantly larger than the historical mean. In that case, excess-death-based mortality estimates suffer from large uncertainty and may be meaningless. While we have considered only the average or last values of $\tilde{f}_b$, our framework can be straightforwardly extended and applied dynamically across successive time windows, using, e.g., Bayesian or Kalman filtering approaches.
Finally, we have not resolved the excess deaths or mortalities with respect to age or other attributes such as sex, co-morbidities, occupation, etc. We expect that age-structured excess deaths better resolve a jurisdiction's overall mortality. By applying our testing and modeling approaches to stratified data, one can also straightforwardly infer stratified mortality measures $Z$, providing additional informative indices for comparison.

Conclusions
Based on the data presented in Figs. 5 and 6, we conclude that the mortality measures $r$, CFR, IFR, $M$, and $\mathcal{M}$ may provide different characterizations of disease severity in certain jurisdictions due to testing limitations and bias, differences in reporting guidelines, reporting delays, etc. The propagation of uncertainty and coefficients of variation that we summarize in Tab. II can help quantify and compare errors arising in different mortality measures, thus informing our understanding of the actual death toll of COVID-19. Depending on the stage of an outbreak and the currently available disease monitoring data, certain mortality measures are preferable to others. If the number of recovered individuals is being monitored, the resolved mortalities $M$ and $\mathcal{M}$ should be preferred over CFR and IFR, since the latter suffer from errors associated with the time lag between infection and resolution [10]. For estimating IFR and $\mathcal{M}$, we propose using excess death data and an epidemic model. In situations in which case numbers cannot be estimated accurately, the relative excess deaths $r$ provides a complementary measure to monitor disease severity. Our analyses of different mortality measures reveal that
• The CFR and $M$ are defined directly from confirmed deaths $D_c$ and suffer from variability in their reporting. Moreover, the CFR does not consider resolved cases and is expected to evolve during an epidemic. Although $M$ includes resolved cases, the confirmed recovered cases $R_c$ that it additionally requires add to its variability across jurisdictions. Testing errors affect both $D_c$ and $R_c$, but if the FNR and FPR are known, they can be controlled using Eq. (A3) given in the SI.
• The IFR requires knowledge of the true cumulative number of disease-caused deaths as well as the true number of infected individuals (recovered or not) in a population. We show how these can be estimated from excess deaths and testing, respectively. Thus, the IFR will be sensitive to the inferred excess deaths and to the testing (particularly to the bias in the testing). Across all countries analyzed in this study, we found a mean IFR of about 0.24% (95% CI 0.0-1.5%), which is similar to the previously reported values between 0.1% and 1.5% [31][32][33].
• In order to estimate the resolved true mortality $\mathcal{M}$, an additional relationship is required to estimate the unconfirmed recovered population $R_u$. In this paper, we propose a simple SIR-type model in order to relate $R_u$ to measured excess and confirmed deaths through the ratio of the recovery rate to the death rate. The variability in reporting $D_c$ across different jurisdictions generates uncertainty in $\mathcal{M}$ and reduces its reliability when compared across jurisdictions.
• The mortality measures that can most reliably be compared across jurisdictions should not depend on reported data that are subject to different protocols, errors, and manipulation or intentional omission. Thus, the per capita excess deaths and relative excess deaths $r$ (see last column of Table I) are the measures that provide the most consistent comparisons of disease mortality across jurisdictions (provided total deaths are accurately tabulated). However, they are the least informative in terms of disease severity and individual risk, for which $M$ and $\mathcal{M}$ are better.
• Uncertainty in all mortalities Z can be decomposed into the uncertainties in component quantities such as the excess death or testing bias. We can use global data to estimate the means and variances in Z, allowing us to put bounds on the variances of the component quantities and/or parameters.
Parts of our framework can be readily integrated into or combined with mortality surveillance platforms such as the European Mortality Monitor (EURO MOMO) project [28] and the Mortality Surveillance System of the National Center for Health Statistics [21] to assess disease burden in terms of different mortality measures and their associated uncertainty.

Data availability
All datasets used in this study are available from Refs. [21][22][23][24][25]. The source codes used in our analyses are publicly available at [17].

Supplementary Information
Examples of excess death data

FIG. A1: Mortality evolution in different countries. The evolution of weekly deaths in New York City, Spain, England/Wales, and Switzerland for different age classes (where available). Grey solid lines and shaded regions represent the historical mean numbers of deaths and corresponding confidence intervals. Blue solid lines indicate weekly deaths, and weekly deaths that lie outside the confidence intervals are indicated by solid red lines. For England/Wales and Switzerland, weekly means and 95% confidence intervals are based on data from 2015-2019. In the case of Spain, we show the reported COVID-19 deaths across all age classes [35] in the inset and use the 99% confidence intervals that are directly provided in the corresponding data [26]. The red shaded regions represent the mean cumulative excess deaths $\bar{D}_e$. The data are derived from Refs. [21][22][23][24][25].
FIG. A2: [...] $d_c(i)$ (dashed black curves) and cumulative deaths $D_c(k)$ (dashed dark red curves) with weekly excess deaths $\bar{d}_e(i)$ (solid grey curves) and cumulative excess deaths $\bar{D}_e(k)$ (solid red curves). The deaths are plotted in units of per 100,000 in different countries and regions. The data are derived from Ref. [27] and the error bars for the excess deaths are derived from Eqs. (1) and (2). For Spain, we used the 99% confidence intervals that are directly provided in the corresponding data [26] to approximate the 95% confidence intervals. Typically, we find $\bar{D}_e(k) > D_c(k)$.

We tally weekly deaths according to Eq. (1) for each week $i$ starting from the first week of 2020, and cumulative excess deaths as in Eq. (2), adding all weekly contributions from the first week of 2020 onwards. Note that some governmental agencies, such as those of the United States, tabulate weekly deaths starting on the Sunday closest to January 1, 2020 (December 29, 2019), whereas others, such as those of Germany, use January 1, 2020 as the first day of the week. A detailed list of how each country bins weekly deaths is included in Ref. [27]. The final week $k$ up to which the cumulative count is taken depends on data availability, since some countries have larger reporting delays than others. In the majority of cases, $k$ is beyond the fourth week of November 2020. Quantities are calculated from data that typically include deaths from $J = 5$ previous years [27]. In Fig. A2 we plot the weekly confirmed deaths $d_c(i)$ and the mean weekly excess deaths $\bar{d}_e(i)$ for 2020 as available from data. We also show the mean cumulative excess deaths $\bar{D}_e(k)$ per 100,000 persons from the start of 2020 using Eqs. (1) and (2). The corresponding error bars in Fig. A2 indicate 95% confidence intervals defined by $\bar{d}_e(i) \pm 1.96\,\sigma_e(i)$ and $\bar{D}_e(k) \pm 1.96\,\Sigma_e(k)$ in Eqs. (1) and (2), respectively. For Spain, we used the 99% confidence intervals that are directly provided in the corresponding data [26] to approximate the 95% confidence intervals.

Excess death statistics evolve differently across different countries and regions. For example, in France excess deaths were negative until the end of March 2020 and increased quickly in April 2020. In Ecuador and Peru, the number of excess deaths is more than 2.5 times larger than the corresponding number of confirmed COVID-19 deaths.
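The tallying procedure described above can be sketched as follows. Eqs. (1) and (2) are not reproduced in this section, so the sketch makes explicit assumptions: the weekly excess relative to baseline year $j$ is the 2020 count minus that year's count for the same week, means are taken over the $J$ baseline years, and $\sigma_e(i)$ and $\Sigma_e(k)$ are sample standard deviations over those years. All names and the toy numbers are hypothetical.

```python
from itertools import accumulate
from statistics import mean, stdev


def excess_death_statistics(current, previous_years):
    """Return per-week (mean, lower, upper) tuples for weekly and cumulative
    excess deaths, with 95% intervals given by mean +/- 1.96 * std.

    current:        weekly deaths in the current year (weeks 1..k)
    previous_years: J >= 2 baseline series with at least k weeks each
    """
    n_weeks = len(current)
    if len(previous_years) < 2:
        raise ValueError("need at least two baseline years")
    if any(len(year) < n_weeks for year in previous_years):
        raise ValueError("baseline series shorter than current series")

    # Weekly excess relative to each baseline year j (assumed form of Eq. (1)).
    weekly = [[current[i] - year[i] for i in range(n_weeks)]
              for year in previous_years]
    # Cumulative excess per baseline year (assumed form of Eq. (2)).
    cumulative = [list(accumulate(series)) for series in weekly]

    def summarize(per_year):
        summary = []
        for i in range(n_weeks):
            values = [series[i] for series in per_year]
            m, s = mean(values), stdev(values)
            summary.append((m, m - 1.96 * s, m + 1.96 * s))
        return summary

    return summarize(weekly), summarize(cumulative)


# Toy data: three weeks of 2020 and J = 3 baseline years.
current_2020 = [110, 130, 160]
baseline_years = [[100, 100, 100], [104, 98, 102], [96, 102, 98]]
weekly_stats, cumulative_stats = excess_death_statistics(current_2020,
                                                         baseline_years)
print(weekly_stats[-1], cumulative_stats[-1])
```

Dividing the results by the population in units of 100,000 gives per-100,000 values of the kind plotted in Fig. A2.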

Statistical testing model
Given biases in sampling and testing errors, it is important to use a statistical testing model that takes them into account when estimating the fraction $f$ of a population of size $N$ that is infected. Testing bias arises, for example, if symptomatic individuals are more likely to seek testing. Thus, the probability $f_b$ that an individual who chooses to be tested is positive may differ from $f$, the probability that a randomly selected individual is positive, as defined in Eq. (3). If all tests are error-free, the probability that $Q^+$ positive results arise from the $Q \geq Q^+$ administered tests is given by

$$P_{\rm true}(Q^+|Q, f_b) = \binom{Q}{Q^+} f_b^{Q^+} (1 - f_b)^{Q - Q^+}. \tag{A1}$$

Eq. (A1) is derived under the assumption that once individuals are tested, they are "replaced" in the population and can be tested again. The analogous distribution $P_{\rm true}(Q^+|Q, f_b)$ for testing "without replacement" can be straightforwardly derived and yields results quantitatively close to Eq. (A1) provided $Q/N \lesssim 0.3$.
Eq. (A1) also assumes flawless testing. Tests subject to Type I (false positive) and Type II (false negative) errors may wrongly classify uninfected individuals as infected (with rate FPR) while missing some infected individuals (with rate FNR). For serological COVID-19 tests, such as antibody tests, the estimated percentages of false positives and false negatives are typically low, with FPR ≈ 0.03-0.07 and FNR ≈ 0.1 [39,50,51]. For RT-PCR tests, the FNRs depend strongly on the actual assay method [52,53] and typically lie between 0.1 and 0.3 [40,41] but might be as high as FNR ≈ 0.68 if throat swabs are used [39,41]. FNRs can also vary significantly depending on how long after initial infection the test is administered [54]. A systematic review conducted worldwide found FNR ≈ 0.54 at initial testing [55], underlining the need for retesting. Reported percentages of false positives in RT-PCR tests are about FPR ≈ 0.05 [39]. A large meta-analysis of serological tests estimates FPR ≈ 0.02 and FNR ≈ 0.02-0.16 [54]. These testing errors can lead to inaccurate estimates of disease prevalence; uncertainty in FPR and FNR will thus lead to uncertainty in the estimate of prevalence.
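As illustrated in Fig. 2, errors in testing may cause the recorded number $\tilde{Q}^+$ of positive tests to differ from the number $Q^+$ that would be obtained under perfect testing. The probability that $\tilde{Q}^+$ positive tests are returned due to testing errors can be described in terms of $Q^+$, FPR, and FNR, and the corresponding probability distribution $P_{\rm err}(\tilde{Q}^+|Q^+, \mathrm{FPR}, \mathrm{FNR})$ is given by

$$P_{\rm err}(\tilde{Q}^+|Q^+, \mathrm{FPR}, \mathrm{FNR}) = \sum_{p^+=0}^{\tilde{Q}^+} \binom{Q^+}{p^+} (1 - \mathrm{FNR})^{p^+} \mathrm{FNR}^{Q^+ - p^+} \binom{Q - Q^+}{q^+} \mathrm{FPR}^{q^+} (1 - \mathrm{FPR})^{Q - Q^+ - q^+},$$

where $q^+ \equiv \tilde{Q}^+ - p^+$. By convolving $P_{\rm err}(\tilde{Q}^+|Q^+, \mathrm{FPR}, \mathrm{FNR})$ with $P_{\rm true}(Q^+|Q, f_b)$ we derive the overall likelihood distribution for the measured number $\tilde{Q}^+$ of true and false positives given a set of specified parameters $\theta = \{Q, f, b, \mathrm{FPR}, \mathrm{FNR}\}$ describing the population and testing:

$$P(\tilde{Q}^+|Q, f, b, \mathrm{FPR}, \mathrm{FNR}) = \sum_{Q^+=0}^{Q} P_{\rm err}(\tilde{Q}^+|Q^+, \mathrm{FPR}, \mathrm{FNR})\, P_{\rm true}(Q^+|Q, f_b(f, b)).$$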
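The convolution of $P_{\rm err}$ with $P_{\rm true}$ can be checked numerically for a small number of tests. The sketch below assumes that $P_{\rm true}$ is binomial in $f_b$ (testing with replacement) and that $P_{\rm err}$ is a sum over the number $p^+$ of detected infections, with the remaining $q^+$ observed positives being false positives among the $Q - Q^+$ uninfected individuals tested. The function names are illustrative, and $f_b$ is passed in directly rather than computed from $f$ and $b$ via Eq. (3).

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 outside 0..n)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_err(obs_pos, true_pos, n_tests, fpr, fnr):
    """Probability of observing obs_pos positives given true_pos infected
    individuals among the n_tests tested."""
    return sum(
        binom_pmf(p, true_pos, 1 - fnr)                   # p+ true positives
        * binom_pmf(obs_pos - p, n_tests - true_pos, fpr)  # q+ false positives
        for p in range(min(obs_pos, true_pos) + 1)
    )


def likelihood(obs_pos, n_tests, f_b, fpr, fnr):
    """Convolution of p_err with the error-free binomial P_true."""
    return sum(
        p_err(obs_pos, q_pos, n_tests, fpr, fnr) * binom_pmf(q_pos, n_tests, f_b)
        for q_pos in range(n_tests + 1)
    )


# Small illustrative example with the error rates used in the main text.
Q_TESTS, F_B, FPR, FNR = 30, 0.15, 0.05, 0.2
pmf = [likelihood(k, Q_TESTS, F_B, FPR, FNR) for k in range(Q_TESTS + 1)]
mean_observed = sum(k * prob for k, prob in enumerate(pmf))
print(sum(pmf), mean_observed)
```

Under these assumptions each test independently returns a positive result with probability $f_b(1 - \mathrm{FNR}) + (1 - f_b)\,\mathrm{FPR}$, so the convolution collapses to a single binomial in $\tilde{Q}^+$ with that success probability. This gives a convenient consistency check on any implementation.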
When $Q^+$, $\tilde{Q}^+$, and $Q \gg 1$, we can approximate $P_{\rm true}$, $P_{\rm err}$, and $P$ by normal distributions and rewrite $P$ as a function of the observed positive fraction $\tilde{f}_b \equiv \tilde{Q}^+/Q$ (Eqs. (4) and (5)).
Using Bayes' rule, we can then formally define the likelihood of $\theta$ given a measured $\tilde{f}_b$, $P(\theta|\tilde{f}_b, \alpha) \propto P(\tilde{f}_b|\theta)\, P_0(\theta|\alpha)$