## Abstract

**Background** During a pandemic, estimates of geographic variability in disease burden are important but limited by the availability and quality of data.

**Methods** We propose a framework for estimating geographic variability in testing effort, total number of infections, and infection fatality ratio (IFR). Because symptomatic people are more likely to seek testing, we use a noncentral hypergeometric model that accounts for differential probability of positive tests. We apply this framework to the United States (U.S.) COVID-19 pandemic to estimate county-level SARS-CoV-2 IFRs from March 1, 2020 to October 31, 2020. Using data on population size, number of observed cases, number of reported deaths in each U.S. county and state, and number of tests in each U.S. state, we develop a series of estimators to identify the number of SARS-CoV-2 infections and IFRs at the county level. We then perform a simulation and compare the estimated values to simulated values to demonstrate the validity of our approach.

**Findings** Applying the county-level estimators to the real, unsimulated COVID-19 data spanning March 1, 2020 to October 31, 2020 from across the U.S., we found that IFRs varied from 0 to 0.0273, with an interquartile range of 0.0022 and a median of 0.0018. The estimators for IFRs, number of infections, and number of tests showed high accuracy and precision; for instance, when applied to simulated validation data sets, across counties, Pearson correlation coefficients between estimator means and true values were 0.88, 0.95, and 0.74, respectively.

**Interpretation** We propose an estimation framework that can be used to identify area-level variation in IFRs and performs well to estimate county-level IFRs in the U.S. COVID-19 pandemic.

## 1 Introduction

During the course of an epidemic, accurate estimation of the infection fatality ratio (“IFR”) – defined here as the number of deaths caused by an infection per 1000 infected people – is crucial for policy and planning purposes. IFRs may vary by age distribution of the population, distribution of infection across demographic groups, prevalence of underlying health conditions in the population, access to healthcare resources, and other factors^{1,2}. Therefore, a flexible framework for estimating IFRs in different geographic regions and among different populations is important.

Accurately ascertaining both the number of deaths and the number of infections can be challenging^{3}. While the number of deaths may be under- or over-reported in some settings^{4}, even more concerning are potential inaccuracies in the ascertainment of the number of infections. While case mortality data may provide an upper bound for the IFR, the total number of infections is likely to be underestimated because infected people may not be tested or report illness. In addition, the availability and uptake of testing, as well as reliability of testing data, may also vary over time and geography. For example, during the COVID-19 pandemic, the number of SARS-CoV-2 nucleic acid amplification tests (NAATs) administered was consistently recorded at the state level across the U.S., though not at the county level^{5}. Focusing on the COVID-19 pandemic in the U.S., here we develop an inferential framework for estimating area-level variation in testing efforts, number of SARS-CoV-2 infections, and SARS-CoV-2 IFRs. Although this framework was developed to address the need for countylevel SARS-CoV-2 IFR estimates in the U.S., it can readily be applied to other geographic settings and pandemics to estimate area-level variation in outcomes.

## 2 Methods

This analysis aims to develop estimators for the SARS-CoV-2 IFRs – defined here as the number of COVID-19 deaths per 1000 SARS-CoV-2 infections – within each U.S. county. Here we focus on data from March 1, 2020 - October 31, 2020, though this approach can be applied to any time period within the pandemic and to diseases other than COVID-19. There are two primary challenges with developing these estimators for COVID-19. First, the total number of SARS-CoV-2 infections is unknown because most asymptomatic individuals, but also many symptomatic individuals, fail to get tested or report illness^{6,7,8}. Second, although reliable data are available on the number of SARS-CoV-2 NAATs administered per state, reliable testing data are not available at the county level^{5,9}. Having data on testing is needed because it allows sampling effort to be factored in; without it, for instance, it is impossible to know whether observing one or a few cases in a county is indicative of a low SARS-CoV-2 infection ratio or just a limited testing effort. What follows is an inferential framework for estimating the testing effort, number of SARS-CoV-2 infections, and ultimately SARS-CoV-2 IFRs for each county.

### 2.1 Definitions

At a given time and for a given region (county, state, or the U.S.), let *D* denote the number of COVID-19 deaths, *C* denote the number of observed COVID-19 cases, and *T* denote the number of SARS-CoV-2 NAATs administered. Let *P* be the population size of the region, assumed effectively fixed over the timescale of the pandemic. Let the number of infections (both observed and unobserved) be denoted *ι*. Let the IFR, defined as the number of COVID-19 deaths per 1000 SARS-CoV-2 infections, be denoted *ϕ*. Define *ω* as the odds ratio of testing an infected vs an uninfected individual. County-, state-, and country-level (nation- level) variables will be denoted with the subscripts *C, S*, and *N* respectively. For instance the number of observed cases in a state will be denoted *C*_{s}. Estimators will be denoted with ‘hats;’ e.g., the estimator for the number of infections in a county will be denoted . Where it is necessary to distinguish an individual county or state, an additional subscript *j* will be used; e.g., *T*_{0j} denotes the number of SARS-CoV-2 NAATs administered in county *j*.

### 2.2 Assumptions

To derive an estimator for the county-level IFR, assumptions about several quantities are necessary:

Odds ratios (

*ω*): Although it is not necessary to assume actual values of*ω*, it is necessary to assume a geographic structure for them for each time point. Here we assume that at a given time point in the pandemic, the odds ratio of testing an infected vs an uninfected individual is the same for all counties, states, and the entire U.S. That is,*ω*_{c}=*ω*_{s}=*ω*_{n}. However, the odds ratios can change through time.Global IFR (

*ϕ*_{n}): To estimate geographic variation in IFR, it is necessary to assume an baseline IFR for the entire region under consideration. In our case, we assume that the IFR across the entire U.S. was five deaths per 1000 infections from March 1, 2020 to October 31, 2020, a widely agreed upon value^{10}.Observed quantities: we assume that the population (

*P*) and numbers of COVID-19 deaths (*D*) and cases (*C*) in each county, state, and across the entire U.S. are known. Likewise, we assume that the number of tests in each state (*T*_{s}) and consequently across the whole U.S. (*T*_{n}) is known. However, we assume that the number of tests in each county is unknown (*T*_{c}). While the latter assumption may seem peculiar in light of the assumption that*T*_{s}is known, it reflects the situation that, in the U.S., the number of SARS-CoV-2 NAATs in each county was not consistently recorded during much of 2020, despite the fact that it was recorded in aggregate for each state.Number of tests in each county and state (

*T*_{c}and*T*_{s}): Because the number of NAATs in each county is unknown, it is necessary to estimate it. To do so, we assume that the number of tests administered*T*_{cj}in county*j*is related to the population size*P*_{cj}and the number of observed cases*C*_{cj}by , for some constants*α, β*, and*γ*, where*ϵ*_{cj}is distributed normally with variance zero. Thus, log*T*_{cj}is a linear function of*P*_{cj}and*C*_{cj}: log*T*_{cj}=*α*+*β*log*P*_{cj}+*γ*log*C*_{cj}+*ϵ*_{cj}. We make an analogous assumption for states, i.e., that , implying that log*T*_{sj}=*α*+*β*log*P*_{sj}+*γ*log*C*_{sj}+*ϵ*_{sj}. Empirically, these assumptions are strongly supported for states (Figure 1). Furthermore, it reasonable to suppose that for some non-negative constants*κ*and*η*, which are additionally upper-bounded by 1:*T*_{ij}=*κP*_{ij}and*C*_{ij}=*ηT*_{ij}, which is to say, that (a) the number of tests in state or county*j*is directly proportional to the population size of that state or county and (b) the number of observed cases in a state or county is directly proportional to the number of tests conducted there. Multiplying these two equations and simplifying yields:

That is, the power function assumed above.

At a given time point within a specific region, it is reasonable to conceptualize the process by which people are tested with a noncentral hypergeometric model, wherein each test is administered to either a person having or not having COVID-19 with some bias toward administering tests to symptomatic people (e.g., because sick people are more likely to seek testing^{8}; quantified by *ω*). For estimation, we utilize an approximation for the mean of the Wallenius’ noncentral hypergeometric distribution (Equation 16 in reference^{11}):
While recognizing that this expression is a close approximation, for clarity, in the following we will refer to it as an equality.

### 2.3 Estimators

We develop several moments estimators based on this equation. As per Assumptions 2 and 3, whether the quantities *T* and *ϕ* are observed and unobserved, and consequently need to be estimated, differs between regions (summarized in Table 1). We first find estimators for *ι* and *ω* at the country level, and then use these estimators to find estimators at the state and county levels (Table 3). Specifically, the estimators are derived in the following sequence:

Estimate the total number of SARS-CoV-2 infections in the country using Assumption 2 (i.e., that the IFR is five deaths per 1000) and the fact that the number of deaths is known.

Estimate the odds ratio at the county level by substituting the number of infections estimated in 1 and solving Equation (2) for

*ω*.Estimate the odds ratio at the state level via the Assumption 1 listed above.

Estimate the number of SARS-CoV-2 infections at the state level using the odds ratio from the previous step and solving Equation (2) for

*ι*.Estimate the IFR at the state level by dividing the number of deaths (observed) at the state level by the number of SARS-CoV-2 infections inferred in the previous step.

Estimate the odds ratio at the county level via Assumption 1 listed above.

Estimate the number of tests performed in each county using Assumption 4 listed above.

Estimate the number of SARS-CoV-2 infections at the county level using the odds ratio and tests estimates from the previous two steps, and solving Equation (2) for

*ι*.Estimate the IFR at the county level by dividing the number of COVID-19 deaths (observed) at the county level by the number of SARS-CoV-2 infections inferred in the previous step.

### 2.4 Validation and application

We simulated COVID-19 case data from March 1, 2020 to October 31, 2020, and then numerically assessed the performance of the estimators. Specifically, for each week of the study period:

We set the country-level odds ratio,

*ω*_{n}equal to , where*d*is the number of days elapsed since March 1, 2020. (With regard to the numerator, there were 238 days between March 1, 2020 and October 31, 2020.) As per Assumption 1, we set the state- and county-level odds ratios equal to the same country-level values.For each region, with the exception of the numbers of cases and country-wide IFR, we set the observed quantities in Table 1 equal to their known values (Table 2).

We simulated the numbers of tests in counties,

*T*_{c}, by distributing the tests in the corresponding states (*T*_{s}; an observed quantity) according to a multivariate hypergeometric distribution, with the condition that the number of simulated tests in a county could not exceed its population size (*P*_{c}; also an observed quantity).For each county, we simulated the total number of SARS-CoV-2 infections (

*ι*_{c}) as a random variate from the following distribution: ⌊0.017*· P*_{c}*U*^{4}⌋, where ⌊*·*⌋ is the floor function and*U*is a uniform [0, 1] random variate, with the condition that*ι*_{c}≥*D*_{c}. While this may seem like an arbitrary choice of distributions, it has the desirable property of resulting in a country-wide IFR (*ϕ*_{n}) of approximately 5 deaths per 1000 infections, consistent with Assumption 2.For each county, we simulated the number of COVID-19 cases in each county (

*C*_{c}) as a variate from Wallenius’ noncentral hypergeometric distribution, using the simulated odds ratio and number of SARS-CoV-2 infections from above.For each state and the entire country, we found simulated numbers of SARS-CoV-2 infections and COVID-19 cases (

*ι*_{s},*ι*_{n},*C*_{s}, and*C*_{n}) by summing the corresponding values simulated above from the counties that it contained.

Using the simulated values of the known quantities in the original data set (black quantities in Table 1) and the estimators described above, we found estimates of all of the unknown quantities (red quantities in Table 1), and also the country-wide IFR, *ϕ*_{n}. The average and variance of the estimates over 100 iterations of the simulation process yielded measures of the performance of the estimators.

## 3 Results

Applying the county-level estimators to the real, unsimulated COVID-19 data spanning March 1, 2020 to October 31, 2020 from across the U.S., we found that IFRs varied from 0 to 0.0273, with a mean of 0.0023 and a median of 0.0018 (interquartile range 0.0022). IFRs were the lowest in the Intermountain West, High Plains, and New England, but the highest in Arizona and Western New Mexico, the Upper Midwest, and scattered other regions of the U.S. The number of SARS-CoV-2 infections (not corrected for population) was highest in the Southwest and scattered other parts of the U.S., and lowest in Utah. Testing (also not corrected for population) was generally highest around large population centers. Similar trends were evident at the state level, with lower spatial resolution (Figure 2).

In our validation, we found that the mean of the county-level estimates of IFRs, number of SARS-CoV-2 infections, and tests had high accuracy (Figure 3; Pearson correlation coefficients for estimator means and true values 0.88, 0.95, and 0.74, respectively). Likewise, at the state level, the mean of the estimates for IFRs and number of infections had high accuracy and precision (Figure 4; Pearson correlation coefficients for estimator means and true values 0.95 and 0.99, respectively). Moreover, the estimators had coefficients of variation below 0.1 for almost all counties and states (Figure 5; for example, for IFR, 98.4%, 92.3%, and 11.6% of counties had coefficients of variation below 0.1, 0.05, and 0.01, respectively). In contrast, naive IFR estimates obtained simply by using the numbers of cases and case fatality ratio were biased by up to several orders of magnitude and had low precision (Figure 6; Pearson correlation coefficients for estimator means and true values were 0.21 and 0.51, respectively).

## 4 Discussion

We present an inferential framework for estimating testing effort, number of SARS-CoV-2 infections, and ultimately SARS-CoV-2 IFRs in U.S. counties. The approach relies on a noncentral hypergeometric model that accounts for differential probability of positive tests (i.e. because sick individuals are more likely to seek testing than healthy individuals^{8}) and performs well in validation testing.

Seroprevalence data have been widely used to estimate IFRs^{12,13,14}, and in limited settings, comprehensive case reporting and tracking exist^{15}. We offer the proposed framework as an extension to such methods. Our estimates require an assumption about the baseline IFR (Assumption 2), which may be based on seroprevalence or detailed tracking data. However, such data are often limited geographically and may have non-representative samples or fail to account for potential delays from onset to seroconversion^{16}. Therefore, we propose using data on testing effort, where reliably reported, to address area-level and temporal variation.

The number of excess deaths is another frequently used metric of the burden of mortality potentially related to the COVID-19 pandemic. Typically defined as the difference between the observed number of deaths and the expected number of deaths across a given time period^{17}, it includes changes in mortality that may directly or indirectly be attributable to COVID-19 (such as such as from delayed care or behavioral health crises). While useful in some settings, excess death estimates are highly sensitive to the reference time period used and negative estimates often occur in younger age strata^{18,19}. More detailed area-level estimates of mortality directly attributable to an infection are often needed.

Similar to all models, our results are dependent on the quality of the data, which has limitations. Nonetheless, an advantage of our approach is that it allows investigators to rely on the best available data and estimate quantities where data may be lacking. Our estimates use data on NAAT (also known as molecular or PCR tests) and do not include data from antigen tests including home tests, which did not receive FDA approval until late 2020. The NAAT are the most reliable for detecting infections and were the most widely used during the study period.

Our approach relies on a number of assumptions that can be relaxed to allow for generalization. First, Assumption 1 – that the odds testing an infected vs an uninfected individual is the same for all counties, states, and the entire country – can be relaxed in numerous ways. While some geographic or temporal structure must be assumed to estimate the odds ratios, which are latent variables, this structure does not need to be geographically homogeneous. With additional covariates (e.g., data on demographics, health care access), it may also be possible to further infer specific aspects of the odds ratios. Second, regarding Assumption 2, while an overall IFR must be assumed, this value can in principle be any number between 0 and 1 dependent on the disease, time, location, and other factors. Moreover, when the aim is to rank IFR in subregions relative to each other instead of estimate their precise values, the assumed overall IFR may not matter. And third, an added degree of complexity specific to COVID-19 in the U.S. is imparted here because the number of NAATs in each state, but not county, are known. In a more typical scenario, when testing effort is recorded for each subregion, one could directly make estimates for a single class of subregions (e.g., counties) directly from a superregion (e.g., country). This would omit the need for our Assumption 4 and simplify Assumption 3. By virtue of its flexibility, our approach shows promise for application at different geographic scales and may be particularly useful in settings in which resources to carry out large, representative seroprevalence studies are not available.

Our approach can be flexibly adapted over time and space. This flexibility is important because the IFR is not biologically fixed. It varies with the distribution of age, health, and other qualities of those infected as well as the medical care they receive. Local estimates may be more accurate and useful for policy planning purposes than generalized or country-level estimates. For instance, population age structures do not address all of the heterogeneity between countries IFRs based on seroprevalence^{13}. In conclusion, our approach can serve as an extension to other methods for estimation of pandemic burden to allow for greater precision around geographic and/or temporal population subgroups.

## Data Availability

All data produced in the present study are available upon reasonable request to the authors.

## 5 Acknowledgements

This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This manuscript has been coauthored by UT-Battelle, LLC under contract no. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, last accessed September 16, 2020). Work at Oak Ridge and Lawrence Berkeley National Laboratories was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act, and was facilitated by previous breakthroughs obtained through the Laboratory Directed Research and Development Programs of the Lawrence Berkeley and Oak Ridge National Laboratories. M.P.J. was supported by a grant from the Laboratory Directed Research and Development (LDRD) Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231.