A Method to Identify the Missing COVID-19 Cases in the U.S. and Results for mid-April 2020

I use the COVID-19 death rate in South Korea and a method relating the ratio of death rates in a U.S. state to its share of cumulative positive tests to estimate the total cases of COVID-19 in the U.S. and to estimate the extent of infection and the unidentified share of the infected population in each of the lower-48 states and in New York City in mid-April, 2020. I identify a logarithmic relationship between the cumulative death rate in a state and its cumulative positive share of tests. Using this relationship, I find that 4.3-5.4 million people, 1.4-1.7% of the U.S. population, were infected, with rates of infection that ranged from 0.1% in more rural states to 8-10% in New York state and 11-13% in New York City. Only 16-20% of these infected individuals were identified later through testing.

1 Extremely limited testing for COVID-19 until recently and the uncertain share of asymptomatic cases has led to considerable speculation about the extent of infection in the U.S. population. In this paper I use state level data on positive COVID-19 test rates and associated deaths, combined with an estimate of the death rate from the virus in South Korea, to estimate the number of coronavirus cases in each state, in New York City, and in the U.S. in total. The methodology presented in this paper is similar to the methodology in Breton [2020], but the data used in this paper to estimate the share of infection covers a longer period and the interpretation of the model's results has been revised based on new information.

I. Methodology
I have developed a methodology to estimate the unidentified cases of COVID-19 in the U.S. that utilizes death rates in South Korea and in the lower-48 states of the U.S. and the share of cumulative positive tests in the lower-48 states. The methodology is based on a series of assumptions: 1) Unlike all other countries, South Korea's documented experience with the coronavirus is accurate and complete. 2) South Korea and the U.S. are sufficiently similar in their medical treatment capacity to have similar actual death rates (deaths/infected individuals) for COVID-19 for similar age distributions of patients.
3) The reported share of positive tests in a U.S. state or region is an indicator of the share of infected individuals tested, with a higher positive share inversely related to the share of total cases identified by tests 4) The expected death rate in a state associated with its share of positive tests divided by the death rate in South Korea provides a valid ratio of the total cases/identified cases in the state.
I do not use the death rate in each state directly to estimate unidentified cases in the U.S, because state death rates vary based on the particular populations infected in the state (e.g., the elderly in nursing homes). Instead I use the death rates in all of the states to estimate a relationship between the expected death rate and the share of cumulative positive tests in a state. I do this because the share of positive tests in a state is likely to relate more closely to the share of COVID-19 cases that have been identified than the state's particular death rate.
The critical element in implementing my method is to account properly for the lags between contagion, the onset of symptoms, the reporting of test results, and deaths. The typical lag between contagion and death is 23-27 days [Boyd, 2020]. Due to delays in U.S. testing in April 2020, this means that positive tests for COVID-19 identified infection that occurred 8-12 days earlier and relate to deaths that occurred 11-19 days later.
I use the estimated average lag between the reporting of COVID-19 positive tests and death to identify the related deaths and calculate the death rates in South Korea and in the lower-48 states that correspond to the reported cases in early April. I use an assumed lag of 19 days from the onset of symptoms to death to calculate the death rates. This lag is consistent with the estimate of 18-21 days in the UK [Boyd, 2020] and Verity et al.'s [2020] estimate of 17.8 days in China. I use reports on testing delays to estimate the lag between test results and deaths.
I then plot the death rate in each state vs. the share of positive tests on April 7 th and use a logarithmic trendline to determine the relationship between these two variables in the state data. I calculate the ratio between the estimated death rate in the U.S. and in South Korea to determine the relationship between the actual total cases and the reported cases in each state or in a subregion of a state. I use this relationship to estimate the identified and unidentified shares of the total cases in the states on April 25th, which in total provides a national estimate of the infected population in the U.S. in mid-April. I validate the methodology by examining whether the estimate of the unidentified share of cases in a state is consistent with its reported testing strategy and the availability of testing in March and early April. Since the states did not test asymptomatic individuals, I also examine whether the estimated shares of unidentified cases are consistent with an estimate of the asymptomatic share of all cases in a screening study carried out in Iceland.
In this study 100 of 13,080 (0.6%) participants tested positive and were quarantined for 14 days, during which 43% reported that they were asymptomatic [Gudbjartsson, et al., 2020]. The study excluded individuals with severe symptoms that would be hospitalized. A WHO-China joint mission determined that about 20% of all symptomatic cases in China required hospitalization [Verity, 2020]. Assuming this ratio is applicable in Iceland, I reduced the estimated asymptomatic share to 38% and use this share to further test whether my estimates of the range of unidentified shares of COVID-19 cases in the U.S. are plausible.

II. The Death Rate in South Korea
The death rate used in this study is the share of positive cases of COVID-19 that end in death. I use data from South Korea to estimate the actual death rate in the U.S.
The government of South Korea implemented a contact tracing strategy to identify all of the individuals infected with COVID-19 in the country, including those with no symptoms, and then tracked these individuals to determine the associated deaths. Table 1 presents the data for the cumulative cases of the virus and the associated deaths in South Korea during April. Since the number of new cases declined to almost zero during the month, there must not have been many asymptomatic individuals who were not tested, identified, and quarantined. This means that South Korea is unique among countries in including all of its asymptomatic cases in the national estimate of total COVID-19 cases.
The Government of the Republic of Korea [2020] reports that in early April the delay between testing and reporting results was 2-3 days. I assume 19 days between the onset of symptoms and death and three days between the onset of symptoms and test confirmation, which leaves 16 days as the average lag between confirmed test results and death.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 As shown in Table 1, there were 10,156 cumulative confirmed cases of the virus in South Korea on April 4 th , and there were 236 cumulative deaths from the virus sixteen days later on April 20 th . This yields a 2.3% death rate for all cases. This rate could be low if some deaths related to the cases are yet to occur or are not included in the statistics, but this is unlikely. The data in the table show that in the two days prior to April 20 th there were only 21 new cases and two new deaths. These numbers are so small that slight increases would have no noticeable effect on the ratio of cumulative deaths/cumulative cases in mid-April. In addition, deaths are unlikely to be underestimated because South Korea tested all suspected cases and attributed all deaths of persons who tested positive for the coronavirus to the virus, even if they had other underlying health conditions [AFP-JiJi, 2020].
The average death rate for COVID-19 in a country is very sensitive to the age distribution of infected individuals in a country because death rates are much higher for older patients. The South Korean death rate is only applicable for the U.S. if the age distributions of COVID-19 patients are similar in both countries.
The age distribution of all cases is known in South Korea, but it is not known in the U.S. because such a small fraction of the COVID-19 cases have been tested. The distributions of the known cases in the U.S. in mid-April includes 40% in the over-60 category compared to 25% in South Korea [Breton, 2020]. The higher reported incidence in the U.S. is due, at least in part, to the failure to test individuals who were asymptomatic or had mild symptoms, a category where younger individuals are more prevalent. Taking this characteristic into account, the age distributions of infected individuals do not seem very different in the two countries, so I conclude the South Korean death rate of 2.3% is applicable for the actual death rate in the U.S. when all cases of COVID-19 are included.

III. U.S. Data on the Share of Positive Tests and the Associated Death Rates
Several groups track the numbers of U.S. tests, confirmed cases, and deaths from COVID-19 by state and publish their results several times a day. These data can be used to . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 calculate the shares of individuals that test positive and the death rates associated with the confirmed cases.
The reported cases of COVID-19 in the U.S. in April 2020 are a subset of the actual number of cases because the U.S. did not have sufficient testing capacity to test all of the individuals with symptoms of the disease. Due to the limited tests available, the CDC issued guidelines to the states to prioritize testing [Connor, 2020]: • Priority one: Hospitalized patients and symptomatic healthcare workers • Priority two: High-risk patients with coronavirus symptoms • Priority three: Symptomatic individuals in the community, if resources allow None of these priorities included asymptomatic individuals.
The likelihood that tests would be positive was highest for individuals in Priority one and lower in Priorities two and three. As a result, as the number of tests increases, testing is extended to groups less likely to test positive, and the positive share of cumulative tests declines. States (or cities) overwhelmed with coronavirus cases tested only a small fraction of the individuals they thought were infected and obtained a very high share of positive tests. Figure 1 shows the share of cumulative positive tests by state for the lower-48 states and South Korea on April 7 th , using U.S. data published in Politico [Jin, 2020]. These data show that about half the states had shares of positive tests under 10%, with the lowest share at 3%. All of the states had a higher share of cumulative positive tests than South Korea (2%). Five states had shares of cumulative positive tests over 20%, and three states had shares over 30%. These three states, New Jersey, New York, and Michigan, reported tremendous limitations on their capacity to test individuals with COVID-19 symptoms. Maag [2020] reports that in late March 2020 very sick New Jersey residents spent many nights in their cars waiting in line at testing stations, after being turned away in emergency rooms. In late March New York City restricted testing to hospitalized patients to prevent the many infected individuals with less serious symptoms from leaving home [Cuzey, 2020]. Michigan restricted testing to those with the most serious symptoms until testing was extended to those with mild symptoms in mid-April [Clarke, 2020].
Until mid-March even states with lower shares of cumulative positive tests, such as California and Rhode Island, reported that they could not test everyone who needed a test [Becker, 2020 andMooney, 2020]. Even in mid-April, after testing capacity in the U.S. improved relative to the number of suspected COVID-19 cases, tests in most locations were still rationed to the higher priority applications [Connor, 2020].
Death rates (deaths/infected population) for COVID-19 calculated from the confirmed cases for most states are much higher than the 2.3% rate in South Korea, even though there is no reason to expect that a higher share of Americans is likely to die from the virus. The most . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 likely explanation for the higher death rates in the U.S. is that a large share of infected individuals are not included in the data.

Figure 1 Positive Share of Cumulative Coronavirus Tests on April 7, 2020
If this is the case, then we should also expect to see a strong positive correlation between the death rates and the share of cumulative positive tests across states. A higher share of individuals testing positive is almost certain to indicate that the COVID-19 cases more difficult to identify are missing. The reported lag for test results from LabCorp and Quest Diagnostics on April 8 th was 3-4 days [Strickler and Kaplan, 2020]. With a total lag of 19 days between the onset of symptoms and death, I assume that the average time lag between the onset of symptoms and test results on April 7 th was 5 days. This leaves a 14-day average lag between reports of cases of coronavirus and the deaths associated with these cases.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 The data for the death rate and the positive share of tests in the figure show a definite positive logarithmic relationship (R 2 = 0.30). As the positive share of cumulative tests rises across states, the associated death rate rises, but at a decreasing rate. The share of cumulative positive tests in the trendline associated with the assumed 2.3% true death rate is 1%, which in the absence of tracking, is consistent with the 2% share of positive tests at that death rate in South Korea.

Figure 2 Apparent U.S. Death Rates vs. Positive Shares of Tests on April 7th
The estimated relationship of the death rate (deaths/infected population) to the positive share of cumulative tests is: 1) Death rate = 0.0289 ln(positive share of tests) + 0.155 The data used to estimate this relationship are provided in Breton [2020].
This relationship can be used to estimate the ratio of the actual infected population to the identified cases when the death rate equals 2.3% as a function of the positive share of cumulative tests in each state: 2) TotaI Cases/Identified Cases = 1.257 ln(positive share of tests) + 6.757 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 Equation (2) can be used to estimate the total number of actual COVID-19 cases in each state or in a subregion of a state. The sum of all the actual cases in the states can be used to estimate the infected share of the U.S. population.
Equation (2) is based on the assumption that the reported deaths due to COVID-19 in the U.S. are as complete as the reported deaths in South Korea. As discussed earlier, South Korea tracked and tested all individuals likely to have COVOD-19 and attributed any deaths of patients who tested positive for COVID-19 to the virus. This is not the case in the U.S. because some deaths from the virus involved patients who died at home or in nursing homes without ever being tested for the virus. These deaths were attributed to pneumonia, to some underlying condition, or to some fatal physiological response to the virus, such as a stroke.
Coroners are now reviewing deaths that occurred over the last few months and increasing the deaths attributed to COVID-19, but these deaths are not included in the New York Times [2020] data. This means that the death rate in Equation (1) and the ratio of total cases/identified cases in Equation (2) are underestimated. Wu, McCann, Katz, and Peltier [2020] have compared the surge in mortality in New York City from March 11, 2020 to April 25, 2020 (relative to the average for the same period in 2017-2019) to the number of COVID-19 deaths in the New York Times data. The sudden increase in deaths in 2020 is 25% higher (4200/16673) than the reported COVID-19 deaths, which could be unreported COVID-19 deaths. They show similar trends in numerous European countries.
Given this information, the estimates from Equations (1) and (2) should be considered a lower bound on U.S. COVID-19 death rates and on the ratio of total/reported cases of COVID-19. A more reasonable estimate of these two variables appears to be a range from the values estimated by the Equations to values that are 25% higher.
The validity of this estimated range for total COVID-19 cases in each state can be evaluated by examining the plausibility of the estimated unidentified shares of cases in the states with the lowest and highest shares of positive tests in April, given the reported availability of tests in those states in late March and early April.
Applying Equation (2) and a value 25% higher in Montana, the state with the lowest share of positive tests (3.7%) on April 25th, indicates that the state identified 31-38% of its COVID-19 cases. Since 38% of all cases are estimated to be asymptomatic, if the state did not identify any of these cases, then it identified 50-61% of the symptomatic cases. Since most cases are mild or moderate, with symptoms not obviously due to Covid-19, this estimate seems reasonable, even given the relatively high level of testing in the state.
Applying Equation (2) and a value 25% higher in New Jersey, the state with the highest share of positive tests (49.6%) on April 25th, indicates that the state identified only 14-17% of its COVID-19 cases. Since 38% of the cases are estimated to be asymptomatic, if the state did not identify any of these cases, then it identified 23-27% of the symptomatic cases. Given the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 tremendous difficulty encountered by individuals seeking tests in New Jersey in late March, and the likelihood that a portion of the severe cases ending in death were not identified, this estimate also seems reasonable. Table 2 presents the results for an application of the model and values 25% higher in each state, using Jin's [2020] data for the share of cumulative positive tests on April 25 th . The state results are used to calculate results for the U.S. as a whole, which due to the lag in testing and obtaining test results, measure the levels of infection about 10 days earlier. For each state the table shows the minimum ratio of total cases/reported cases, the minimum share of unidentified cases, and the range of infected shares of the population in mid-April.

IV. Estimated Total and Unidentified Coronavirus Cases
What is striking is that most of the cases in every state are unidentified. Overall, only 16-20% of all cases have been confirmed through testing, leaving 80-84% unidentified. The minimum unidentified share ranges from 60% to 83% across states. The total number of cases in the U.S. was 4.3-5.4 million, yielding an infected share of the population of 1.4-1.7%.
The rate of infection varies dramatically across states, from about 0.1% in Montana and West Virginia to a range of 7.6 to 9.5% in New York. Applying the model to New York City using testing data from New York State [2020] for April 25 th , indicates that the rate of infection in the City in mid-April was 10.7-13.4%. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10.1101

V. Discussion
The U.S. testing to identify cases of COVID-19 has been completely inadequate to determine the extent of infection in the population. In the absence of sufficient tests, this study uses the death rates and the shares of cumulative positive tests in the lower-48 states and the death rate in South Korea to estimate the size of the population infected with COVID-19 and the share in each state that has been identified.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101 The results indicate that 1.4-1.7% of the U.S. population was infected with the virus in mid-April. This share is nowhere near enough to create herd immunity, but it is far too many people to permit contact tracing in the states with the highest infection rates. South Korea managed to control the spread of the virus through a contact tracing process, but they only had 10,600 total cases.
The results indicate that no state has managed to identify even half of the COVID-19 cases, so it is difficult to control the spread. But it is also clear that not all states have been severely affected by the virus, with rates of infection in mid-April ranging from as little as 0.1% in rural states to a high of 7.6-9.5% in New York State.
The wide differences in rates of infection indicate that different strategies are appropriate to manage the virus in different states. Contact tracing would be feasible, at least in theory, in the states with the lower rates of infection, as South Korea managed to do it with a 0.2% infected share of its population.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted May 5, 2020. . https://doi.org/10. 1101