The Social and Economic Factors Underlying the Impact of COVID-19 Cases and Deaths in US Counties

This paper uncovers the socioeconomic and health/lifestyle factors that can explain the differential impact of the coronavirus pandemic on different parts of the United States. Using a dynamic panel model with the daily reported number of cases for US counties over a 20-day period, the paper develops a Vulnerability Index for each county from an epidemiological model of disease spread. County-level economic, demographic, and health factors are used to explain the differences in the values of this index and thereby the transmission and concentration of the disease across the country. These factors are also used in a zero-inflated negative binomial pooled model to examine the number of reported deaths. The paper finds that counties with high per capita personal income have a high incidence of both reported cases and deaths. The coefficient on the unemployment rate is negative for deaths, implying that places with low unemployment rates, or higher economic activity, have higher reported deaths. Counties with higher income inequality as measured by the Gini coefficient experienced more deaths and reported more cases. There is a remarkable similarity between the distribution of cases across the country and the distribution of distance-weighted international passengers served by the top international airports. Counties with high concentrations of non-Hispanic Blacks, Native Americans, and immigrant populations have a higher incidence of both cases and deaths. The distributions of health risk factors such as obesity, diabetes, and smoking are found to be particularly significant in explaining the differences in mortality across counties. Counties with higher numbers of primary care physicians have fewer deaths, and so do places with fewer hospital stays for preventable causes. Stay-at-home orders are found to be associated with places of higher cases and deaths, implying that they were perhaps imposed far too late to have contained the virus in the places with high-risk populations. It is hoped that research such as this will help policymakers to develop risk factors for each region of the country to better contain the spread of infectious diseases in the future.


Introduction
The novel coronavirus disease, COVID-19, has brought the global economy to a screeching halt. It is sweeping through the United States, and the country now leads not only in the total number of positive cases but also in the number of reported deaths. New York's total number of cases exceeds the total number reported for any other country, including China.
The virus is believed to have originated in Wuhan, China in late 2019. As the virus spread through Wuhan and the rest of China, it raised alarms across scientific communities and governments around the world. With every passing day the virus continued to spread exponentially. The impact of the virus in the United States started to grab public attention in late February, and many states started imposing stay-at-home orders in mid-March 2020. The dramatic increase in the number of infected patients in a nursing home in Seattle, Washington stunned the nation and made evident the contagiousness of the virus and its lethality. While national attention was focused on Seattle, the virus was taking a deadly hold in New York City and its surroundings. With every passing day, one state after another announced its first reported cases. No US state has been spared. However, the spread of the virus has been anything but uniform. Figure 1 shows the geographical distribution of reported cases across US counties in April 2020.
This paper attempts to uncover the socioeconomic conditions that are dominant in the areas with high numbers of cases and deaths. The literature on the transmission of infectious diseases often finds that the highest-impact areas have low income, poor sanitary conditions, and poor health care conditions, because it focuses on viruses that have significantly impacted developing countries (Campos et al. (2018) for Zika and Redding et al. (2019) for Ebola are recent examples). Moore et al. (2017) used Ebola to develop an Infectious Disease Vulnerability Index for countries in Africa. The literature on the socioeconomic determinants of the spread of infectious diseases in developed countries is not extensive; Adda (2016) is an exception.
Using data from France, Adda (2016) offers an extensive analysis of the transmission of three viruses: influenza, gastroenteritis, and chickenpox. The paper asks the important questions of whether viruses spread more rapidly during periods of economic growth and whether their spread follows a "gradient determined by economic factors." It finds that the viruses studied indeed propagated faster during times of economic boom due to increased economic activity and contact between people. Qiu (2020) conducted a similar analysis for Wuhan, China. Both papers find a positive relationship between the spread of the virus and economic activity. Avery et al. (2020) offer a list of resources, in terms of both relevant research and data sources, for researchers.
Unlike some of the literature cited above, which concentrates on the impact of mitigation and/or containment strategies along with economic conditions such as GDP and employment, and weather-related factors such as temperature and pollution (Wu (2020)), this paper focuses on economic, demographic, and health conditions in explaining the number of cases and deaths in the US. Figure 1 indicates that the virus has spread mainly in regions of high economic activity on the two coasts. The virus arrived on US shores through international travel. While the initial spread of the virus is expected to be triggered by international travel and economic activity, it is important to understand whether its continued spread and concentration are restricted to such places. As the lockdown continues and the medical profession tries to understand the susceptibility of individuals to contracting the disease, this paper attempts to understand the underlying socioeconomic conditions of the geographic regions around the US that make them susceptible to becoming hotspots. This is related to the question of which factors determine the gradient followed by the virus as it spreads through the country.
Introducing heterogeneity that captures region-specific uniqueness in an epidemiological model of disease spread, the paper develops a Vulnerability Index for the counties included in the study. These indexes capture the underlying factors that affect the vulnerability of a region to the virus. Economic, demographic, and health/lifestyle factors are used to explain the observed differences in the vulnerability index. These factors are also used to explain the differences in the number of deaths reported across the counties. The results indicate that the underlying demographic and health/lifestyle factors have a more significant impact in explaining deaths than disease spread. This is not a surprising result since the spread of the virus does not depend on a person's ethnicity or education status. Once people contract the disease, however, the health outcome depends on a multitude of factors that go beyond an individual's control. The paper uses available county-level data to identify economic, demographic, and health/lifestyle risk factors for different parts of the US. The paper finds that people in regions of high economic activity and economic inequality are at particularly elevated risk of both disease spread and mortality. There is a remarkable parallel between the spread of the disease and the distance-weighted distribution of passengers arriving at international airports. The demographic distribution in terms of race shows higher vulnerability, in terms of both disease contraction and death, for non-Hispanic Blacks, Native Americans, and immigrants. Counties with higher numbers of primary care physicians per 1000 individuals have fewer deaths, and so do places with fewer preventable hospital stays. Some of the high-risk factors, such as obesity and diabetes, are found to have a more mixed result. This can be partly explained by the fact that many of these risk factors are highly concentrated in many of the southern states. These states have not reported as many cases or deaths as the regions around New York, Detroit, Chicago, and the western states of California and Washington.
The paper includes the number of days since the onset of stay-at-home orders issued by governors at the state level across the country. This variable is not found to be statistically significant in influencing the vulnerability index for the spread of the virus. Regions that have had longer stay-at-home orders have experienced a higher number of deaths. These regions would have experienced much higher numbers of cases and deaths without those orders. It is likely that the orders were imposed too late to have been successful in containing the virus. This paper has identified socioeconomic and health/lifestyle factors that have played a critical role in helping the virus to develop a stronghold in certain parts of the country and cause high fatalities. It is true that a single gathering of individuals can lead to a spike in the number of cases and a large number of deaths in a region. The members of the Coronavirus Task Force are monitoring where sudden spikes are occurring. As data on cases and deaths are collected, it is important to be able to better predict whether the population of a certain area is particularly vulnerable to the disease. This paper shows that it is possible to develop a vulnerability index for both disease spread and deaths based on the socioeconomic composition of the population and their health/lifestyle choices. Developing such a profile will be particularly important as the various parts of the country contemplate lifting stay-at-home orders before the invention of a therapeutic or a vaccine.
medRxiv preprint doi: https://doi.org/10.1101/2020.05.04.20091041; this version posted May 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Recent experience suggests that infectious diseases are a major threat to both the health and economic well-being of people around the world. In spite of the experience with H1N1, SARS, and Ebola, countries such as the United States did not develop a coherent infrastructure or strategy to determine which parts of the country are at particularly high risk of disease transmission. This paper shows that it is possible to utilize the economic, demographic, and lifestyle profiles of regions to develop a risk factor for each geographical area so that when the next epidemic arises, public officials are better prepared to anticipate where the hotspots are likely to arise and take the necessary containment steps. The experience with COVID-19 shows how rapidly an infectious disease can bring an economy down. Without advance preparation the next disease will be just as difficult to contain as this one. The large differences within state boundaries show the importance of developing more local strategies that take into consideration a multitude of factors.

Methodology and Data
The coronavirus pandemic has impacted all 50 states in the United States. The experience of each state, county, and city has been anything but homogeneous. To understand this differential effect across counties in the US, we consider two sets of factors. Epidemiological models explain how an infectious disease evolves in a region based on population and the size of the pool of infected individuals. We will use epidemiological models such as the SIR model to determine the fundamental differences in cases based on population size and number of infections. These factors alone cannot explain the heterogeneous outcomes across the country. We expect differences in types and amounts of economic activities, living conditions, demographic makeups, and lifestyle choices to determine the vulnerabilities of communities to the spread of a highly contagious virus such as the coronavirus.
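To make the role of population size and the pool of infected individuals concrete, the following is a minimal sketch of one discrete-time step of a standard SIR model. It is illustrative only and is not the paper's estimation code; the function name `sir_step`, the parameter names `transmission_rate` and `recovery_rate`, and all numerical values are assumptions introduced here.

```python
def sir_step(susceptible, infected, recovered, transmission_rate, recovery_rate):
    """Advance a discrete-time SIR model by one period.

    New infections are proportional to susceptible-infected contacts
    scaled by population size. All parameter names are illustrative.
    """
    population = susceptible + infected + recovered
    new_infections = transmission_rate * susceptible * infected / population
    new_recoveries = recovery_rate * infected
    return (
        susceptible - new_infections,
        infected + new_infections - new_recoveries,
        recovered + new_recoveries,
    )


# Hypothetical county of 100,000 people with 10 initial infections,
# simulated for 7 periods with made-up rates.
state = (99_990.0, 10.0, 0.0)
for _ in range(7):
    state = sir_step(*state, transmission_rate=0.3, recovery_rate=0.1)
```

In this sketch the number of new infections depends only on population and the current pool of infected individuals, which is why such a model by itself cannot account for differences between counties of similar size.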
We will conduct this analysis in two steps. In the first step an epidemiological model of disease spread will be used to generate estimates of a vulnerability index for each county once population and infections are accounted for. In the second step we will use county level economic, demographic, and health data to explain differences in the vulnerability indexes across counties.
Epidemiological models of the SIR type, such as in Blackwood et al. (2018), [...]
where $C_{it}$ denotes the number of reported cases in county $i$ at time $t$, $\gamma_i$ gives the fixed effect parameter for county $i$, $\delta$ is the parameter for the time variable, and $u_{it}$ is the error for county $i$ at time $t$. The lagged value of the cases shows that the number reported on any day depends on the numbers reported the previous day.
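The definitions above, together with the regressors discussed in the estimation section (the one-period lagged number of cases and the interaction of the susceptible and infected populations as a proportion of the population), suggest a dynamic panel specification of roughly the following form. This is a sketch only: the coefficient symbols $\alpha$ and $\beta$, the stocks $S_{it}$ and $I_{it}$, and the population $N_i$ are assumed notation, and the exact lag structure used in the paper may differ.

```latex
C_{it} = \alpha\, C_{i,t-1} + \beta\, \frac{S_{it}\, I_{it}}{N_i} + \delta\, t + \gamma_i + u_{it}
```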
Estimation of the above regression will generate parameter values, $\gamma_i$, for each county. These values will reflect the county-specific fixed effects that influence the vulnerability of each county to the virus. From these fixed effects we generate a vulnerability index for each county. This approach is similar to the one used by Mukherji and Silberman (2013) in studying patent citations between metro areas in the US.
In the second step of the analysis, we use county-level economic, demographic, and health care factors to explain how they influence the vulnerability index for each county. The factors that may explain the county vulnerability index are classified into three groups. The first group relates to economic conditions and includes per capita personal income, the unemployment rate, the level of income inequality, poverty, access to housing, and the concentration of different types of industries such as manufacturing, mining, and others. The second group relates to demographic factors, including the size of the population and its density, the racial profile of the counties, the age distribution of the population, and the percentage of the population that was born outside the United States. The third group includes health- or lifestyle-related factors such as the number of primary care physicians per capita, the percentage of the population with obesity and diabetes, the percentage of the population that smokes and drinks, and the percentage of the population with inactive lifestyles.
In addition to the county-level economic, demographic, and health data, spatial factors are considered as well. The contagious nature of the disease compels one to consider the spillover effects to neighboring counties. We introduce inverse-distance weighted values of the number of international passengers served by the top 46 international airports in the contiguous US.
Since the virus is presumed to have originated in China and then spread to other parts of the world, including Europe, before taking hold in the United States, international passenger data is introduced to examine whether proximity to international airports is related to the concentration of confirmed cases. While international passengers often arrive at a particular airport and then use domestic airlines to travel to other parts of the country, the locations of the international airports are closely tied to areas with concentrations of activities that are globally oriented. Consequently, a large number of the international passengers served by these airports are expected to interact in the regions around these airports. Using a 300-mile radius around each county where the airports are located, an inverse-distance matrix is used to assign the number of international passengers to the areas surrounding the airports. The bottom part of Figure 1 displays the weighted distribution of international passengers. While these data are not derived from the number of confirmed COVID-19 cases, the spatial distribution of the passenger data is similar to the spatial distribution of confirmed COVID-19 cases.
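The inverse-distance assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes weights of the form 1/d within the 300-mile cutoff and zero beyond it, uses great-circle distances, applies a 1-mile floor to avoid division by zero, and does not normalize the weights. The coordinates and passenger counts are made-up values.

```python
import math

EARTH_RADIUS_MILES = 3958.8
CUTOFF_MILES = 300.0  # radius stated in the text


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def inverse_distance_weight(distance, cutoff=CUTOFF_MILES, floor=1.0):
    """Return 1/d inside the cutoff and 0 outside (the floor is an assumption)."""
    if distance > cutoff:
        return 0.0
    return 1.0 / max(distance, floor)


def weighted_passengers(county, airports):
    """Sum airport passenger counts weighted by inverse distance to the county."""
    lat, lon = county
    total = 0.0
    for a_lat, a_lon, passengers in airports:
        d = haversine_miles(lat, lon, a_lat, a_lon)
        total += inverse_distance_weight(d) * passengers
    return total


# Hypothetical airports: (latitude, longitude, international passengers).
airports = [(40.64, -73.78, 1_000_000), (33.94, -118.41, 1_000_000)]
near = weighted_passengers((40.71, -74.00), airports)  # county close to one airport
far = weighted_passengers((39.00, -98.00), airports)   # county beyond 300 miles of both
```

Under these assumptions a county receives a positive weighted passenger count only if at least one of the airports lies within 300 miles, and closer airports contribute more.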
The estimation of the impact of these regional factors in explaining differences in vulnerabilities to the disease will be based on Equation (2).
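A linear specification consistent with the variable definitions that follow can be sketched as below. The coefficient symbols $\alpha$, $\beta_k$, $\theta_m$, $\phi_n$, and $\lambda$, the error term $\varepsilon_i$, and the county index on the demographic and health terms are assumed notation rather than the paper's own.

```latex
V_i = \alpha + \sum_{k} \beta_k\, e_{ki} + \sum_{m} \theta_m\, d_{mi} + \sum_{n} \phi_n\, h_{ni} + \lambda\, (W I)_i + \varepsilon_i
```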
In the above equation, $V_i$ represents the vulnerability index of county $i$, and $e_{ki}$ represents the set of $k$ economic variables that make a county susceptible to the spread of the disease through enhanced interactions between people and work in close proximity. Although the economic activity of a county changes with time, the general distribution of such activities across the country remains relatively stable within short periods of time. $d_m$ represents the demographic factors and $h_n$ represents the health care factors discussed above. This equation includes a spatially weighted number of international passengers in the region, obtained by multiplying an inverse distance-weighted matrix $W$ with the number of international passengers, $I$, served by an international airport in the neighborhood of county $i$.
This paper uses county-level data for the United States. The data on COVID-19 cases and deaths are obtained from the COVID tracking data provided by the New York Times and Johns Hopkins University. Figure 1 displays the distribution of cases in the 2512 counties.
Data sources for the various demographic and economic variables, such as population distribution by ethnicity and population density, are listed in Table 1. While many of the data listed [...] for local spillover effects of the virus in the form of increased susceptibility due to higher prevalence of cases, an inverse distance weighted matrix was created with positive weights assigned up to a 300-mile radius around a county. This radius is just large enough to ensure that each county in the study had at least one other county in the study as a neighbor.

Estimation of Cases
The previous section explained that the analysis of the socioeconomic factors that can contribute to the spread and concentration of the coronavirus in various parts of the country is founded on the epidemiological model of disease transmission. The first step is to generate county-level vulnerability measures from an estimation of equation (1) [...] infections in determining the proportion of the population that is susceptible at any time $t$.
The incubation period for this virus is estimated to be anywhere between 2 and 14 days. People are infectious for a few days before they develop symptoms as well as after they develop symptoms. We assume a 7-day lag for the results reported in the paper. Sensitivity analysis was conducted for different lag lengths.
Equation (1) [...] Stephen (1981). A difference GMM estimation is found to be the best option for the data. The Arellano-Bond estimation method (Arellano and Bond (1991)), which uses lagged values as instruments, was used as implemented by Roodman (2006). Results are reported in Table 2.
The results show that although first-order autocorrelation exists, there is no second-order autocorrelation. The Sargan and Hansen tests do not reject the validity of the overidentifying restrictions, and the F statistic shows that the model fits the data well. The table shows that the one-period lagged number of cases has a significant impact on the number of cases reported on any day. The interaction of the infected and susceptible populations is also significant and positive.
One of the key objectives of this regression is to obtain a set of estimates for the county-level fixed effects. Dynamic panel estimation that utilizes first differencing removes time-invariant variables, including the fixed effects. These are, however, recoverable from the residuals. For a dynamic panel model of the form $y_{it} = \rho y_{it-1} + a_i + e_{it}$, the residual is $\hat{e}_{it} = a_i + e_{it} + (\rho - \hat{\rho}) y_{it-1}$. The average $\bar{e}_i$ can be used as an estimator of the fixed effects to analyze how the underlying conditions in the various counties impact the fixed effects, as long as those factors are uncorrelated with the $e_{it}$. That condition is satisfied, with the average $e_{it}$ equalling -7.00e-09 for the results of the regression of equation (1). The plot of the fitted and observed values in Figure 5 shows the distance between the observed values and the fitted line, which will be the county-level fixed effects.
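The recovery of fixed effects from averaged residuals can be illustrated with a small sketch for the simplified model $y_{it} = \rho y_{it-1} + a_i + e_{it}$ stated in the text. The function name `county_fixed_effects`, the toy panel, and the value of `rho_hat` are assumptions for illustration; in practice the estimate of $\rho$ would come from the difference GMM step and the regression would contain further terms.

```python
from statistics import mean


def county_fixed_effects(panel, rho_hat):
    """Average each county's residuals y_it - rho_hat * y_{i,t-1} over time.

    `panel` maps a county name to its outcomes ordered by time.
    """
    effects = {}
    for county, y in panel.items():
        residuals = [y[t] - rho_hat * y[t - 1] for t in range(1, len(y))]
        effects[county] = mean(residuals)
    return effects


# Noise-free toy panel generated with rho = 0.5 and fixed effects of 2 and 5.
panel = {
    "county_a": [0.0, 2.0, 3.0, 3.5],
    "county_b": [10.0, 10.0, 10.0, 10.0],
}
effects = county_fixed_effects(panel, rho_hat=0.5)
```

Because the toy data contain no noise and `rho_hat` equals the true value, the averaged residuals return the fixed effects used to generate each series. With real data the averages also absorb the mean error and the term involving $\rho - \hat{\rho}$.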

Estimation of the Vulnerability Index
The estimates of the fixed effects derived from the dynamic panel regression of cases are converted to an index by transforming the mean value to 100; the result is termed the Vulnerability Index. High values of the index indicate that the counties are more susceptible to the growth of the disease. The values of the index range from a low of 63 for Lincoln, Arkansas to a high of 229 for New York City, New York. Table 3 [...] Table 2 while the results are reported in Table 4. The differences in the three sets of results are based on the inclusion of population and population density in the regression. These two variables have a correlation of 0.76. As discussed in the previous section, the independent variables are classified into three broad groups: economic, demographic, and health/lifestyle.
The results show that in the economic group, per capita income has a positive and significant effect, showing that places of high income have higher vulnerability. The Gini coefficient measuring the degree of income inequality and severe housing problems are positive and significant if only population density is included. Another measure of economic hardship, the degree of food insecurity, has a significant and negative effect. This is consistent with the result on income. Figure 2 shows that the largest concentrations of counties with the highest levels of food insecurity are in the southern states of Georgia, Mississippi, Arkansas, and Alabama, places that have not reported as many cases as some of the hotspot counties in the northeast and west. The unemployment rate and the indicator of deep poverty are not found to be significant variables. The results also show that places of severe housing shortage have higher vulnerability indexes only when population is not included. Together these results show that counties with higher vulnerability have higher economic activity.
The measures of income inequality and severe housing problems have a positive impact on the vulnerability index, but they are only significant when population is not included.

[...] interactions people have that make them vulnerable to coming into contact with other carriers of the disease. The variable on driving alone to work is negative and highly significant. This is consistent with the notion that driving alone causes less exposure to others and can serve as protection against getting infected. Population size is a highly significant indicator, and so is density as long as population is not included.

Estimation of Deaths
While the age distribution and health indicators are not significant in explaining the differences in the number of cases across the counties, it is well established that at the individual patient level those are important factors. The daily data provided by the New York Times and Johns Hopkins University report the number of deaths as well. Table 5 [...] The coefficients of the regressors are reported as incidence rate ratios to help in the interpretation of the values. Unlike the estimation of cases reported in Table 3 [...]
On the health-related factors, counties with more primary care physicians have fewer reported deaths. The remaining results related to health indicators are as follows: places with more preventable hospital stays and with higher percentages of the population that have diabetes, smoke, or are physically inactive have higher reported deaths. These results are not surprising since people with underlying health risks are expected to experience more severe reactions to the infection.
Counties with higher obesity rates and a higher percentage of the population that engages in excessive drinking have fewer deaths. The coefficients of the region codes 2-4 are less than 1, indicating that relative to the excluded region, the northeast, the other regions had a smaller incidence of death.
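As a reading aid for incidence rate ratios, the sketch below shows the standard conversion from a count-model coefficient on the log scale to a ratio, and from a ratio to a percent change in the expected count. The function names and the example coefficient are hypothetical and are not values from the paper's tables.

```python
import math


def incidence_rate_ratio(coefficient):
    """Convert a count-model coefficient on the log scale to an IRR."""
    return math.exp(coefficient)


def percent_change_in_expected_count(irr):
    """Percent change in the expected count for a one-unit increase."""
    return (irr - 1.0) * 100.0


# Hypothetical coefficient chosen so that the ratio is 0.8.
irr = incidence_rate_ratio(math.log(0.8))
change = percent_change_in_expected_count(irr)
```

A ratio below 1, as reported for the region codes, corresponds to a negative percent change, that is, a lower expected number of deaths relative to the reference category, holding the other regressors fixed.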
The results show that the economic factors are important for explaining the differential impacts experienced by counties across the country, in terms of both confirmed cases and reported deaths. The demographic and health-related factors are more pronounced in the estimation of deaths than of reported cases. This is not surprising since the virus does not discriminate based on any factor other than immunity, but the severity of the disease that can lead to a fatal outcome depends on underlying health and demographic factors.

Conclusion
This paper has examined the differential experience of infections and deaths across the United States due to the COVID-19 pandemic. Daily reported counts of confirmed cases and deaths were examined over a 20-day period from March 30 through April 19, 2020. Although data are available for over 2700 counties, this paper focused on 771 counties that reported an average of 30 cases over the 20-day period. The counties that are not included in the study had far fewer cases and reported deaths. The counties that remain in the sample are vastly diverse. The excluded counties are largely similar in their small numbers of cases and reported deaths, and including them would have added significant computational complexity without adding much value.
The analysis of the number of cases is based on an epidemiological model in which we included a county fixed effect. This is a novel way to introduce heterogeneity in such a model.
As noted by Avery et al. (2020), the epidemiological models do not include the heterogeneity that economic models require. A dynamic panel regression of the number of cases included the potential number of interactions between susceptible and infected individuals as a proportion of the population along with county fixed effects. The results of the model were used to construct a Vulnerability Index for each county. Economic, demographic, and health/lifestyle factors were used to explain the differences in the Vulnerability Index across the counties.
The results show that counties with higher economic activity have higher vulnerability. They also show that regions around international airports experienced higher numbers of cases than ones that are over 300 miles away. This is consistent with the fact that the virus arrived on US shores through travelers coming to the US from abroad. The results also show that places with higher vulnerability have a higher proportion of the population that does not use public transportation to go to work. Counties with more non-Hispanic Black, Native American, and immigrant residents are more vulnerable. The remaining demographic and health variables were largely insignificant.
Because many counties reported zero deaths on many of the days in the sample, a zero-inflated negative binomial pooled regression was used to analyze how the economic, demographic, and health conditions affect the severity of the infection experienced by the counties. The results show that the economic factors have a similar impact on deaths. That is, counties with higher income and cases also experienced more deaths. Counties with higher income inequality and housing shortage also experienced more deaths. In contrast to the results for reported cases, this regression showed that not only do counties with higher percentages of non-Hispanic Blacks, Native Americans, and immigrants experience more deaths relative to counties with non-Hispanic Whites, so do counties with a higher concentration of people with less than a college education. Counties with more primary care physicians per capita experienced fewer deaths, and so did counties with a lower percentage of the population with diabetes, fewer smokers, and fewer preventable hospital stays. Counties with higher obesity, HIV, and drinking are associated with fewer deaths.
The coronavirus pandemic has demonstrated how quickly a highly contagious respiratory illness can bring the global economy to a standstill. There have been several such infections in the last ten years although none of them had the virulence or lethality of this virus. Most of them spread to a few countries and then disappeared. The developed world remained largely unaffected by most of them and the experience of this pandemic has laid bare the lack of infrastructure to respond to such an incident. The economics literature is not extensive in the area of pandemics and epidemics in developed countries. The contribution of this study is to understand the various socioeconomic conditions that can make a county or region more vulnerable to both disease spread and severity of cases. A national strategy to prepare the infrastructure for controlling the spread of infectious diseases should consider these factors and develop Vulnerability Indexes for each region.


The coefficients reported in this table are incidence rate ratios.

Figure 1: Distribution of Cases and Weighted International Passengers