Geospatial Analysis of Individual and Community-Level Socioeconomic Factors Impacting SARS-CoV-2 Prevalence and Outcomes

Background: The SARS-CoV-2 pandemic has disproportionately affected racial and ethnic minority communities across the United States. We sought to disentangle individual and census tract-level sociodemographic and economic factors associated with these disparities. Methods and Findings: All adults tested for SARS-CoV-2 between February 1 and June 21, 2020 were geocoded to a census tract based on their address; hospital employees and individuals with invalid addresses were excluded. Individual (age, sex, race/ethnicity, preferred language, insurance) and census tract-level (demographics, insurance, income, education, employment, occupation, household crowding and occupancy, built home environment, and transportation) variables were analyzed using linear mixed models predicting infection, hospitalization, and death from SARS-CoV-2. Among 57,865 individuals, per capita testing rates, individual (older age, male sex, non-White race, non-English preferred language, and non-private insurance), and census tract-level (increased population density, higher household occupancy, and lower education) measures were associated with likelihood of infection. Among those infected, individual age, sex, race, language, and insurance, and census tract-level measures of lower education, more multi-family homes, and extreme household crowding were associated with increased likelihood of hospitalization, while higher per capita testing rates were associated with decreased likelihood. Only individual-level variables (older age, male sex, Medicare insurance) were associated with increased mortality among those hospitalized. Conclusions: This study of the first wave of the SARS-CoV-2 pandemic in a major U.S. city presents the cascade of outcomes following SARS-CoV-2 infection within a large, multi-ethnic cohort. SARS-CoV-2 infection and hospitalization rates, but not death rates among those hospitalized, are related to census tract-level socioeconomic characteristics including lower educational attainment and higher household crowding and occupancy, but not neighborhood measures of race, independent of individual factors.


INTRODUCTION
5 or examination of only a single socioeconomic measure (e.g. income) or SARS-CoV-2-related outcome (e.g. hospitalization, without consideration of the upstream outcome of infection).
Further, many studies capture area-level socioeconomic measures on a county, city/town, or zip code level, which obscures the granular variation in socioeconomic factors over adjacent areas. Census tracts represent geographical units that are smaller in population size than zip codes (average population 4,000 vs 30,000) and represent more homogeneous socioeconomic regions, 27 leading to stronger associations with health outcomes. 28 We hypothesized that high resolution geocoding to the census tractlevel would capture more variation in COVID-19 cases and disease trajectory related to sociodemographic and economic factors.
In this study, we examine both individual and census tract-level indicators to identify factors associated with SARS-CoV-2 infection, hospitalization among infected individuals, and death among hospitalized individuals in a large and diverse cohort seeking care within an integrated healthcare system in New England.

DATA SOURCES
Individual demographics, address, laboratory values, diagnoses, hospitalizations, and deaths were obtained from the electronic medical record (EMR) of Mass General Brigham (MGB, formerly Partners HealthCare), a large, integrated healthcare system serving Eastern Massachusetts. Census tract sociodemographic, economic, and built environment characteristics were obtained from the 2014-2018 American Community Survey (ACS).

STUDY POPULATION
All individuals who received viral polymerase chain reaction (PCR) testing for SARS-CoV-2 in the MGB system between February 1, 2020 and June 21, 2020, a window beginning approximately two . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. . 6 months before and ending three months after the peak of the Massachusetts surge, were included.
Individual addresses were geocoded to a latitude and longitude (using the DeGAUSS geocoder, 29 Supplemental Figure 1) from which census tracts were assigned. Individuals were excluded if their address was invalid, unlisted, a PO box number, or located outside Massachusetts; they did not have a valid test result (e.g. test inconclusive); their age was < 18 years; or they were employed by the MGB system (due to different exposures and thresholds for testing compared to the general population).

STUDY OUTCOMES AND COVARIATES
Outcomes included (1) infection with SARS-CoV-2, (2) hospitalization related to SARS-CoV-2 among those testing positive, and (3) death among those hospitalized (see Figure 1 for population included in each outcomes analysis). SARS-CoV-2 infection was defined by positive PCR testing at an MGB facility. Hospitalizations were considered related to SARS-CoV-2 if a positive test occurred during or up to ten days prior to the inpatient admission or if a positive test occurred outside this window but inpatient encounter ICD10 codes suggested a SARS-CoV-2-related admission (B97.29, Z03.818, or Z20.828, recommended by the CDC prior to the availability of codes specific for SARS-CoV-2; 30 or U07.1, "COVID-19," which became effective as of April 1, 2020). Any death following a related hospitalization was considered SARS-CoV-2-related.
Individual demographic variables included age, sex, race/ethnicity, and preferred language, each obtained by self-report upon registration in the hospital system. Racial and ethnic variables were collapsed into non-Hispanic White, non-Hispanic Black, non-Hispanic Asian, Hispanic, and Other categories, and preferred languages were categorized as English, Spanish, or Other. In patients with selfreported "Other" or missing race, self-reported country of origin and language were used to infer race when available (n=859 of 4,363; Supplemental Table 1-2). A sensitivity analysis was performed in which only those with missing, and not of "other," race were reassigned based on either country of origin or preferred language. Insurance was categorized as private, non-Medicare (including private health insurance without any Medicare plans), Medicare (including any Medicare plan among listed insurances), . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. . 7 public (including Medicaid, Mass Health, and other public insurance plans excluding Medicare), and safety net/uninsured (including those with safety net, self-pay, or no insurance listed).
Census tract-level variables included population density and measures of demographics, health insurance, income, education, employment, occupation, household crowding (individuals per room) and occupancy (individuals per household), built home environment (housing units per building), and transportation use, all reported as continuous percentages in ACS data tables (Supplemental Table 3).
ACS variables with low variability in our sample (defined as ≥ 75% of census tracts having the same value) were dichotomized (Supplemental Table 4). Essential workers were classified by applying the U.S.

Department of Homeland Security "Advisory Memorandum on Identification of Essential Critical
Infrastructure Workers During COVID-19 Response" to ACS Table C24050. Per capita testing rates were calculated as the number of individuals tested in the MGB system living in that census tract divided by the total population of the tract. SARS-CoV-2 positivity rates are reported as the total number of positive individuals divided by the total number of tested individuals in each census tract.

STATISTICAL ANALYSIS
Baseline characteristics are reported using means and standard deviations for continuous variables and proportions for categorical variables.
All models were fit using a logistic mixed model with a "matern kernel" as a random effect to account for spatial autocorrelation, using latitude and longitude of the address to estimate spatial similarity. 31 The matern kernel has multiple hyperparameters. For each of the three outcomes ( Figure 2a) we used the "base variables" (including all individual-level characteristics and census tract per capita testing rates; Table 1, Figure 2b) to perform a "grid search" to find the range and knot parameters that gave the best model ( Figure 2c; the "base models"). We used the same parameters in subsequent analyses.
We then fit logistic mixed models using all base variables plus each census tract variable alone (the "base-plus models"). We accounted for multiple hypothesis testing for census tract-level variables in the . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. . 8 base-plus models using the false discovery rate (FDR) adjustment procedure where an FDR-adjusted pvalue of < 0.05 was deemed significant (Table 2). Finally, we fit multivariable logistic mixed models with all base variables and all census tract variables which fell under the FDR significance threshold in the base-plus models (the "full models"). For any cluster of similar census variables (e.g. percent of households with ≥ 5 people, percent of households with ≥ 6 people) we selected the variable with the lowest p-value to be included in the full model. We exponentiated all coefficients in order to report odds ratio per 1-unit change of each variable (percentage point, year in age, or versus the referent [Tables 1-2]).
In the full multivariable model, we assessed significance and independent association using a p-value < 0.05. All analyses were performed using R version 3.6.2.
This study was reviewed and deemed exempt by the Institutional Review Board of Mass General Brigham.

BASELINE CHARACTERISTICS
A total of 80,502 individuals were tested for SARS-CoV-2 in the MGB system between February 1, 2020 and June 21, 2020, and 57,865 individuals met inclusion criteria ( Figure 1). Of these, 56% were female, 36% were of Asian, Black, Hispanic, or Other race/ethnicity (henceforth "non-White" for brevity), and mean age was 52.3 years. Tested individuals resided in 1,386 unique census tracts. Detailed baseline characteristics of tested individuals are presented in Table 1  CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. .

9
individual-level characteristics, including older age, male sex, non-White race, preferred language being other than English, and Medicaid and uninsured/unlisted insurance status, were associated with increased likelihood of infection (Table 2). In the full multivariable models, census tract-level factors, including higher logarithm of population density (OR 1.14, 95% CI 1.03, 1.27) and increase in the percent with higher household occupancy (≥ 5 individuals per household; OR 2.20, 95% CI 1.13, 4.30) were associated with risk of infection, while increased percent with a college education (OR 0.63, 95% CI 0.40, 0.99) was associated with lower risk.

INDIVIDUAL AND CENSUS TRACT-LEVEL FACTORS ASSOCIATED WITH HOSPITALIZATION AND DEATH RELATED TO SARS-CoV-2
Among those who tested positive, 3,009 (30.6%) required hospitalization related to SARS-CoV-2. Hospitalization was associated with individual older age; male sex; Black, Asian, or Other race; Spanish language preference; and Medicare and Medicaid insurance in all models. By contrast, higher per capita testing rates, Hispanic ethnicity, missing race or language, and uninsured/unlisted insurance status associated with lower rates of hospitalization. Census tract-level factors including presence of extreme household crowding (any homes with ≥ 2 occupants per room vs none; OR 1.14, 95% CI 1.01, 1.30), higher percent multi-family homes (OR 1.83, 95% CI 1.01, 3.31), and higher percent of individuals with less than high school education (OR 5.19, 95% CI 1.15, 23.50) were associated with hospitalization in the full model (Table 2).
All significant results are summarized in Figure 4.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. .

This study demonstrates that individual SARS-CoV-2 infection in Eastern Massachusetts was
associated with key census tract-level measures of higher population density, percent without a college degree, and household occupancy, independently of individual factors including older age, male sex, non-White race, non-English preferred language, and Medicaid or uninsured/unlisted insurance as compared to private insurance. Those infected with SARS-CoV-2 had higher rates of hospitalization if they lived in areas with higher rates of extreme household crowding and percent multi-family homes and lower rates of high school completion. Although census tract-level sociodemographic and economic factors were associated with infection and hospitalization, they were not associated with mortality among those hospitalized. Death was associated only with individual-level factors, including older age, male sex, and Medicare insurance (which may further capture age as a risk factor). Together, these findings suggest that while community socioeconomic factors do increase risk of infection and hospitalization, they are not clearly associated with death among those hospitalized for SARS-CoV-2, as seen in previous studies. 7,32 Our study identified individual non-English language and census tract-level lower educational attainment as significant risk factors for both infection and hospitalization even after adjustment for race and ethnicity. To our knowledge, this has not been previously described. These findings lend support for efforts to provide easily accessible, multi-lingual educational materials targeted to all educational levels regarding social distancing practices, symptoms of SARS-CoV-2, and known risk factors for adverse outcomes. Further, we were able to refine previous associations between infection and hospitalization rates with area-level household crowding. In fact, this association is consistent with smaller studies 33,34 and may partially explain racial disparities seen in aggregated county-level data. 19 For example, approximately one-fifth of US residential units are unsuitable for home quarantine based on the number of rooms available per individual, with racial and ethnic minorities overrepresented among these households. 35 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. .

1
By examining the full spectrum of outcomes from testing to mortality in a very large urban cohort, this study extends the developing body of research suggesting that race/ethnicity, socioeconomic status, and neighborhood factors are significantly and independently associated with risk of both infection with and adverse outcomes due to SARS-CoV-2. Multiple studies have examined aggregated larger arealevel factors associated with infection or death, identifying area-level race, population density, poverty, household crowding, and lower rates of medical insurance as risk factors. 19,36,37 By contrast, few studies have examined individual-level data to study the relationship between sociodemographic and economic factors in both SARS-CoV-2 infections and outcomes. Studies in the US and UK have identified non-White race, higher household crowding and occupancy, lower income, higher unemployment, and employment as essential or healthcare workers as correlates of infection among those tested, 15 Our study builds on previous findings by leveraging a larger and more diverse cohort, adjusting for numerous census tract-level variables to represent diverse and correlated socioeconomic constructs, and describing the complete cascade of outcomes beginning with SARS-CoV-2 testing and including diagnosis of SARS-CoV-2, hospitalization, and death. This results in the ability to disentangle individual and neighborhood factors contributing to risk (Figure 4a). For instance, area-level measures of percent racial and ethnic minority residents, which were previously associated with SARS-CoV-2 outcomes in studies using aggregated data from our study area, 37,40 were significant in our base-plus models (models including individual characteristics and a single census tract-level variable, e.g. percent Black residents) of infection and hospitalization; however, these variables lost significance in multivariable models . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. .

2
adjusting for other census tract-level socioeconomic factors, such as educational attainment and household occupancy, suggesting these were the true factors associated with outcomes, rather than neighborhood racial/ethnic makeup itself. Further, we believe that analysis at the census tract-level, rather than the larger zip code or county levels, allows for improved identification of subtle associations. For example, we detect associations of census-tract educational attainment with infection rates, an association not found in a study of Massachusetts using data at the city or town-level. 40 We find that individual age, sex, race/ethnicity, language, and insurance associate with both infection and hospitalization, while different census tract-level sociodemographic and economic factors associate with each outcome.
However, even in models fully adjusted for significant community socioeconomic factors, the association of individual race/ethnicity with outcomes persists, suggesting that differences in underlying comorbidities or as yet unrecognized environmental contributors to risk may not be captured in this analysis.
One important characteristic in the discussion of disparities in SARS-CoV-2 outcomes bears further discussion. In our sample, Hispanic ethnicity and Spanish language are highly, but not completely, correlated, yet our models showed opposite directions of effect for these variables on hospitalization. To improve our understanding of these factors in isolation, we created base models for hospitalization and death, alternately excluding race/ethnicity or language (Supplemental Table 5). These found no significant association between Hispanic ethnicity or Spanish language and hospitalization or death when the other variable was excluded. On average, Hispanic patients are younger than the overall population, which may drive this neutral association with adverse outcomes despite higher rates of infection.

LIMITATIONS
These findings must be interpreted in the context of the study design. The number of cases diagnosed depends heavily on testing rates, which were influenced by changing testing criteria which, in turn, were influenced by local government actions, the necessity of ensuring healthcare worker safety, . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. .

3
prevalence of community health centers conducting outpatient testing, and awareness of existing outbreaks. For this reason, we adjusted all analyses for census tract-level per capita testing rates.
Additionally, this cohort included only patients tested within one large healthcare system, leading to under-representation of neighborhoods in this geographic area which are predominantly served by other institutions, raising the possibility of selection bias. Similarly, while we fully capture test positivity among those tested and death among those hospitalized, an individual may have been tested in the MGB system but later presented to another hospital for admission, leading to incomplete capture of the hospitalization outcome. This may be especially common among those with missing information (e.g. insurance) who may have only interacted with the MGB system briefly for a SARS-CoV-2 test. While deaths which occurred in individuals admitted to other hospitals were not captured, these individuals would have been excluded from the death analysis which only included those hospitalized in our system.
Although more than 500 deaths occurred in the cohort, power to detect smaller differences related to highly correlated variables may be limited. Race is a social construct influenced by many factors; the prescribed assignment of race based on country of origin and language oversimplifies the concept of race and may lead to misassignment of individuals of mixed ancestry, complex racial identity, or a minority racial identity in their country of origin. However, in a sensitivity analysis in which only those with missing race data, but not self-reported "Other," race were reassigned (n=393 of 3,248, Supplemental is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 30, 2020. .

4
help to explain the racial and ethnic disparities seen during the COVID-19 pandemic as part of a complex milieu of socioeconomic and structural factors which contribute to disease risk. As many countries anticipate second waves of SARS-CoV-2 infection, the findings of this study from the first wave of SARS-CoV-2 in a large urban area may help to identify individuals and communities at greatest risk during subsequent outbreaks.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 30, 2020. . . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review) preprint
The copyright holder for this this version posted September 30, 2020. . . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. . . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted September 30, 2020. .    . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

0
The copyright holder for this this version posted September 30, 2020. . . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

1
The copyright holder for this this version posted September 30, 2020. .  Table 2) For each SARS-CoV-2 outcome, select the optimal matern kernel parameters (number of knots and range)

2
for the logistic mixed model by selecting the best model fit using only base variables as covariates (base model). These optimal kernel parameters are used for all subsequent analysis. All other kernel parameters are set to default.
For each SARS-CoV-2 outcome, loop through all Census tract-level variables and fit a logistic mixed model using all base variables plus that particular Census variable (base-plus model). Aggregate the statistics for all census variables into a single data frame.
Use multiple comparisons testing to adjust p-values in the data frame at an FDR level of 0.05 Remove all census variables with adjusted p-value greater than 0.05 from subsequent analysis Run a multivariate logistic mixed model with base variables and all census variables with an adjusted pvalue less than 0.05 (full model). If there is a cluster of similar variables (e.g. percent of households with ≥ 5 people, percent of people with ≥ 6 people, etc…) then select the variable with the most significant p-value.