Abstract
The COVID-19 pandemic has yielded disproportionate impacts on communities of color in New York City (NYC). Researchers have noted that social disadvantage may result in limited capacity to socially distance, and consequent disparities. Here, we investigate the role of neighborhood social disadvantage on the ability to socially distance, infections, and mortality. We combine Census Bureau and NYC open data with SARS-CoV-2 testing data using supervised dimensionality reduction with Bayesian Weighted Quantile Sums regression. The result is a ZIP code-level index with relative weights for social factors facilitating infection risk. We find a positive association between neighborhood social disadvantage and infections, adjusting for the number of tests administered. Neighborhood infection risk is also associated with capacity to socially isolate, as measured by NYC subway data. Finally, infection risk is associated with COVID-19-related mortality. These analyses support that differences in capacity to socially isolate are a credible pathway between disadvantage and COVID-19 disparities.
Introduction
The 2019 novel coronavirus (SARS-CoV-2) emerged in Wuhan, China, and has since become a worldwide pandemic. In the United States, given the nature of this novel infectious disease, anyone exposed to the pathogen was believed susceptible to infection, there were no proven pharmacologic treatments, and testing capacity was low. Pre-existing conditions are known risk factors of disease severity, and mortality increases sharply with age1. Consequently, the United States federal, state, and local governments have principally relied on non-pharmaceutical interventions such as social distancing and mask-wearing. New York State (NYS) on PAUSE is one such effort, whereby essential workers, i.e. healthcare workers, food purveyors, bank tellers, etc., were the only employees that should be reporting to work2. We examine the role of social factors, such as employment and commuting patterns, population density, food access, and personal finances and access to healthcare, in infection risk.
It has been widely noted in popular media and emerging scientific evidence that COVID-19 is taking a disproportionate toll on communities of color3–6. For example, in Chicago, Blacks comprise 70% of COVID-related deaths, but only 30% of the population6. In New York City (NYC), Hispanics/Latinx and Blacks are disproportionately impacted, representing 34% and 29% of the deaths, but 28% and 22% of the population, respectively6. While differences in disease severity are likely attributed to higher levels of preexisting conditions, i.e. health disparities7, this does not explain differences in disease incidence. A survey of laboratory-confirmed hospitalized cases across 14 states found, where race was reported, that 33.1% of hospitalized patients were non-Hispanic Black8. In NYC, as of May 13, 2020, the cumulative incidence of non-hospitalized positive cases was 798.2, 684.8, and 616.0 per 100,000 for Blacks/African Americans, Hispanic/Latinx, and Whites, respectively9.
A body of literature on the social determinants of health suggests that there are numerous inequities that provide the scaffolding for increased COVID-19 infection rates in communities of color. Racism operates on both the interpersonal and structural levels, the latter explaining the societal mechanisms that reinforce inequality, including through housing, employment, earnings, benefits, health care, criminal justice, etc.10. Those structural forms of social disadvantage are responsible for many of the health disparities we observe in communities of color11.
Researchers have outlined the ways in which residential segregation and structural disadvantages lay the groundwork for racial disparities in infectious diseases12. More recently, others have noted that social distancing is more difficult for communities of color6. Taken together, this literature highlights the social mechanisms that facilitate viral spread in communities of color. The underlying structural disadvantages relevant to the current coronavirus pandemic might include that people of color (POC) are more represented amongst low-wage jobs13, many of which are now deemed essential14. When they get home from work, they are more likely to return to densely populated homes and neighborhoods15. Further, multigenerational homes are more common in communities of color16, making social distancing between least susceptible (healthy children) and most susceptible (elderly adults with chronic conditions) difficult. POC often live further from supermarkets and sources of nutritious foods, necessitating further travel for groceries17. These factors, among others, underscore the many ways that the capacity to social distance may be contextual, and based on structural factors.
In this study, we use socioeconomic data on neighborhood characteristics to understand differences in infection incidence between neighborhoods. We quantify the relative contribution of these measures of social disadvantage and assess whether a proxy of social isolation, NYC subway utilization, helps explain these differences. We create a ZIP code level infection risk index for NYC and show how this index explains racial/ethnic disparities in cases, thus reflecting structural forms of disadvantage. Finally, we examine the relationship between neighborhood infection risk and neighborhood-level COVID-19 mortality. Ultimately, we create a tool that identifies social factors that facilitate viral spread, and therefore, may be useful throughout the US to pinpoint potential areas for targeted public health intervention.
Results
Cross-sectional neighborhood infection risk index
We wanted to identify any association between a neighborhood social disadvantage composite index and cumulative COVID-19 infection incidence. There were 174,614 positive tests across 177 NYC ZIP Code Tabulation Areas (ZCTAs) as of May 7, 2020. Kendall’s tau correlations between social disadvantage variables ranged from −0.15 to 0.61. Kendall’s tau correlation tests were also conducted between each variable and the infection incidence (Supplemental Table 2). An assumption of the Bayesian Weighted Quantile Sums (BWQS) regression is that the direction of the effect for each variable is the same as the overall effect. Given the a priori hypothesis that increased disadvantage yields higher infections, we used the reciprocal of variables that are negatively associated with the infection incidence.
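The direction-alignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function names, the toy values, and the use of the simple tau-a statistic (no tie correction) are assumptions made for the example.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def orient_to_outcome(values, outcome):
    """Take the reciprocal of a variable that is negatively associated with the outcome.

    Assumes strictly positive values, so the reciprocal is defined and order-reversing.
    """
    if kendall_tau_a(values, outcome) < 0:
        return [1.0 / v for v in values]
    return list(values)


# Hypothetical toy data for four areas (not taken from the study).
median_income = [90_000, 60_000, 45_000, 30_000]
infections_per_100k = [800, 1500, 2100, 2600]

oriented_income = orient_to_outcome(median_income, infections_per_100k)
```

After the transformation, the oriented variable rises with the outcome, which is the condition the BWQS assumption requires.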
The BWQS regression analysis identified evidence of an association between our composite variable of ZCTA-level social disadvantage (on a ten unit scale) and the number of infections per 100,000 (Figure 1). We found that each unit increase in social disadvantage is associated with a 10% increase in infections per capita (Risk Ratio: 1.10; 95% Credible Interval: 1.08, 1.11). While all included variables contributed to this composite, they do not all contribute equally (Figure 2). We found that the average number of people in a household is the single largest contributor, followed by the proportion of the population who are essential workers and rely on personal vehicles or public transit to commute. Proportion of uninsured and the median income are also relatively informative compared to the other variables.
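To make the per-unit risk ratio concrete, the short Python sketch below compounds it over several index units. It assumes the multiplicative (log-link) form of the model, under which a difference of k units corresponds to the ratio raised to the power k; the function name and the choice of a five-unit difference are illustrative only.

```python
def compounded_ratio(ratio_per_unit, units):
    """Ratio implied by a difference of `units` index units under a log-link model."""
    return ratio_per_unit ** units


# Reported point estimate and 95% credible interval per one-unit increase.
rr, lower, upper = 1.10, 1.08, 1.11

# Illustrative five-unit difference in the index.
five_unit_rr = compounded_ratio(rr, 5)
five_unit_interval = (compounded_ratio(lower, 5), compounded_ratio(upper, 5))
```

Under that assumption, a five-unit difference in the index corresponds to roughly 1.6 times the infections per capita.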
The spatial distribution of the BWQS infection risk index (Figure 3) largely mirrors that of infections in NYC (Supplemental Figure 1). We examined the population demographics of neighborhoods according to their BWQS infection risk index (Figure 4). The data show that Blacks have the highest population-weighted mean index and Whites have the lowest. Examining these distributions by quantile of the BWQS Index shows that White populations are overrepresented in ZCTAs in the lower quartile of the infection risk index (<25th percentile) and underrepresented in the upper quartile of infection risk (>75th percentile) ZCTAs (Supplemental Figure 2). While Whites comprise approximately 32% of NYC’s population, they only make up 11% of high infection risk ZCTAs. Conversely, Blacks and Hispanic/Latinx are 22% and 29% of NYC’s population and 31% and 42% of high risk areas, respectively.
Capacity to social distance
We found that capacity to social distance appears lower in higher neighborhood infection risk areas, as indicated by the most important variables in our neighborhood infection risk analysis. To assess whether this held in longitudinal data, we modeled differences in subway utilization across United Hospital Fund (UHF) neighborhoods in NYC. We only included UHFs with the most consistent data quality and that had subways present (Supplemental Figure 2). In order to identify the proper functional form of our nonlinear model, we fit it on the mean sigmoidal decay of subway utilization across all of NYC (Supplemental Figure 3). We then compared this model to an interaction model for UHF-level population-weighted BWQS index (Figure 5). A partial F-test demonstrated that a model with an interaction term for BWQS index categories (above versus below the median) was a significantly better fit than one without the interaction term (p<0.0001).
The interaction model indicates that there is little difference between the slopes for the high (−5.6% per day; 95% CI: −5.9, −5.3%) versus low (−6.3% per day; 95% CI: −6.7, −5.9%) infection risk areas (Table 1). However, the lower asymptote of subway utilization under social distancing policies is higher for high infection risk areas (16%; 95% CI: 15.3, 16.7%) compared to low infection risk areas (9.6%; 95% CI: 8.8, 10.1%). This implies that high risk and low risk areas had similar relative rates of decreased subway utilization upon news of the pandemic, i.e. school closures, etc. However, high risk neighborhoods had a higher relative use of the subway system after official social distancing policies (NYS on PAUSE) went into effect.
Mortality related to neighborhood infection risk index
There were 16,289 COVID-related deaths across 177 ZCTAs by May 23, 2020. Results from the negative binomial model show an association between the ZCTA BWQS infection risk index and cumulative COVID mortality incidence (Table 2). This regression model employed a spatial filtering approach to account for potential spatial autocorrelation at the ZCTA level. We found that each unit increase in the BWQS infection risk index is associated with a 21% increased risk of COVID-related mortality (Relative Risk: 1.21; 95% CI: 1.16, 1.26) when adjusting for the proportion of the population aged 65+ and accounting for spatial dependence. There was non-significant spatial autocorrelation in the residuals (Moran’s I: 0.07, p value: 0.064).
Discussion
We conducted a study using publicly-available data to identify the role of neighborhood social disadvantage on cumulative COVID-19 infections and COVID-19-related mortality. The neighborhood infection risk index was also used to understand differences in social distancing, as measured by subway ridership. In creating our neighborhood infection risk index, we found that a combination of social variables, indicative of social disadvantage, is associated with cumulative infections and mortality. Black and Hispanic/Latinx communities are overrepresented in high infection risk neighborhoods, and Whites are overrepresented in low infection risk neighborhoods, which may represent structural forms of racism. When examining differences in capacity to socially isolate, we found that high risk neighborhoods had higher subway ridership during NYS-mandated social distancing. Finally, our neighborhood infection risk index is also associated with cumulative COVID-19 mortality at the ZCTA level. This implies that the same social factors that inform increased disease risk are also associated with severe outcomes, either directly or through intermediates.
A growing body of literature is examining the greater impact of COVID-19 on communities of color. As some have noted, COVID-19 is not creating new health disparities, but exacerbating those that already exist3. A recent investigation found that county and ZCTA area-based socioeconomic measures, specifically using crowding, percent POC, and a measure of racialized economic segregation, were useful in identifying higher COVID-19 infections and mortality in Illinois and New York18. Work on COVID-19 mortality in Massachusetts has found excess death rates for areas of higher poverty, crowding, proportion POC, and racialized economic segregation19. Similarly, researchers have begun to identify counties that are particularly susceptible to severe COVID-19 outcomes using a combination of biological, demographic, and socioeconomic variables20. They identify areas with high population density, low rates of health insurance, and high poverty as particularly at risk. However, a stated limitation of this work is that many of these variables are interrelated.
Our study has many strengths. First, we acknowledge and address the strong interrelation of social variables by using a data-driven method for modeling mixtures of exposures: BWQS. By using this method, we create a composite index that captures the combined effect of the constituent variables. This process is also supervised, meaning that the variables are not weighted equally in the composite index, but instead the approach empirically learns their individual contributions to explaining the outcome. Others have addressed the multicollinearity of social determinants with the use of dimensionality reduction techniques such as principal components analysis (PCA) in the case of the neighborhood deprivation index21. However, traditional PCA only considers correlations between SES variables, whereas a supervised method captures features most relevant for the outcome. Second, our approach largely relies on ACS data, which is available across the USA, and may allow for the identification of other communities nationwide that are particularly vulnerable to future outbreaks or even other novel respiratory pathogens. Third, we explicitly excluded race and ethnicity from the creation of the index because we were more interested in identifying social processes that may facilitate infection risk, rather than those that may imply biological or behavioral explanations to health disparities22. The theory underlying these relationships is that structural racism sorts POC into areas of high disadvantage, and those structural forms of disadvantage facilitate pathogen spread. To demonstrate this, we employed the index to understand neighborhood differences in capacity to social distance. This finding provides additional evidence that low-income communities and communities of color may be less able to socially distance6. 
Fourth, our spatial analysis of COVID-19-mortality shows that the BWQS risk index may not only be useful in identifying infection risk, but also risk of severe outcomes. Finally, our data sources and analysis code are publicly-available, meaning that others can 1) reproduce these analyses, 2) expand on the work by assessing different modeling strategies and 3) assess the utility in other parts of the country.
This study also has notable limitations. First, we were unable to identify a measure of multigenerational housing at the ZCTA level, which may represent a pathway for infection, and potentially severe disease. Second, by not including race in our models, we may be missing an opportunity to tune these models to the impacts of interpersonal and structural forms of racism23. Third, early testing data in NYC were largely limited to hospitalized individuals, and therefore to those with more severe disease9. Consequently, ZCTA infection data may be confounded by the distribution of factors that drive disease severity. We addressed this by adjusting our BWQS regression for the amount of overall testing per ZCTA. Relatedly, for our spatial analysis of COVID-mortality, we were unable to access a ZCTA-level measure of chronic diseases. Since communities of color have higher rates of chronic disease at younger ages24, and chronic diseases increase the likelihood of severe COVID-19 outcomes, this is an important challenge. However, because social disparities are a major contributor to differences in the chronic conditions that increase the likelihood of severe disease, we did not want to adjust for a causal intermediate. Instead, we adjust for spatial autocorrelation to account for residual risk factors that are more similar in nearby neighborhoods. Fourth, we use pre-pandemic social variables derived from the 2018 ACS and thus do not directly account for variation in mobility25. However, this should be captured, in part, by median income and other measures of affluence in our BWQS index. Fifth, our analysis of public transit only utilized data from subway turnstiles, but not bus ridership. Although buses are an important form of transit in NYC, especially in the outer parts of the boroughs, the MTA does not provide time-varying ridership data. Further, buses were made free during the pandemic, so accurate ridership data are likely unavailable to the NYC government as well26.
Finally, an unfortunate potential consequence of creating a neighborhood risk index is the possibility of stigmatization of neighborhoods with high risk index values22. This is not our intention, and hopefully not the effect, as our goal is to identify social factors that facilitate viral spread, and demonstrate that current public health guidance is not equally observable by all populations. Therefore, it is up to policymakers and practitioners to identify those populations and design/implement interventions accordingly.
Conclusion
In this study, we created a neighborhood measure of social disadvantage that is specifically tuned to the impacts of COVID-19 infections and mortality. We show that this measure is associated with the capacity to socially distance, which may represent an important pathway for COVID-related health disparities. This is an important area of investigation given the large toll that COVID-19 has had, and will likely continue to have unless action is taken, on disadvantaged communities of color in NYC and elsewhere.
Methods
Data sources and cleaning
SARS-CoV-2 testing and COVID-19 mortality data
The New York City Department of Health and Mental Hygiene (NYC DOHMH) has been publicly releasing daily testing data (positive and total tests) at the patient’s home ZIP Code Tabulation Area (ZCTA) level since April 1, 2020, and COVID-19 related mortality data since mid-May, both available on GitHub27. The NYC DOHMH utilizes modified ZCTA geographies, designed to still be mergeable to the Census Bureau ZCTA designations. Our analyses relied on pre-pandemic demographic data to describe variation in neighborhood-level disease burden after much of the community had potential for exposure. Since spatiotemporal patterns in infection risk were highly variable at the beginning of the pandemic in relation to many independent viral introductions within NYC28, we estimated cumulative infections on May 7, 2020, four weeks after NYC’s peak infection period. We estimated time from symptom onset to death as 16 days29. Therefore we chose May 23, 2020 for our cumulative COVID-19 mortality analysis. This analysis is not human subjects research as it did not include any intervention or interaction with individuals or any identifiable private information.
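The choice of the mortality cutoff follows from simple date arithmetic, sketched here in Python. The constant names are illustrative; the dates and the 16-day lag are the values stated above.

```python
from datetime import date, timedelta

# Cutoff for cumulative infections and the assumed lag from symptom onset to death.
INFECTION_CUTOFF = date(2020, 5, 7)
ONSET_TO_DEATH_DAYS = 16

# Shift the infection cutoff by the lag to obtain the mortality cutoff.
mortality_cutoff = INFECTION_CUTOFF + timedelta(days=ONSET_TO_DEATH_DAYS)
print(mortality_cutoff.isoformat())
```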
Census data
We downloaded the Census Bureau’s 2018 American Community Survey (ACS) data via the tidycensus R package30. Data were collected for the 177 ZCTAs in NYC. Variables included: the total population, number of households, median income, median rent, health insurance status, unemployment, individuals at or below 150% of the federal poverty level, race and ethnicity, industry of employment, and mode of transportation to work. A full list of variables is provided in Supplemental Table 1. We created a proxy for proportion in essential worker positions using industry of employment variables. This estimate of essential workers was a sum of those who reported employment in the agricultural, construction, wholesale trade, transportation and utilities, and education/healthcare industries, divided by the total working-age population. To account for teachers mostly working from home while healthcare workers remained essential, we included only half of the education/healthcare industry respondents. From these data we also estimated the average household size by dividing the total population by the number of households. We utilize race and ethnicity according to the following categories: Non-Hispanic Asian, Non-Hispanic Black, Non-Hispanic White, Hispanic/Latino of any race, and aggregate all other races into Other.
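The essential-worker proxy and average household size can be illustrated with a small Python sketch. The dictionary keys, function names, and counts are hypothetical stand-ins for the ACS industry variables, not the actual variable codes.

```python
# Industries counted in full, plus education/healthcare counted at half weight.
FULL_WEIGHT_INDUSTRIES = (
    "agriculture",
    "construction",
    "wholesale_trade",
    "transportation_utilities",
)
HALF_WEIGHT_INDUSTRY = "education_healthcare"


def essential_worker_share(industry_counts, working_age_population):
    """Proxy for the proportion of the working-age population in essential jobs."""
    essential = sum(industry_counts[name] for name in FULL_WEIGHT_INDUSTRIES)
    essential += 0.5 * industry_counts[HALF_WEIGHT_INDUSTRY]
    return essential / working_age_population


def average_household_size(total_population, households):
    """Mean number of people per household."""
    return total_population / households


# Hypothetical counts for one ZCTA.
counts = {
    "agriculture": 10,
    "construction": 400,
    "wholesale_trade": 150,
    "transportation_utilities": 440,
    "education_healthcare": 2000,
}
share = essential_worker_share(counts, working_age_population=10_000)
household_size = average_household_size(total_population=30_000, households=12_000)
```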
Residential buildings and food access data
We calculated the volume of residential space by merging the NYC building footprints dataset31 with the Primary Land Use Tax Lot Output (PLUTO) dataset32. We divided total population by residential volume to calculate mean residents per residential volume, a metric of residential population density. Food access was used as a measure for the likelihood that individuals need to leave their neighborhoods for basic necessities. We estimated food access using data from New York State’s Open Data portal for Retail Food Stores33. Businesses were restricted to J, A, and C establishment code designations in order to identify those most likely to provide fresh foods and produce, and we then manually removed any businesses whose names indicated a corner store or pharmacy, or primarily selling alcohol/tobacco. We spatially joined the point locations to our ZCTA shapefile and divided the store count by the total Census population to calculate a ‘grocers per 1,000 people’ variable as a proxy for food access.
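As a minimal sketch of the two derived variables, the Python functions below compute residents per unit of residential volume and grocers per 1,000 people. The function names and input values are hypothetical, and the spatial join that produces the per-ZCTA store count is assumed to have been done already.

```python
def residential_density(total_population, residential_volume_cubic_ft):
    """Residents per cubic foot of residential building volume."""
    return total_population / residential_volume_cubic_ft


def grocers_per_1000(grocer_count, total_population):
    """Number of retained food retailers per 1,000 residents."""
    return grocer_count / total_population * 1000


# Hypothetical values for one ZCTA.
density = residential_density(total_population=50_000, residential_volume_cubic_ft=25_000_000)
food_access = grocers_per_1000(grocer_count=25, total_population=50_000)
```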
Mobility and transit data
The Metropolitan Transportation Authority (MTA) of NYC releases subway utilization data on a weekly basis34. These data include the number of entrances and exits per station. For each day and geographic area, we summed all system entrances and exits. To account for typical usage of the subway on each month and day of the week, we divided the total turnstile count for each day and area by the median daily count on the same day of the week within the same month throughout the period 2015-2019.
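The normalization against a historical baseline can be sketched as follows in Python for a single geographic area. The function names and turnstile counts are hypothetical; only the grouping rule (same month and same day of the week across the baseline years) comes from the description above.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def baseline_medians(historical):
    """Median daily count per (month, weekday) from baseline-period (date, count) pairs."""
    groups = defaultdict(list)
    for day, count in historical:
        groups[(day.month, day.weekday())].append(count)
    return {key: median(counts) for key, counts in groups.items()}


def relative_use(day, count, baselines):
    """Daily count divided by the baseline median for the same month and weekday."""
    return count / baselines[(day.month, day.weekday())]


# Hypothetical counts for one area: three March Mondays from the baseline years.
historical = [
    (date(2017, 3, 6), 1000),
    (date(2018, 3, 5), 1200),
    (date(2019, 3, 4), 1100),
]
baselines = baseline_medians(historical)

# A March Monday in 2020 with a hypothetical count of 220 entries and exits.
usage = relative_use(date(2020, 3, 23), 220, baselines)
```

In practice the same calculation would be repeated for each area, with one baseline table per area.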
Quantitative Analyses
Cross-sectional Neighborhood Infection Risk Index
Socioeconomic variables are known to be closely correlated with one another, which is a challenge to model fitting and interpretation of the underlying latent relationship. To address these challenges, we develop a weighted combination of socioeconomic variables to explain the cumulative number of COVID-19 cases per ZCTA using Bayesian weighted quantile sums regression (BWQS)35. BWQS distinguishes two groups of predictors. In one group, which comprises our socioeconomic variables, the predictors are transformed into decile ranks to limit the influence of outliers, and the coefficients are forced to lie in [0, 1] and sum to 1 with a uniform Dirichlet prior. We included a large candidate list of socioeconomic variables in the BWQS that could represent some of the underlying infection dynamics attributable to socioeconomic disadvantage. They included selected demographic variables collected from the 2018 ACS, as well as derived variables such as population density (persons per square foot of the ZCTA) and residential population density (persons per cubic foot of ZCTA residential volume). Our final list of variables was based on an iterative process according to: 1) maximizing model fit, measured by the widely applicable information criterion (WAIC), 2) removing one variable when bivariate correlations were high (|τ| ≥ 0.9), and 3) our understanding of underlying social processes in relation to infectious disease. The other group of variables in a BWQS regression are the covariates, which in our case consist solely of the population-adjusted total number of tests administered per ZCTA. We included this to account for variation in disease surveillance. The predictor is untransformed and the coefficient is less constrained, using a normal prior with mean 0 and SD 100. A negative-binomial distribution is used for the dependent variable: the cumulative number of positive SARS-CoV-2 tests per 100,000 people. 
The resulting weighted index was our neighborhood infection risk index.
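The construction of the index from decile ranks and weights can be sketched in Python. This illustrates only the weighted sum, not the Bayesian estimation of the weights. The rank convention (0 to 9, ties broken by input order), the two example variables, and the weights 0.7 and 0.3 are assumptions for the example.

```python
def decile_ranks(values):
    """Assign each value a decile rank from 0 to 9 by its position in sorted order."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, original_index in enumerate(order):
        ranks[original_index] = position * 10 // n
    return ranks


def weighted_index(rank_columns, weights):
    """Weighted sum of decile-ranked variables; weights are non-negative and sum to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(rank_columns[0])
    return [
        sum(w * column[i] for w, column in zip(weights, rank_columns))
        for i in range(n)
    ]


# Hypothetical values for ten areas and two variables.
household_size = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9]
uninsured_share = [0.20, 0.18, 0.16, 0.14, 0.12, 0.10, 0.08, 0.06, 0.04, 0.02]

columns = [decile_ranks(household_size), decile_ranks(uninsured_share)]
index = weighted_index(columns, weights=[0.7, 0.3])
```

Because the weights sum to 1, the index stays on the same scale as the decile ranks, and a variable with a larger weight moves the index more.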
We visualized the distribution of the neighborhood risk scores by self-reported race/ethnicity as per the ACS categories and total population. We also separate the neighborhood risk scores into three categories: below the 25th percentile, between the 25th and 75th percentiles, and above the 75th percentile. Populations were aggregated by race/ethnicity and then divided by the total population of the associated ZCTAs.
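The aggregation by risk category can be sketched in Python as below. The percentile thresholds are passed in rather than computed, and the three example areas, group labels, and populations are hypothetical.

```python
def risk_tier(index_value, q25, q75):
    """Classify an index value as below the 25th, above the 75th, or in between."""
    if index_value < q25:
        return "low"
    if index_value > q75:
        return "high"
    return "middle"


def group_shares_by_tier(zctas, q25, q75):
    """Sum each group's population within a tier, then divide by the tier's total."""
    totals = {}
    for zcta in zctas:
        tier_totals = totals.setdefault(risk_tier(zcta["index"], q25, q75), {})
        for group, count in zcta["population"].items():
            tier_totals[group] = tier_totals.get(group, 0) + count
    return {
        tier: {group: n / sum(groups.values()) for group, n in groups.items()}
        for tier, groups in totals.items()
    }


# Hypothetical areas with index values and population counts by group.
zctas = [
    {"index": 2.0, "population": {"white": 800, "black": 100, "hispanic": 100}},
    {"index": 5.0, "population": {"white": 400, "black": 300, "hispanic": 300}},
    {"index": 8.0, "population": {"white": 100, "black": 400, "hispanic": 500}},
]
shares = group_shares_by_tier(zctas, q25=3.0, q75=7.0)
```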
Capacity to Social Distance
Our BWQS model uses cross-sectional data to create an infection risk index, but we wanted to assess the degree to which those differences in infections were explained longitudinally by inability to socially isolate/distance. We utilized MTA transit data as a proxy for social distancing since public transit may reflect conditions that contribute to greater exposure risks. Subway stations are in a fraction of NYC ZCTAs, and individuals often traverse ZCTAs to get to a station, so we aggregated subway utilization to 42 United Hospital Fund (UHF) neighborhoods. UHF neighborhoods are composed of adjacent ZCTAs approximating community districts. Aberrantly low utilization observations (<10%) in February and early March 2020 were removed when explained by planned weekend service changes - specifically those in low subway density areas. We computed a population-weighted BWQS index per UHF.
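A population-weighted index per UHF neighborhood can be computed as in the Python sketch below. The UHF labels, populations, and index values are hypothetical, and the function name is illustrative.

```python
from collections import defaultdict


def population_weighted_index(zcta_records):
    """Population-weighted mean index per UHF from (uhf, population, index) records."""
    weighted_sum = defaultdict(float)
    population = defaultdict(float)
    for uhf, pop, index_value in zcta_records:
        weighted_sum[uhf] += pop * index_value
        population[uhf] += pop
    return {uhf: weighted_sum[uhf] / population[uhf] for uhf in weighted_sum}


# Hypothetical ZCTA-level records: (UHF label, population, BWQS index).
records = [
    ("UHF_A", 1000, 4.0),
    ("UHF_A", 3000, 6.0),
    ("UHF_B", 2000, 3.0),
]
uhf_index = population_weighted_index(records)
```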
We modeled change in relative subway usage leading up to, and during, the NYS on PAUSE period. Relative subway utilization is a proportion, and the transition from business-as-usual to social distancing roughly followed a sigmoidal decay. A mean nonlinear response can be modeled by nonlinear least squares when a functional form is specified, as implemented by the drc R package36. We utilized a four-parameter generalized Weibull function of the time index, in which c is the lower asymptote, d is the upper asymptote, b is the slope, and e is the inflection point. The time index is the date transformed to an integer, and relative use is the proportion of subway ridership. The model accommodates curve fitting with interaction terms to identify differences in model fit per group. For ease of interpretation and visualization, this was utilized to assess differences for high (above the median) versus low (below the median) BWQS index neighborhoods. We sought to identify any differences in slope (b) and the lower asymptote (c) as indicative of differences in the ability to socially isolate. An F-test was used to compare a naive model (without considering the BWQS index) to a model with interaction by high versus low BWQS index.
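The shape of such a curve and the model comparison can be sketched in Python. The specific form used here is the type-1 four-parameter Weibull parameterization from the drc package, which is an assumption about the exact formula. The parameter values, residual sums of squares, and degrees of freedom are hypothetical, and the sketch evaluates the curve rather than fitting it.

```python
import math


def weibull_decay(time_index, b, c, d, e):
    """Type-1 four-parameter Weibull curve (assumed form).

    c: lower asymptote, d: upper asymptote, b: slope, e: inflection point.
    Requires time_index > 0 and e > 0.
    """
    exponent = b * (math.log(time_index) - math.log(e))
    return c + (d - c) * math.exp(-math.exp(exponent))


def partial_f_statistic(rss_reduced, rss_full, df_reduced, df_full):
    """F statistic comparing nested least-squares fits by residual sum of squares.

    df_reduced and df_full are the residual degrees of freedom of each model.
    """
    numerator = (rss_reduced - rss_full) / (df_reduced - df_full)
    return numerator / (rss_full / df_full)


# Hypothetical parameters: ridership falls from 100% toward 10% around day 20.
params = {"b": 5.0, "c": 0.10, "d": 1.00, "e": 20.0}
early = weibull_decay(1, **params)
at_inflection = weibull_decay(20, **params)
late = weibull_decay(200, **params)

# Hypothetical fit summaries for the naive model and the interaction model.
f_stat = partial_f_statistic(rss_reduced=10.0, rss_full=5.0, df_reduced=100, df_full=96)
```

With a positive slope, this form starts near the upper asymptote and decays toward the lower one, so a higher fitted c for one group corresponds to more residual ridership after the transition.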
Neighborhood infection risk and mortality
Given high COVID-related mortality in disadvantaged communities, we wanted to assess if our measure of neighborhood infection risk was also associated with cumulative COVID mortality by total population. To do so, we employed a negative binomial model, regressing ZCTA-level COVID mortality on the BWQS infection risk index, adjusting for the proportion of the population that was greater than or equal to 65 years old. In order to adjust for spatial autocorrelation, and thus unmeasured spatial confounding, we employed a spatial filtering approach whereby we identified the eigenvector associated with spatial autocorrelation (as measured by Moran’s I) and explicitly adjusted for those values in the negative binomial regression37,38. The goal, then, was to “filter out” spatial autocorrelation from the residuals. Negative binomial models were implemented with the MASS package, supplemented with spatial functions from the spdep and spatialreg packages39,40.
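The statistic that spatial filtering aims to reduce, Moran's I, can be computed directly, as in the Python sketch below. This covers only the statistic itself, not the eigenvector selection or the negative binomial fit. The four-area chain of neighbors and the residual values are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross_product = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * cross_product / sum(z * z for z in deviations)


# Hypothetical adjacency: four areas in a row, each bordering its immediate neighbors.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1.0, 2.0, 3.0, 4.0], chain)      # similar values are adjacent
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # neighbors have opposite values
```

Positive values indicate that neighboring areas have similar residuals, and values near zero are what the filtering step aims for.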
Data Availability
All data were derived from public use datasets via government websites. All analytic code, including download procedures, is available to the public on GitHub:
https://github.com/justlab/COVID_19_admin_disparities/tree/v1
Contributions
DC and ACJ conceptualized the study and DC drafted the manuscript. DC conducted all analyses with statistical support from EC and NFP. ACJ, EC, and ND provided feedback on design and analysis. The Bayesian Weighted Quantile Sum regression was designed and implemented by EC and NFP, with the log link function implemented by NFP. KA developed the procedures and indices for relative subway utilization. JR ingested DOHMH data. All authors reviewed and approved the manuscript.
Supplemental materials
Acknowledgements
This work was supported by grants UL1TR001433 and P30ES023515. DC is funded by NIH T32HD049311. Thanks to Sebastian Rowland for his thoughtful comments on a draft.