COVID-19 Twitter-based analysis reveals differential concerns across areas with socioeconomic disparities
=========================================================================================================

* Yihua Su
* Aarthi Venkat
* Yadush Yadav
* Lisa B. Puglisi
* Samah J. Fodeh

## ABSTRACT

**Objective** We sought to understand how U.S. residents responded to COVID-19 as it emerged, and the extent to which spatial-temporal factors impacted response.

**Materials and Methods** We mined and reverse-geocoded 269,556 coronavirus-related social media postings on Twitter from January 23rd to March 25th, 2020. We then ranked tweets based on the socioeconomic status of the county they originated from using the Area Deprivation Index (ADI). We also identified areas with high initial disease counts (“hotspots”). We applied topic modeling to the tweets to identify chief concerns and determine their evolution over time. We also investigated how topic proportions varied based on ADI and between hotspots and non-hotspots.

**Results** We identified 45 topics, which shifted from early-outbreak-related content in January, to the presidential election and governmental response in February, to lifestyle changes in March. Highly resourced areas (low ADI) were concerned with stocks, social distancing, and national-level policies, while high ADI areas shared content with negative expression, prayers, and discussion of the CARES Act economic relief package. Within hotspots, these differences hold, with the addition of increased discussion regarding employment in high ADI versus low ADI hotspots.

**Discussion** Topic modeling captures the major concerns in COVID-19-related discussion on a social media platform in the early months of the pandemic. Our study extends previous studies that applied topic modeling to COVID-19-related tweets by linking the identified topics to socioeconomic status using the ADI.
Comparisons between low and high ADI areas indicate differential Twitter discussions, corresponding to greater concern with economic hardship and impacts of the pandemic in less resourced communities, and less focus on general public health messaging. **Conclusion** This work demonstrates a novel framework for assessing differential topics of conversation correlating to income, education, and housing disparities. This, with integration of COVID-19 hotspots, offers improved analysis of crisis response on Twitter. Such insight is critical for informed public health messaging campaigns in future waves of the pandemic, which should focus in part specifically on the interests of those who are most vulnerable in the lowest resourced health settings. Keywords * COVID-19 * Twitter * social media * socioeconomic status * topic modeling ## INTRODUCTION The novel severe acute respiratory syndrome coronavirus, SARS-CoV-2, which causes the disease COVID-19, has led to a global pandemic over the course of just a few months. With no specific treatment for the disease and fears of the burden of illness overwhelming health systems, the primary public health focus has been on disease mitigation strategies [1-3]. These strategies have introduced new concepts to the general public, such as social distancing and recommendations for routine masking. These mitigation efforts along with others, including travel bans, shelter-in-place orders and school closures, were anticipated to negatively affect many sectors of the U.S. economy, and they have drastically changed the quotidian lives of most Americans. Given marked community-level socioeconomic disparities and segregation in the U.S. that predated COVID-19, the impacts of these measures were likely to have disparate uptake by and impact on Americans depending on where they live [4]. 
Given the expansive geography of the United States (U.S.) and modern-day travel patterns, the disease was initially concentrated in a few cities, and these so-called “hotspots” were a primary focus of much of the initial media coverage [5]. Despite this focus, other COVID-19 hotspots with large marginalized populations later emerged [6,7], highlighting the importance of understanding differential reactions to the crisis, as this could be critical for shaping future public health communication and allocation of health resources.

Social media has been a prominent venue for personal and public health communication, both in previous public health crises and now with COVID-19. Pre-COVID-19, social media research in the context of health was primarily focused on examining the patient experience [8-12]. Comments and reviews on Twitter were used to measure healthcare quality [10] and to monitor the health status of patients along with sentiment level [12]. Twitter, specifically, has the advantage of short, real-time content availability with quick access to a network of similar discussions through hashtags. It has been beneficial in various research areas for its openness and availability [13]. It has also been useful in understanding social networks, public health messaging, and forecasting spread [14-17]. Twitter played an important role in Ebola outbreak surveillance by detecting the epidemic nearly a week before its first case [15]. Predictions of influenza infection rate [16] and ZIKV case number [17], learned from the count pattern of disease-related tweets, also proved successful. During COVID-19, Twitter has been used to assess mitigation strategies such as social distancing [18, 19] and to capture self-reported symptoms of COVID-19 [20]. To our knowledge, however, Twitter has not been used as a tool to identify trends in public responses to a health crisis at the local level while factoring in socioeconomic status.
Using public health communication to mitigate health disparities is not a novel concept [21], and is in line with future directions laid out in the National Institute on Minority Health and Health Disparities 2019 research framework [22], but the science on implementation of this approach is underdeveloped and is an area of active research. In this study, we sought to assess a novel approach to use Twitter to understand how COVID-19 related concerns differed by area socioeconomic status in the initial phase of the COVID-19 pandemic in the United States. ## MATERIALS AND METHODS ### Twitter Dataset The dataset we used for this analysis is composed of Twitter entries (tweets) in English posted by users in the United States from January 23rd to March 25th 2020. We mined the tweets with a Standard Search API using the keywords ‘coronavirus’, ‘corona virus’, ‘corona’, ‘covid’, ‘covid-19’, ‘covid 19’ and ‘covid19’. For each tweet, we obtained standard attributes: unique de-identified user ID, time of tweet, text of the tweet and four geographic coordinates (latitude and longitude) delineating the bounding box [23] from which the tweet was posted. For privacy reasons, Twitter does not provide the exact location that tweets were posted from. **Figure 1** demonstrates the overall workflow which will be further detailed in the following sections. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/18/2020.11.18.20233973/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/F1) Figure 1. Data Integration and Analysis Workflow. ### Preprocessing of Tweets We pre-processed through the removal of punctuation marks, numbers, emojis, URLs, stop words, and end of line characters. We shortened the remaining words to the root using the stemmer package provided by the NLTK toolkit [24]. 
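The cleaning steps described so far can be illustrated with a minimal, dependency-free sketch. The function name `clean_tweet`, the regular expressions, and the tiny stop-word list are assumptions made for illustration, not the authors' implementation, and the stemming step is left out to keep the example free of third-party packages.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Return lowercase tokens with URLs, numbers, punctuation, emojis and stop words removed."""
    text = URL_RE.sub(" ", text.lower())
    # Keeping only ASCII letters and whitespace drops punctuation, digits and emojis in one pass.
    text = NON_LETTER_RE.sub(" ", text)
    # str.split() with no argument also discards end-of-line characters.
    return [tok for tok in text.split() if tok not in STOP_WORDS]


example = "Stay home! 5 new cases in NYC \U0001F637 https://t.co/abc #COVID19"
tokens = clean_tweet(example)
# tokens == ['stay', 'home', 'new', 'cases', 'nyc', 'covid']
```

A stemmer from NLTK (for example `nltk.stem.PorterStemmer`) could then be applied to each token; the text does not state which NLTK stemmer was used, so that choice would be an assumption.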
We removed tweets that were with missing or invalid data such as those without a month or date of entry, valid user ID entry or valid stemmed tweet text. Finally, we filtered out tweets containing only words that occurred in less than 20 documents or more than 50% of all documents (of which only “coronavirus” was excluded) in order to achieve better topic models. This is a common approach [25, 26], used to avoid spurious associations by excluding words based on term distribution. ### Reverse Geocodes of Tweets We employed GeoPy [27] to reverse geocode the coordinates and output the county and state name of each tweet. As the bounding box provides enough information to confidently geotag the tweet at the county resolution, we used the midpoint of the rectangle of latitude and longitude coordinates of each tweet as the effective location. This location was then linked to a five-digit FIPS code, a code designed to uniquely identify counties and states in the U.S., to determine the location of tweets at the county level. We followed a similar approach in our previous work [19] to map tweets to the county level. ### Area Deprivation Index (ADI) Designation We leveraged ADI from The Neighborhood Atlas [28], a location-based socioeconomic index at the census block group level which incorporates income, education, employment and housing data and has been used to inform health delivery and policy. ADI scores range from 0 to 100, where 0 corresponds to low deprivation and 100 corresponds to high deprivation. We mapped the location of each tweet, derived from the reverse geocoding tweets process, to the median ADI score of all the census block groups within the county using its FIPS code. Counties were considered “low”, “mid”, or “high” ADI based on the ADI distribution of the unique counties represented in the dataset. 
Low ADI designation was assigned to counties from the lowest quintile of the distribution, and high ADI designation was assigned to counties from the highest quintile of the distribution as has been done with other studies using ADI [29, 30]. ### Hotspot Identification We defined hotspots in January and February as areas with any cases of COVID-19 because there were few U.S. cases in these months and they were concentrated (also as published by the New York Times [31]). For analyzing hotspots in March, we leveraged the curated resource The U.S. COVID-19 Atlas [32], defining a tweet as from a hotspot if the county was listed among the published population-adjusted hotspots. ### Topic Modeling We performed topic modeling of the tweets using a Latent Dirichlet Allocation (LDA) approach [33]. LDA is an unsupervised approach and has been proven successful in modeling topics in tweets [34]. We leveraged LDA from the MALLET package [35] to detect topics from COVID-related tweets. To determine the optimal number of topics, we compared topics by their coherence scores, which act as a proxy for interpretability by measuring the degree of semantic similarity between top words in the topic [36]. We used the topic-word distribution to annotate topics. We first ranked words of a topic and then assigned the underlying theme. ### Spatiotemporal Analysis We leveraged the document-topic probability distribution for this analysis. We compared topic prevalence over time, across low and high ADI areas, between hotspots and non-hotspots areas, and within hotspots between low ADI and high ADI areas. #### Temporal analysis of topic prevalence To study topic evolution and how the public reactions to COVID-19 varied temporally, we averaged the topic distributions of all tweets for each month. We then compared the average scores of all topics over time. For selected topics, we plotted out the daily topic dynamic to demonstrate how the daily average topic distribution changed. 
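The monthly averaging described above can be sketched as follows. This is a minimal illustration in plain Python; the function name `monthly_topic_means` and the input layout (one `(month, topic_probabilities)` pair per tweet) are assumptions, and the toy probabilities are invented.

```python
from collections import defaultdict


def monthly_topic_means(tweets):
    """Average document-topic distributions per month.

    `tweets` is an iterable of (month, topic_probs) pairs, where topic_probs
    is the list of per-topic probabilities for one tweet.
    """
    sums = {}
    counts = defaultdict(int)
    for month, probs in tweets:
        if month not in sums:
            sums[month] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[month][k] += p
        counts[month] += 1
    # Divide each per-topic sum by the number of tweets seen in that month.
    return {m: [s / counts[m] for s in sums[m]] for m in sums}


# Toy input: two topics, three tweets (values invented for illustration).
toy = [("Jan", [0.8, 0.2]), ("Jan", [0.6, 0.4]), ("Mar", [0.1, 0.9])]
means = monthly_topic_means(toy)
```

Using the posting date instead of the month as the grouping key would give the daily averages plotted for the selected topics.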
#### Spatial analysis of topic prevalence

To compare the dominant topics in counties of low versus high ADI designation, we computed the log odds ratios of dominant topics in both groups. We first identified the dominant topic (the topic with the highest probability) for each tweet, then calculated the log odds ratio of dominant topics between the two groups to achieve a fair comparison, especially when the number of tweets differed significantly between groups. The log odds ratio of a topic measures how much more likely that topic is to dominate in one group than in the other. The odds that any topic T dominates in a group G are calculated as:

$$\mathrm{odds}(T, G) = \frac{P(T \mid G)}{1 - P(T \mid G)}$$

where $P(T \mid G)$ is the proportion of tweets in group G whose dominant topic is T. The log odds ratio of any topic T between two groups G, G1 is calculated as:

$$\log \mathrm{OR}(T; G, G_1) = \log \frac{\mathrm{odds}(T, G)}{\mathrm{odds}(T, G_1)}$$

A positive log odds ratio indicates that topic T is more likely to appear in group G, and a negative log odds ratio indicates that topic T is more likely to appear in group G1. We performed the same analysis to compare topic prevalence between hotspots and non-hotspots.

We also implemented the chi-squared test and independent t-test to assess the differences in discussed topics across geographically grouped tweets. More specifically, the chi-squared test was used to determine whether there was a statistically significant difference between the expected dominant topic frequencies and observed dominant topic frequencies across the ADI groups and hotspot groups. We further leveraged independent t-tests to determine whether there was a difference between the means of the dominant topic probabilities in the low and high ADI groups.

## RESULTS

### Preprocessing and Integration of Tweets

Pre-processing resulted in 269,556 tweets (95.6% of the original geocoded dataset) from 119,611 Twitter users (of whom 63 had more than 100 tweets). This dataset represents 1,331 counties from all 50 states, the District of Columbia, and Puerto Rico. The ADI ranged from 3 to 98.
**Table 1** summarizes the characteristics of the final dataset. View this table: [Table 1.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/T1) Table 1. Characteristics of Dataset. Summary statistics of Twitter dataset in terms of user, geographic, and socioeconomic distribution. ### Topic Modeling We evaluated models ranging from 10 to 50 topics and selected the model with 45 topics given that it had the highest coherence score (0.571). **Table 2** lists all 45 topics and **Table 3** presents examples of representative tweets (tweets with the highest probability for the given topic) for selected topics. Representative tweets for all topics are available in **Supplementary Table 1**. Topics were named based on the common theme of the top words. For example, we defined topic 1 as “Shopping” due to its top words “toilet”, “paper”, “store”, “shop”, “buy”, “walmart”, and “groceri” (stemmed version of groceries). View this table: [Table 2.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/T2) Table 2. Identified topics based on LDA View this table: [Table 3.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/T3) Table 3. Example topics and the tweet with the highest probability of belonging to the topic. *Twitter handles removed to preserve Twitter users’ privacy without changing the meaning of the original tweets. **In Figure 2**, we visualize a selected number of topics using word clouds. We show the top 10 words in each topic wherein the font size in each plot reflects the importance of a word in a specific topic. Word clouds for all topics are available in **Supplementary Figure 1**. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/18/2020.11.18.20233973/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/F2) Figure 2. Visualization of the top 10 words in example topics. 
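How the top words of a topic and the representative tweet for a topic can be read off the two LDA outputs may be easier to see in code. The sketch below is a hedged illustration in plain Python: the function names, the input layouts, and the toy weights are assumptions, and only the example words come from the "Shopping" topic described above.

```python
def top_words(word_weights, n=10):
    """Return the n highest-weighted words of one topic.

    `word_weights` maps each word to its weight in the topic-word distribution.
    """
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def representative_tweets(texts, doc_topic):
    """Map each topic index to the tweet with the highest probability for that topic.

    `doc_topic[i][k]` is the probability of topic k for tweet i.
    """
    n_topics = len(doc_topic[0])
    best = {}
    for k in range(n_topics):
        i = max(range(len(texts)), key=lambda j: doc_topic[j][k])
        best[k] = texts[i]
    return best


# Toy weights (invented) for a few stemmed words of the "Shopping" topic.
shopping = {"toilet": 0.09, "paper": 0.08, "store": 0.05, "buy": 0.02}
print(top_words(shopping, n=3))
```

Ranking by weight is also what would drive the font sizes in a word cloud, where a larger weight corresponds to a larger font.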
### Comparing Topic Prevalence over Time

Through the monthly averaged distribution of topics, we delineated the topic dynamic from January to March and noted topics that peaked by month in **Figure 3**. For each month, topic prevalence compared to both of the other months had a significance of p <.0001 unless indicated otherwise.

In January (**Figure 3A**), there were significant peaks in topics such as intense expression, negative expression, and personal expression (vs. Mar, p <.001) (Row 1). These topics are associated with profanity, anxiety, and emotions. We also noted a peak in discussion regarding an early understanding of the novel disease, namely symptoms, flu deaths, and preventative measures (vs. Feb, p <.01; vs. Mar, p <.05) (Row 2). Further, we noted significant discussion regarding China, international outbreak events (vs. Feb, p <.01; vs. Mar, p <.0001), and ethnicity (Row 3), as well as tweets concerning case counts (vs. Feb, p <.05; vs. Mar, p <.0001), hotspots (vs. Feb, ns; vs. Mar, p <.0001), and confirmed cases (Row 4).

![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/18/2020.11.18.20233973/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/F3)

Figure 3. Distribution of topics grouped by month. A. Topics with higher proportions in tweets posted in January. B. Topics with higher proportions in tweets posted in February. C. Topics with higher proportions in tweets posted in March. Topics that had the same proportions for all months are not shown. Significance testing results are from a two-sided Welch’s t-test with Bonferroni correction. Significance legend: ns: 5.00e-02 < p <= 1.00e+00. \*: 1.00e-02 < p <= 5.00e-02. \*\*: 1.00e-03 < p <= 1.00e-02. \*\*\*: 1.00e-04 < p <= 1.00e-03. \*\*\*\*: p <= 1.00e-04.

In February (**Figure 3B**), there was a significant rise in discussion surrounding the election, President Trump (Row 1), news articles, stocks (Row 2), the task force conference, and the CDC (Row 3). February also saw significant discussion surrounding vaccines (vs. Mar, p <.0001) and travel (vs. Jan, p <.05; vs. Mar, p <.0001) (Row 4).

In March (**Figure 3C**), there was a rise in discussions related to social distancing and disease mitigation strategies, namely closures, cancellations (vs. Jan, p <.0001; vs. Feb, p <.001), social distancing, staying home, online media (vs. Jan, p <.05; vs. Feb, p <.0001), and education (Row 1). In general, there were higher topic proportions of activities related to quarantine, in particular exercising, sport, shopping, prayers, words related to time, and adaptation (vs. Feb, p <.0001) (Row 2). March also saw more dissemination of information, discussion regarding the CARES Act, discussion of cases in Florida and New York, and tweets related to employment and local business support (Row 3). Finally, in March there was a significantly higher proportion of tweets related to the pandemic (vs. Jan, p <.0001; vs. Feb, p <.001), public health measures, tests and test results, and also a higher prevalence of COVID-related hashtags (Row 4).

### Comparing Topic Prevalence between Low and High ADI areas

ADI-specific analysis revealed significant differences in topic prevalence between low and high ADI areas. Comparing areas at the highest and lowest quintiles of ADI designation demonstrated differential effects (p <.001) in tweets by county level socioeconomic resourcing. Topics that are more likely to dominate in high ADI counties and low ADI counties are shown in **Figure 4A**.
Tweets from high ADI areas are more likely to share emotional content with intense (p <.0001), negative (p <.01), personal expression (p <.01) or prayers (p <.05), as well as news regarding confirmed cases, the outbreak in China, flu deaths, and the CARES Act (all p <.0001). On the other hand, tweets from low ADI areas were more likely to discuss the impact of COVID-19 on hotspots, local businesses, and New York status (all p <.0001). Topics related to the larger public health crisis (p <.001) and pandemic (p =.001), as well as dissemination of information (p <.0001), stocks (p <.01), and the task force conference (p =.01), were also significantly more prevalent in tweets from lower ADI areas. These areas were also more concerned about the progress of potential treatments like vaccines (p <.001). While tweets with political topics about elections (p =.937) and president Trump (p =.605) were more likely to come from low ADI areas, the differences were not statistically significant. Observing the topic proportion progress from January through March (**Figure 4B**), we noticed that “Intense Expression” and “CARES Act” topics had consistent trends at both high and low ADI areas, with the high ADI areas having an overall higher daily average topic probability. Furthermore, topics associated with public health policies and disease mitigation strategies in March such as “Social Distancing” and “Local Business Support” arose in tweets from low ADI areas at a higher prevalence than tweets from high ADI areas. ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/18/2020.11.18.20233973/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/F4) Figure 4. Topic prevalence comparisons between High and Low ADI based on Log odds ratio. A. Topics with significance difference between both groups (p <.05) B. Topic dynamic for example topics. 
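The log odds ratios behind comparisons such as Figure 4A can be sketched in a few lines. This is a minimal illustration that assumes the standard definition of odds, p / (1 - p), with p the share of a group's tweets whose dominant topic is T; the function names and the toy topic assignments are invented for the example.

```python
import math
from collections import Counter


def odds(dominant_topics, topic):
    """Odds that `topic` is the dominant topic among one group's tweets.

    `dominant_topics` lists the dominant topic index of every tweet in the group.
    """
    p = Counter(dominant_topics)[topic] / len(dominant_topics)
    if p == 0.0 or p == 1.0:
        # The odds (or their logarithm) are undefined in these edge cases.
        raise ValueError("topic never or always dominates in this group")
    return p / (1.0 - p)


def log_odds_ratio(group_a, group_b, topic):
    """Positive when `topic` is more likely to dominate in group_a than in group_b."""
    return math.log(odds(group_a, topic) / odds(group_b, topic))


# Toy data (invented): dominant topic index per tweet in each group.
high_adi = [0, 0, 1, 2]      # topic 0 dominates 2 of 4 tweets, odds = 1
low_adi = [0, 1, 1, 2, 2]    # topic 0 dominates 1 of 5 tweets, odds = 0.25
print(log_odds_ratio(high_adi, low_adi, 0))  # log(4), about 1.386
```

Because the odds are computed from within-group proportions, the comparison is not distorted by one group contributing far more tweets than the other, which is the motivation given for this measure in the Methods.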
### Comparing Topic Prevalence between Hotspots and Non-Hotspots The differences in the dominant topics’ prevalence between hotspots and non-hotspots areas were significant (p <.001). **Figure 5** demonstrates that tweets from hotspots had a higher log odds ratio for topics including New York, social distancing, public health and pandemic, information dissemination, exercise/sport, education, time, closures and employment. Tweets that were not posted from hotspots expressed negative or intense emotion, concern regarding the CDC guidelines and task force conference, international events and flu deaths, as well as stocks and shopping. ![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/18/2020.11.18.20233973/F5.medium.gif) [Figure 5.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/F5) Figure 5. Topic prevalence between hotpots vs non-hotspots based on log odds ratio. ### Comparing Topic Prevalence Within Hotspots between Low and High ADI areas We examined the tweets within the hotspots and compared topic prevalence between areas of high ADI and low ADI. **Figure 6A** demonstrates that tweets associated with confirmed cases, closures, intense expression, and hashtags were more prevalent from high ADI hotspots. Notably, tweets regarding employment concerns (p <.001) were also more likely to come from high ADI hotspots, which wasn’t significant in previous analysis comparing ADI and hotspots separately. Furthermore, tweets from low ADI hotspots were significantly more concerned with exercise, stocks, information dissemination, vaccine treatment, and cases in New York. We next observed the topic dynamics for selected topics from tweets collected in March (note that no high ADI areas were hotspots in January and February). There were notable spikes in employment concerns and intense expression from high ADI hotspots, whereas these topics remain consistent throughout the month for tweets from low ADI hotspots. 
Tweets about New York and social distancing remained consistently high in tweets from low ADI hotspots.

![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/11/18/2020.11.18.20233973/F6.medium.gif)

[Figure 6.](http://medrxiv.org/content/early/2020/11/18/2020.11.18.20233973/F6)

Figure 6. Topic prevalence comparisons within Hotspots between low and high ADI areas. A. Topics with a significant difference between the two groups (p <.05). B. Topic dynamic for example topics.

## DISCUSSION

Topic modeling of COVID-19-related social media data from Twitter demonstrates significant differences in individual responses to the pandemic based on geographic area, local disease prevalence and socioeconomic status. Over the progression of the crisis, tweets varied in topic, and these variations evolved over time and differed across counties with socioeconomic disparities. To our knowledge, this is the first study to link the ADI to geocoded tweets in order to explore the impact of geographic area-based socioeconomic status on tweet content.

This analysis follows the early pandemic timeline and establishes that topic modeling performs well in identifying major subjects of discussion on Twitter and successfully captures the nuances of their variability. Topic modeling has been applied to COVID-19-related tweets in an overlapping window of time (January 23 to March 7, 2020) [37]; however, only a limited number of topics, and thus concerns, were identified, and no analysis of how topics emerged during that period was reported. As the first cases of COVID-19 broke news in January, we found fear sentiment in tweets, as people were broadly focused on disseminating as much information as possible. Similar conclusions were reported by Xue et al. [37]. As time progressed, we found a massive increase in the number of tweets and an increasing focus on local cases and events, public health information dissemination and testing, and quarantine activities.
Topic prevalence over time was explored in COVID-19 related tweets by Ordun et al. [38]; however, the analysis was limited to reporting trends and did not extend to linking the trending topics to other health or social factors. In our study, we have linked topic prevalence to socioeconomic status. Specifically, the topic prevalence comparisons between low and high ADI areas demonstrated that tweets from high ADI areas were more likely to share content regarding personal experiences, which ranged from positive affirmations of hope and prayers to negative and intense expressions of anxiety or frustration. This was not surprising given that the pandemic and the associated economic fallout have disproportionately impacted poorer communities, and Black and Hispanic communities have faced some of the highest rates of unemployment [39]. This was in many ways a result of the hard-hit industries overrepresented in these communities as well as the inability to perform jobs in these industries from home [40, 41]. Furthermore, centuries of structural racism in the United States have led to lower resourcing in these areas and higher rates of medical co-morbidities that have been shown to increase COVID-19 risk [39], all potentially contributing factors to an increase in intense, negative, and personal discussion in these areas pertaining to the public health and economic crisis.

Tweets from low ADI areas in March showed more discussion of social distancing and local business support, as quarantine policies hurt local businesses and resulted in discussions about bill relief to support these businesses. This result is consistent with the quicker response to stay-at-home orders from low ADI areas and is in line with recent reports of movement dynamic differences between low-income and high-income areas [42].
The higher prevalence of discussion surrounding stocks that was noted in low ADI areas was consistent with greater stock market wealth residing among the wealthiest US households [43].

In the comparison between low and high ADI areas within hotspots, we identified that tweets with intense expression and those about employment insecurity were significantly more likely to come from high ADI hotspots, reinforcing the notion that, even after restricting to areas with high case counts, income and racial disparities result in disproportionate effects of closures and job loss [41]. Furthermore, low ADI counties were significantly more concerned with information dissemination, cases in New York (on average a large low ADI hotspot), stocks, and vaccine treatment, which is consistent with our nationwide analysis showing that low ADI areas had an increased focus on social and institutional reactions to the crisis.

Our approach of integrating a location-based socioeconomic index with Twitter topics offered increased insight into the topics inferred from the text, allowing a novel framework for assessing differential topics of conversation as they correlate to income, education, and housing disparities. Our integration of published COVID-19 hotspots further provides time-specific information on disease spread and how it corresponds to topics discussed on Twitter. These nuances are valuable for recognizing how public health communication, resource allocation policy, and information dissemination can best shape health crisis response to the needs of different communities, especially those with the lowest health resourcing, in future waves of the pandemic and emerging infectious disease outbreaks. Future public health efforts may use Twitter topic modeling to target messaging to the unique concerns of local communities and study the impact of health resource utilization.

### Limitations

Our study successfully explored pandemic-related topics of conversation across tweets.
However, there were a few limitations. For technical reasons on the server, fewer tweets were scraped on some dates. Nevertheless, our previous work [44] has shown that we were still able to glean valuable conclusions from our data that represent the early pandemic progression. Another limitation for all Twitter-based research is that tweets posted from private accounts could not be retrieved from the API. Furthermore, due to restrictions with Twitter geocoding, there was some degree of positional inaccuracy that we accepted in our study design: we were only able to collect geographic coordinates to the resolution of a county, and therefore characterized each tweet by the county rather than the census tract or block group. Given the inherent geographic masking techniques used by Twitter to promote confidentiality, and our study design, which involved cross-area estimation and simple geographic centroid assessment [19], we acknowledge aggregation bias as a study limitation. Despite this, we found that, on average, the county ADI was distributed such that the median ADI was a reasonable approximation for the county.

## CONCLUSION

Twitter analysis linking geocoded tweets to markers of geographic socioeconomic resourcing demonstrates that the COVID-19 pandemic has differentially impacted areas of the United States that are already institutionally underserved. This finding highlights the need to address the specific fears and concerns of these communities through personalized public health messaging and policy reform, addressing consistent issues such as job security and negative emotions likely associated with greater instability during the crisis. Our work indicates the emerging utility of linking natural language processing analyses of real-time social media data to measures of social determinants of health.
## Supporting information Supplementary Figure 1 [[supplements/233973_file03.pdf]](pending:yes) Supplementary Table 1 [[supplements/233973_file04.docx]](pending:yes) ## Data Availability Publicly available Tweets downloaded using the Twitter API. ## FUNDING This research was supported in part by the Gruber Foundation (to A.V.). ## CONFLICT OF INTEREST STMT There is no conflict of interest. ## FIGURE LIST Supplementary Figure 1. Visualization of the top 10 words in all topics. * Received November 18, 2020. * Revision received November 18, 2020. * Accepted November 18, 2020. * © 2020, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## REFERENCES 1. 1.Centers for Disease Control and Prevention. If You Are Sick or Caring for Someone.; 2020. [https://www.cdc.gov/coronavirus/2019-ncov/if-you-are-sick/index.html](https://www.cdc.gov/coronavirus/2019-ncov/if-you-are-sick/index.html). Accessed April 4, 2020. 2. 2.Centers for Disease Control and Prevention. Social Distancing, Quarantine, and Isolation.; 2020. [www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/social-distancing.html](http://www.cdc.gov/coronavirus/2019-ncov/prevent-getting-sick/social-distancing.html). Accessed April 4, 2020. 3. 3.Wilder-Smith A, Freedman DO. Isolation, quarantine, social distancing and community containment: pivotal role for old-style public health measures in the novel coronavirus (2019-nCoV) outbreak, Journal of Travel Medicine, Volume 27, Issue 2, March 2020. 4. 4.Buchanan L, Patel J, Rosenthal B, et al. A Month of Coronavirus in New York City: See the Hardest-Hit Areas. The New York Times, 1 April 2020. 
[https://www.nytimes.com/interactive/2020/04/01/nyregion/nyc-coronavirus-cases-map.html](https://www.nytimes.com/interactive/2020/04/01/nyregion/nyc-coronavirus-cases-map.html). Accessed July 20, 2020.
5. Chiwaya N, Murphy J. Tracking new coronavirus cases in the first wave of hot spots across the United States. NBC News, 1 April 2020. [https://www.nbcnews.com/health/health-news/coronavirus-count-state-day-2020-united-states-n1173421](https://www.nbcnews.com/health/health-news/coronavirus-count-state-day-2020-united-states-n1173421). Accessed July 20, 2020.
6. Chowkwanyun M, Reed A. Racial health disparities and Covid-19 – caution and context. N Engl J Med 2020.
7. Oppel RA Jr, Gebeloff R, Lai KK, et al. The Fullest Look Yet at the Racial Inequity of Coronavirus. The New York Times, 5 July 2020. [https://www.nytimes.com/interactive/2020/07/05/us/coronavirus-latinos-african-americans-cdc-data.html](https://www.nytimes.com/interactive/2020/07/05/us/coronavirus-latinos-african-americans-cdc-data.html). Accessed July 17, 2020.
8. Afyouni S, Fetit AE, Arvanitis TN. #DigitalHealth: exploring users’ perspectives through social media analysis. Stud Health Technol Inform 2015;213:243–6.
9. Benetoli A, Chen TF, Aslani P. How patients’ use of social media impacts their interactions with healthcare professionals. Patient Education and Counseling 2018;101(3):439–444.
10. Greaves F, Ramirez-Cano D, Millett C, Darzi A, Donaldson L. Use of sentiment analysis for capturing patient experience from free-text comments posted online. J Med Internet Res 2013;15(11):e239–51.
11. Alemi F, Torii M, Clementz L, Aron DC.
Feasibility of real-time satisfaction surveys through automated analysis of patients’ unstructured comments and sentiments. Qual Manag Health Care 2012;21(1):9–19.
12. Kashyap R, Nahapetian A. Tweet Analysis for User Health Monitoring. 2015;348–351.
13. Williams SA, Terras M, Warwick C. What people study when they study Twitter: Classifying Twitter related academic papers. Journal of Documentation 2013;69.
14. Ahmed W, Bath PA, Sbaffi L, et al. Novel insights into views towards H1N1 during the 2009 Pandemic: a thematic analysis of Twitter data. Health Info Libr J 2019;36(1):60–72.
15. Odlum M, Yoon S. What can we learn about the Ebola outbreak from tweets? Am J Infect Control 2015;43(6):563–571.
16. Paul MJ, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLoS Curr 2014;6.
17. Masri S, Jia J, Li C, et al. Use of Twitter data to improve Zika virus surveillance in the United States during the 2016 epidemic. BMC Public Health 2019;761.
18. Younis J, Freitag H, Ruthberg JS, Romanes JP, Nielsen C, Mehta N. Social Media as an Early Proxy for Social Distancing Indicated by the COVID-19 Reproduction Number: Observational Study. JMIR Public Health Surveill 2020;6(4):e21340.
19. Kwon J, Grady C, Feliciano JT, Fodeh SJ. Defining facets of social distancing during the COVID-19 pandemic: Twitter Analysis. Journal of Biomedical Informatics 2020.
Sarker A, Lakamana S, Hogg-Bremer W, Xie A, Al-Garadi MA, Yang YC. Self-reported COVID-19 symptoms on Twitter: an analysis and a research resource. J Am Med Inform Assoc 2020.
20. Ahmed W, Vidal-Alaball J, Downing J, Lopez Segui F. COVID-19 and the 5G Conspiracy Theory: Social Network Analysis of Twitter Data. J Med Internet Res 2020;22(5):e19458.
21. Freimuth VS, Quinn SC. The contributions of health communication to eliminating health disparities. American Journal of Public Health 2004;94(12):2053–2055.
22. Alvidrez J, Castille D, Laude-Sharp M, Rosario A, Tabor D. The National Institute on Minority Health and Health Disparities Research Framework. Am J Public Health 2019;109(S1):S16–S20.
23. Formal Twitter vocabulary. [https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/guides/basic-stream-parameters](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/guides/basic-stream-parameters)
24. Bird S, Loper E, Klein E. Natural Language Processing with Python. O’Reilly Media Inc. 2009.
25. Fan A, Doshi-Velez F, Miratrix L. Assessing topic model relevance: Evaluation and informative priors. Statistical Analysis and Data Mining: The ASA Data Science Journal 2019;12(3):210–222.
26. Ming ZY, Wang K, Chua TS. Vocabulary Filtering for Term Weighting in Archived Question Search. In: Zaki MJ, Yu JX, Ravindran B, Pudi V
(eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg.
27. Geopy 2.0.0. [https://pypi.org/project/geopy/](https://pypi.org/project/geopy/). Accessed July 15, 2020.
28. University of Wisconsin School of Medicine and Public Health. 2015 Area Deprivation Index v2.0. [https://www.neighborhoodatlas.medicine.wisc.edu/](https://www.neighborhoodatlas.medicine.wisc.edu/). Accessed May 12, 2020.
29. Knighton AJ, Savitz L, Belnap T, Stephenson B, VanDerslice J. Introduction of an Area Deprivation Index Measuring Patient Socioeconomic Status in an Integrated Health System: Implications for Population Health. EGEMS (Wash DC) 2016;4(3):1238.
30. Vart P, Coresh J, Kwak L, Ballew SH, Heiss G, Matsushita K. Socioeconomic Status and Incidence of Hospitalization With Lower-Extremity Peripheral Artery Disease: Atherosclerosis Risk in Communities Study. Journal of the American Heart Association: Cardiovascular and Cerebrovascular Disease 2017;6.
31. Data from The New York Times, based on reports from state and local health agencies. [https://github.com/nytimes/covid-19-data](https://github.com/nytimes/covid-19-data). Accessed May 1, 2020.
32. Li X, Lin Q, Kolak M. The U.S. COVID-19 Atlas, 2020. [https://www.uscovidatlas.org](https://www.uscovidatlas.org). Accessed April 3, 2020.
33. Blei D, Ng A, Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research 2003;3:993–1022.
34. Negara ES, Triadi D, Andryani R. Topic Modelling Twitter Data with Latent Dirichlet Allocation Method. 2019 International Conference on Electrical Engineering and Computer Science (ICECOS) 2019;386–390.
35. McCallum AK.
MALLET: A Machine Learning for Language Toolkit. [http://mallet.cs.umass.edu](http://mallet.cs.umass.edu). 2002.
36. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python. JMLR 2011;2825–2830.
37. Xue J, Chen J, Chen C, Zheng C, Li S, Zhu T. Public discourse and sentiment during the COVID 19 pandemic: Using Latent Dirichlet Allocation for topic modeling on Twitter. PLoS One 2020;15(9):e0239441.
38. Ordun C, Purushotham S, Raff E. Exploratory analysis of covid-19 tweets using topic modeling, umap, and digraphs. arXiv preprint arXiv:2005.03082. 2020.
39. Galea S, Abdalla SM. COVID-19 Pandemic, Unemployment, and Civil Unrest: Underlying Deep Racial and Socioeconomic Divides. JAMA 2020.
40. Spievack N, Gonzalez J, Brown S. Latinx unemployment is highest of all racial and ethnic groups for the first time on record. Urban Wire 2020.
41. Long H, Van Dam A. U.S. unemployment rate soars to 14.7 percent, the worst since the Depression era. The Washington Post, 8 May 2020. [https://www.washingtonpost.com/business/2020/05/08/april-2020-jobs-report/#comments-wrapper](https://www.washingtonpost.com/business/2020/05/08/april-2020-jobs-report/#comments-wrapper). Accessed August 4, 2020.
42. Valentino-DeVries J, Lu D, Dance GJX. Location Data Says It All: Staying at Home During Coronavirus is a Luxury. The New York Times, 3 April 2020. [https://www.nytimes.com/interactive/2020/04/03/us/coronavirus-stay-home-rich-poor.html](https://www.nytimes.com/interactive/2020/04/03/us/coronavirus-stay-home-rich-poor.html). Accessed April 3, 2020.
43. Ricketts L. When the Stock Market Rises, Who Benefits? Federal Reserve Bank of St. Louis, 2018. [https://www.stlouisfed.org/on-the-economy/2018/february/when-stock-market-rises-who-benefits](https://www.stlouisfed.org/on-the-economy/2018/february/when-stock-market-rises-who-benefits). Accessed July 17, 2020.
44. de Smith MJ, Goodchild MF, Longley PA.
Centroids and centers. Geospatial analysis, 5th edn. 2015. [http://www.spatialanalysisonline.com/HTML/index.html?centroids\_and\_centers.htm](http://www.spatialanalysisonline.com/HTML/index.html?centroids_and_centers.htm). Accessed July 17, 2020.

[1]: /embed/graphic-2.gif
[2]: /embed/graphic-3.gif