Investigating mental and physical disorders associated with COVID-19 in online health forums

Objectives: Online health forums provide rich and untapped real-time data on population health. Through novel data extraction and natural language processing (NLP) techniques, we characterise the evolution of mental and physical health concerns relating to the COVID-19 pandemic among online health forum users.
Setting and design: We obtained data from 739,434 posts by 53,134 unique users of three leading online health forums: HealthBoards, Inspire and HealthUnlocked, from the period 1st January 2020 to 31st May 2020. Using NLP, we analysed the content of posts related to COVID-19.
Primary outcome measures: (i) Proportion of forum posts containing COVID-19 keywords; (ii) proportion of forum users making their very first post about COVID-19; (iii) number of COVID-19 related posts containing content related to physical and mental health comorbidities.
Results: Posts discussing COVID-19 and related comorbid disorders spiked in early to mid-March, around the time of global implementation of lockdowns, prompting a large number of users to post on online health forums for the first time. The pandemic and corresponding public response have had a significant impact on posters' queries regarding mental health.
Conclusions: We demonstrate it is feasible to characterise the content of online health forum user posts regarding COVID-19 and measure changes over time. Social media data sources such as online health forums can be harnessed to strengthen population-level mental health surveillance.


Introduction
Measures to tackle the COVID-19 pandemic have resulted in unprecedented societal restrictions worldwide. The mental health impacts of these measures and accompanying socioeconomic stressors are likely to be extensive; identifying and quantifying these impacts are now an urgent priority. [1] For example, social distancing restrictions make it harder to maintain regular contact between individuals and their friends and family as well as health and social care professionals. Furthermore, the psychological and emotional burden of the pandemic (and its consequences) may increase risk of relapse or worsen existing mental health disorders. Conversely, mental disorders can increase susceptibility to infections. [2,3] Real-world data from online resources may be extracted using natural language processing (NLP) techniques to provide automated, population-level health surveillance. These methods can be used to rapidly ascertain discussion related to COVID-19 and associated symptoms and comorbidities. NLP has previously been used to identify medically relevant information from web pages and analyse extracted text. [4,5] Applying these techniques to real world data sources such as social media and online forums may be used to supplement active data collection from participants in prospective observational research. Recent studies have applied this approach to Twitter, Facebook and Reddit data to forecast the emergence of depression and post-traumatic stress disorder, [6] predict depression in the general population, [7] identify mothers at risk of postpartum depression, [8] and investigate suicidal ideation. [9] While social media platforms such as Twitter, Facebook and Reddit are commonly used, other internet resources such as online health forums have so far been neglected. Online health forums are enriched for health information and receive millions of posts each year, therefore providing untapped reservoirs of healthcare data at population level.
In a recent proof-of-concept study we demonstrated that data can be extracted from online health forums to detect health discussion trends that correlate with real-life events. [10] Here, we use the same technology to analyse online health forum data discussing mental and physical health problems associated with the COVID-19 pandemic. We use NLP techniques to extract data from online health forum posts related to the COVID-19 pandemic, references to specific comorbid illnesses, and their direct and indirect impacts on mental or physical health.
medRxiv preprint doi: https://doi.org/10.1101/2020.12.14.20248155; this version posted December 16, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Study Design and Setting
We obtained data from online health forums using NLP. Online forums are discussion websites hosted on the internet where people hold conversations in the form of posted messages. A single conversation is called a thread. Threads are chains of posts identified within a forum by a title and an individual URL. Clicking on the thread title opens the thread, which contains one or more posts; these may be from the same user who started the thread (i.e. the original poster) or from different users who have replied within the thread. In this study, we analysed text data in thread titles and in individual posts within a thread. We analysed posts written in English only. Depending on the forum's settings, users can be anonymous or have to register with the forum to post messages, with most users opting not to use personally identifiable information to register their account. Registration may not be required for read-only access. Most forums recommend that users do not use personally identifiable information when posting. Online health forums specifically cover health topics and offer peer support for various health conditions. We collected data from three major online health forums posted from 1st January 2020 to 31st May 2020: HealthBoards (www.healthboards.com), Inspire (www.inspire.com) and HealthUnlocked (www.healthunlocked.com). These forums were chosen on the basis that they have global user coverage, include subforums on several aspects of healthcare, have a large user base contributing to regular activity on the forum, and are feasible to extract information from using NLP.
HealthBoards was founded in California, USA in 1997 and offers patient to patient health support. Inspire, founded in 2005, is a US healthcare social network managing online support groups for patients and caregivers. HealthUnlocked is a British online health forum launched in 2011 with a similar offering to HealthBoards and Inspire. Registration and participation in all three forums are free of charge to users.

Definition of search terms
To investigate the potential impact of COVID-19 on users posting in online health forums, we classified threads and posts using case-insensitive keywords in four groups: keywords related to the COVID-19 pandemic itself; medical treatment in an intensive care unit; physical symptoms as a direct consequence of COVID-19 infection; and mental health symptoms as a consequence of measures taken in response to the pandemic.
Table 1 provides the final keywords used to search posts within COVID-19 related threads; the Python coded search terms are provided in Supplementary Tables 2 and 3. We tested the specificity of keywords by searching for matches occurring before 1st January 2020. For threads, these were matches in the title and URL, while for posts, these were matches in the entire text (see Extracting and matching keywords below). Term incidence and excluded keywords are provided in Supplementary Table 5.

Data pre-processing
Data obtained from different online health forums come in various formats. We standardised and normalised the data before analysing them. This included normalisation of Unicode strings and whitespace characters, standardisation of date and time, and standardisation of location through the GeoNames.org database.
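
The text and timestamp normalisation steps described here can be illustrated with a minimal sketch. The helper names `normalise_text` and `normalise_timestamp`, the choice of the NFKC normalisation form, and the use of ISO 8601 in UTC as the target date format are all assumptions made for illustration; the paper does not state which conventions were used. The GeoNames location lookup is omitted.

```python
import re
import unicodedata
from datetime import datetime, timezone


def normalise_text(raw: str) -> str:
    """Normalise Unicode and collapse whitespace (NFKC is an assumption)."""
    text = unicodedata.normalize("NFKC", raw)
    # Collapse any run of whitespace (tabs, newlines, stray spaces) to one space.
    return re.sub(r"\s+", " ", text).strip()


def normalise_timestamp(raw: str, fmt: str) -> str:
    """Parse a forum-specific timestamp and return it as ISO 8601 in UTC.

    `fmt` is the strptime format of the source forum (hypothetical here).
    Timestamps without a time zone are assumed to be in UTC.
    """
    parsed = datetime.strptime(raw, fmt)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()
```

Normalising each forum's export to a single text and date convention before matching means that the same keyword patterns and weekly bins can be applied to all three sources.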

Extracting and matching keywords
We extracted the keywords in thread titles and post content using lemmatisation. For flexibility and efficiency, search terms in posts and thread titles were matched using regular expressions that accounted for both inflection and common spelling variants. Matching was case-insensitive and limited to whole words in the post content and thread title; when matching thread URLs, parts containing words were considered. To prevent spurious matches, words shorter than four letters (e.g. ICU) were considered valid matches only if they were delimited by non-word characters.
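
The matching rules above can be sketched as follows. This is an illustrative approximation rather than the study's code: the function names `text_matcher` and `url_matcher` are hypothetical, lemmatisation is not reproduced, and the handling of short keywords in URLs follows the description given in the key to regular expressions (Supplementary Table 1).

```python
import re


def text_matcher(pattern: str) -> re.Pattern:
    """Whole-word, case-insensitive matcher for post content and thread titles."""
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def url_matcher(pattern: str, short_keyword: bool) -> re.Pattern:
    """Matcher for thread URLs, where words are typically joined by '_' or '-'.

    Short keywords (for example "icu") must be delimited by non-alphanumeric
    characters; longer keywords may match anywhere in the URL.
    """
    if short_keyword:
        return re.compile(
            rf"(?<![a-z0-9])(?:{pattern})(?![a-z0-9])", re.IGNORECASE
        )
    return re.compile(pattern, re.IGNORECASE)


icu_text = text_matcher("icu")
icu_url = url_matcher("icu", short_keyword=True)

# A whole-word occurrence is accepted in post text ...
print(bool(icu_text.search("Dad was moved to ICU.")))
# ... but the same letters embedded in "difficult" are not.
print(bool(icu_text.search("I find it difficult to sleep")))
# In a URL, the underscores act as delimiters for the short keyword.
print(bool(icu_url.search("/threads/my_dad_in_icu_now")))
```

The URL case needs its own delimiter rule because the regular expression word boundary `\b` treats the underscore as a word character, so it would not separate words in a slug such as `my_dad_in_icu_now`.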

Analysis of COVID-19 threads to identify changes in COVID-19 related user activity and physical and mental health associations over time
We identified the users contributing to a COVID-19 related thread in a given week. We then retrieved all the other posts made by the same authors in the previous, same and subsequent calendar weeks. We scanned such posts for physical symptom, mental health symptom or intensive care keywords as defined in Table 1, and recorded whether each of these topics was mentioned by the author during the time window. We performed this analysis to establish variations in the prevalence of concerns relating to physical symptom, mental health symptom and intensive care keywords over the course of the pandemic during 2020. Weekly counts were measured each Sunday for the previous week.
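
The three-week window analysis can be sketched as below. The `Post` record, the function names and the example topic pattern are hypothetical and stand in for the study's data model and the Table 1 keywords; weeks are assumed to run from Monday to Sunday, in line with counts being taken each Sunday for the previous week.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Post:
    user: str
    day: date
    in_covid_thread: bool
    text: str


def week_ending(day: date) -> date:
    """Return the Sunday that closes the Monday-to-Sunday week containing `day`."""
    return day + timedelta(days=6 - day.weekday())


def weekly_topic_users(posts, topics):
    """Count, per week, the COVID-19 thread contributors who mention each topic
    in any of their posts from the previous, same or following week."""
    active = defaultdict(set)          # week -> users posting in COVID-19 threads
    texts = defaultdict(list)          # (user, week) -> all post texts, any thread
    for post in posts:
        week = week_ending(post.day)
        texts[(post.user, week)].append(post.text)
        if post.in_covid_thread:
            active[week].add(post.user)

    counts = {}
    for week, users in active.items():
        window = (week - timedelta(days=7), week, week + timedelta(days=7))
        counts[week] = {
            name: sum(
                any(
                    pattern.search(text)
                    for w in window
                    for text in texts.get((user, w), [])
                )
                for user in users
            )
            for name, pattern in topics.items()
        }
    return counts
```

Each user is counted at most once per topic per week, so the result is a number of users rather than a number of posts, which matches the quantity plotted in Figure 1.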

Analysis of thread titles
We inspected thread titles to identify how many mentioned a comorbidity in the title. We searched for terms related to autoimmune disorders, mental disorders or worry, cancer, cardiovascular problems or stroke, and diabetes as listed in Table 2; the Python coded search terms are provided in Supplementary Table 4.

Analysis of first-time posters in a COVID-19 related thread
We analysed the first ever post published by a user to determine the proportion of first-time posters who started out by contributing to a COVID-19 related thread. We performed this analysis to determine the degree to which new users were motivated to make their first post in relation to the COVID-19 pandemic and how this varied over time during 2020.
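
A minimal sketch of this calculation is given below, assuming each post is available as a `(user, day, in_covid_thread)` tuple and that weeks end on Sunday. The function name and input layout are hypothetical; with day-level dates, ties between posts made on the same day are resolved by input order.

```python
from collections import Counter
from datetime import date, timedelta


def first_post_covid_share(posts):
    """Return {week_ending: share of first-time posters whose first post
    was in a COVID-19 related thread}.

    `posts` is an iterable of (user, day, in_covid_thread) tuples.
    """
    first = {}
    # sorted() is stable, so same-day posts keep their input order.
    for user, day, in_covid in sorted(posts, key=lambda p: p[1]):
        first.setdefault(user, (day, in_covid))

    totals, covid = Counter(), Counter()
    for day, in_covid in first.values():
        week = day + timedelta(days=6 - day.weekday())  # Sunday closing the week
        totals[week] += 1
        covid[week] += int(in_covid)
    return {week: covid[week] / totals[week] for week in totals}
```

The denominator is the number of users making their first ever post in that week, not the number of all active users, so the share reflects only new entrants to the forum.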

Implementation and computation
All descriptive analyses were performed using bespoke software written in Python. An outline of the coding approach employed is included in the Supplementary Material.

Ethics and Data Sharing
We consulted and adhered to internet research guidelines from the Association of Internet Researchers [11] and the British Psychological Society (BPS) [12] to inform study development.
All data have been provided in aggregate form to protect the privacy of forum users. As we analysed data in aggregate form, it was not possible to seek individual user consent. However, users were aware that their data were available for anyone to view online by virtue of contributing to publicly available online health forums. Of note, the BPS guidelines advise that "valid consent should be obtained where it cannot be reasonably argued that online data can be considered 'in the public domain' or that undisclosed usage is justified on scientific value grounds". This approach is consistent with similar studies examining healthcare related data from Twitter. [13,14] QMUL is registered as a data controller with the Information Commissioner's Office (ICO; registration number: Z5507327), which covers all research activities undertaken at the university. All data were analysed on QMUL IT facilities, which employ a two-layer security model as per their security policy.
Given licensing and privacy issues, it is not possible to publicly release the aggregate dataset generated from the three online health forums investigated. However, we welcome collaboration with other researchers and healthcare policy makers. Anyone interested in accessing the aggregate data and data analysis code should contact the guarantor (f.smeraldi@qmul.ac.uk).

Patient and public involvement
As the data were analysed in aggregate form it was not possible to involve individual forum users in the design or conduct of the study.

Related posts and active threads
HealthUnlocked was the most frequently used forum, accounting for 97% of overall posts and 97% of posts mentioning COVID-19 in the thread title or post content during the study period (Table 3). For an extended period, most posts about COVID-19 (over 90% at the beginning of the observation period, and remaining above 50% until the week ending 29th March) were written by users who had not yet posted on the topic. By the end of the observation period, the percentage of weekly posts in COVID-19 threads written by new entrants to the discussion had fallen to a still sizeable 30%. While many of these users may have posted before on the forum about other topics, Supplementary Figure 4 presents the proportion of posters whose very first post to a forum appeared in a COVID-19 related thread. This figure peaked above 20% in the week ending 22nd March. Considering that these forums cover a very broad spectrum of health topics, this is a remarkably high fraction. It includes both new joiners and users who were previously silent members of the forums, possibly for a long time (so-called "lurkers"), and who may have been spurred into a more active role by the pandemic.

Thread title analysis
Over a quarter of COVID-19 related thread titles mentioned another condition of interest (Table 4). After cancer and autoimmune diseases, mental health represented a major area of concern for online health forum users posting about COVID-19, comparable to respiratory and circulatory diseases (Table 2). Around 0.5% of thread titles mentioned two or more comorbidities.

User analysis
Posts in threads related to COVID-19 were analysed to determine the number of users contributing in each given week. For each active user, all posts in the previous, same and following calendar weeks were scanned irrespective of thread for mentions of physical symptom, mental health symptom or intensive care keywords. The number of active users mentioning each of these concerns peaked in the week ending 22nd March and subsequently declined but still remained elevated above the January baseline. In particular, users discussing mental health outnumbered users mentioning the other topics (Figure 1).

Discussion
Using a novel technique to analyse data from online health forums, we found a marked increase in posts related to COVID-19 across the observation period of 1st January 2020 to 31st May 2020. The frequency of these posts increased rapidly in early March 2020, corresponding with the World Health Organisation's declaration of COVID-19 as a pandemic.
During this period, we found mental health symptom keywords were most frequently mentioned by authors of COVID-19 related posts (either contextually or in separate messages), followed by physical symptoms and intensive care keywords, suggesting that the pandemic and public health response to it has had a significant impact on posters' concerns regarding mental health. The marked increase in mental health symptom related posts in early March, when the WHO declared the COVID-19 pandemic, correlates with preliminary worldwide data that show increases in anxiety and depression in response to the outbreak.
The mental health impacts of COVID-19 and associated physical distancing restrictions are likely to be extensive and wide-reaching. There is a growing body of evidence supporting the neuropsychiatric effects of coronavirus infections. [15] Restrictions fuel socioeconomic stressors such as unemployment, loneliness and financial burden, which are all implicated in the development of mental ill health. [16] Increased rates of bereavement, newfound caring responsibilities and interruptions to education are likely to be particularly stressful to children and young adults. [17] A preliminary survey of 3,545 German respondents found evidence of substantial mental health burden from travel and physical distancing restrictions, including increased levels of stress, anxiety, depressive symptoms, sleep disturbance and irritability. [18] Worsening mental health has been confirmed in samples with both pre- and post-pandemic information for direct comparison: the Avon Longitudinal Study of Parents and Children (ALSPAC) found that probable anxiety disorder doubled compared to pre-pandemic levels (26% vs 13%), alongside lower wellbeing, particularly in young people, women and those with pre-existing conditions. [19]
The literature on social media mining for COVID-19 mental health related trends is limited. A study analysing the evolution of four emotions (fear, anger, sadness and joy) across Twitter was able to identify developing shared distress and topics of interest relating to those emotions. [20]
Our findings also suggest that mental and physical health concerns documented in online forum posts have levelled off following their peak in March 2020. The number of users active in COVID-19 threads who also wrote posts concerning mental health symptoms fell from a peak of 1,355 per week in March to 253 by the end of the observation period (compared to a mean of 30 per week in January), suggesting that as time went on most users had begun to adjust to the consequences of the pandemic. Other NLP studies have also identified a similar trend. An analysis of 10 million Google searches within the United States found large shifts in mental health symptom searches linked to stay-at-home orders across the week commencing 16th March 2020. [21] Searches for topics related to anxiety, negative thoughts about oneself and the future, insomnia and suicidal ideation dramatically increased prior to stay-at-home orders, levelling off upon the announcement of stay-at-home orders. These patterns were largely specific to searches for mental health related information and were not seen for physical conditions.
Over the entire period, on average 4% of first-time posters (over 20% in the peak period) made their very first contribution to the forum in a COVID-19 related thread. Furthermore, 77% of COVID-19 threads were started by users who had never posted about the topic before, and chose to start out by creating their own thread. A certain degree of motivation is required to take someone to the point of making that first post on a forum, and also for starting a thread; our finding suggests that the pandemic is driving users to engage more actively in community forum services in times of uncertainty.

Strengths and weaknesses
Online health forums are an important source of real-world, real-time, population-level data on people living through the COVID-19 pandemic. Online health forums also afford users anonymity to discuss aspects of their experience they might otherwise have been embarrassed or fearful to disclose in identifiable forms of social media. We have demonstrated that it is possible to automate information extraction from these posts using natural language processing, providing access to a rich reservoir of previously untapped real-world data from health-specific online resources.
Our approach was able to automatically extract data from a large sample of over 53,000 unique users at a fraction of the cost of previous approaches that have relied on social media individual participant recruitment and manual review of posts generating sample sizes in the low hundreds. Our study has some limitations. At present it is difficult to establish whether concerned posters have pre-existing mental or physical health issues, have experienced confirmed COVID-19 illness themselves, are recovered, or have become unwell for the first time. Online health forums are help-seeking communities; this introduces self-selection bias in which individuals from disadvantaged backgrounds who do not have IT equipment/network connection to access online resources are under-represented and our results are therefore not generalisable to the entire population. Furthermore, as these forums have worldwide coverage we cannot isolate trends to one geographic region. However, future work could utilise the location data (see Data pre-processing in Methods) to explore this avenue.

Conclusions and future research
Publicly accessible sources of real-world data, such as online health forums analysed in this study, can strengthen population-level physical and mental health surveillance and provide a rapid and inexpensive means to inform public healthcare policy. We found that the majority of posts in online forum data related to COVID-19 concerned features related to mental health and that the peak in frequency of posts corresponded with the early phase of the pandemic, indicating the significant impact of COVID-19 on the mental health of susceptible populations.
As the pandemic evolves, further research using online forum data could improve our understanding of the long-term consequences of COVID-19 infection [23] and the longer-term socioeconomic consequences of travel and physical distancing restrictions that have been employed in many countries to manage viral transmission. [24,25] Analysis of real-world data, including social media and online health forums, could provide a useful insight into attitudes and perceptions towards novel therapeutics. This will be crucial to maximising uptake of effective preventative approaches such as mask-wearing, physical distancing, hygiene measures and potential vaccines.

Figure 1: Number of users making posts in threads related to COVID-19 which included physical symptoms, mental health symptoms or intensive care keywords

Table footnotes (partially recovered): N = 3,342). b Percentage is calculated as number of posts / total posts (N = 44,894). Note: Percentages do not add up to 100% because some threads contained mentions of more than one comorbidity.

Supplementary Table 1. Key to regular expressions
Examples:
• 'auto[_\s-]?immune' will match all of "autoimmune", "auto-immune" and "auto immune".
• 'psycho[st]i[sc]' will match both "psychotic" and "psychosis".
• 'infarct\w*' will match "infarct", "infarcts", "infarction", "infarctions" and "infarcted".
Matching is case-insensitive. When matching post content or thread titles, keywords have to appear as separate whole words; this requirement is lifted when matching thread URLs. To reduce spurious matches, keywords up to three letters long have to appear in URLs surrounded by underscores (_) or other non-alphanumeric characters in order to be counted as a match.
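
The three example patterns can be checked directly with Python's standard `re` module. The snippet below is only a verification sketch: the dictionary `EXAMPLES` and the helper `all_examples_match` are hypothetical names, and `re.fullmatch` is used so that each listed word must be matched in its entirety.

```python
import re

# Patterns and example words copied from the key above.
EXAMPLES = {
    r"auto[_\s-]?immune": ["autoimmune", "auto-immune", "auto immune"],
    r"psycho[st]i[sc]": ["psychotic", "psychosis"],
    r"infarct\w*": ["infarct", "infarcts", "infarction", "infarctions", "infarcted"],
}


def all_examples_match(examples) -> bool:
    """Return True if every listed word is fully matched by its pattern."""
    return all(
        re.fullmatch(pattern, word, flags=re.IGNORECASE) is not None
        for pattern, words in examples.items()
        for word in words
    )


print(all_examples_match(EXAMPLES))
```

Character classes keep the patterns compact at the cost of some permissiveness: 'psycho[st]i[sc]' would also accept non-words such as "psychosic", which is unlikely to matter in practice.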

Supplementary Figure 4: Proportion of users whose very first post was in a COVID-19 related thread, given weekly