Abstract
Background Significant uncertainty remains about the definition of long COVID, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes because the content is produced at high resolution by patients and caregivers and represents experiences that may be unavailable to most clinicians.
Objective We aim to determine the validity and effectiveness of advanced NLP approaches built to derive insight into long COVID-related patient-reported health outcomes from social media platforms.
Methodology We use Transformer-based BERT models to extract and normalize long COVID Symptoms and Conditions (SyCo) from English posts on Twitter and Reddit. Furthermore, we estimate the occurrence and co-occurrence of SyCo terms, both at a given point and across time and locations. Finally, we compare the extracted health outcomes with human annotations and with widely used clinical outcomes grounded in the medical literature.
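The occurrence and co-occurrence estimation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each post has already been reduced to a list of normalized SyCo terms, and the function name `count_syco` and the toy posts are hypothetical, invented only for the example.

```python
from collections import Counter
from itertools import combinations


def count_syco(posts):
    """Count per-term occurrence and pairwise co-occurrence.

    `posts` is an iterable of term lists, one list per post. Each term
    is counted at most once per post, and each unordered pair of terms
    appearing in the same post is counted once for that post.
    """
    occurrence = Counter()
    cooccurrence = Counter()
    for terms in posts:
        # Deduplicate and sort so every pair has one canonical ordering.
        unique = sorted(set(terms))
        occurrence.update(unique)
        cooccurrence.update(combinations(unique, 2))
    return occurrence, cooccurrence


# Toy input (hypothetical, not data from the study).
example_posts = [
    ["fatigue", "headaches"],
    ["fatigue", "brain fog", "headaches"],
    ["anxiety", "fatigue"],
]
occurrence, cooccurrence = count_syco(example_posts)
```

Grouping the posts by date or location before calling the function would give the per-period or per-region counts; how the study itself performs that grouping is not specified here.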
Results The three most common groups of long COVID symptoms are systemic (such as “fatigue”), neuropsychiatric (such as “anxiety” and “brain fog”), and respiratory (such as “shortness of breath”). Among co-occurring symptoms, the pair “fatigue” and “headaches” is the most common. In addition, we show that other conditions, such as infection, hair loss, and weight loss, as well as mentions of other diseases, such as flu, cancer, or Lyme disease, are among the terms most frequently reported by social media users.
Conclusion The outcome of our social media-derived pipeline is comparable with the outcomes of peer-reviewed articles relevant to long COVID symptoms. Overall, this study provides unique insights into patient-reported health outcomes from long COVID and valuable information about the patient’s journey that can help healthcare providers anticipate future needs.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study was supported by the Vector Institute, its partner companies, and medical subject matter experts. The partner companies include Deloitte, Roche, and TELUS.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Social media posts from Twitter and Reddit were used in this study. We used the Twitter Academic API to extract tweets. During de-identification, we hashed all usernames and removed URLs. To further pseudonymize the data, we transformed special characters in the tweets or Reddit posts to lowercase and extracted contractions.
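The hashing, URL removal, and lowercasing steps described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the study's code: the regular expressions, the `@name` mention convention, the choice of SHA-256, the 12-character truncation, and the names `hash_username` and `deidentify` are all hypothetical. Contraction handling is left out because the source does not detail it.

```python
import hashlib
import re

# Assumed patterns: http(s)/www links and @-style user mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@(\w+)")


def hash_username(name, salt=""):
    """Return a short, deterministic pseudonym for a username."""
    digest = hashlib.sha256((salt + name).encode("utf-8")).hexdigest()
    return digest[:12]


def deidentify(text, salt=""):
    """Remove URLs, replace @mentions with hashes, and lowercase the text."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub(
        lambda m: "@" + hash_username(m.group(1), salt), text
    )
    text = text.lower()
    # Collapse the whitespace left behind by removed URLs.
    return " ".join(text.split())


cleaned = deidentify("Thanks @Alice see https://example.com/x Long COVID")
```

A secret salt makes the pseudonyms harder to reverse by hashing lists of known usernames; whether the study used one is not stated.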
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Our GitHub repository provides the source code for reproducing the data referred to in the manuscript from Twitter and Reddit data.