Abstract
Background Patients with certain diseases are less likely to approach the healthcare system but remain active on social media. Young patients with Social Anxiety Disorder (SAD), in particular, are a hard-to-reach population due to disease symptomatology, unmet need and age-related barriers, which makes it challenging to obtain first-hand access to patient perspectives.
Objective To create a curated cohort of social media users who report being 13 to 25 years old and confirm having a SAD diagnosis or having received therapy for SAD, and to assess the value of the content posted by these users for observational studies of SAD.
Methods We collected 535k posts by 118k Reddit users from the r/SocialAnxiety subreddit. We then developed precise regular expressions to extract age, diagnosis and therapy mentions. We manually annotated the full set of extracted expressions and double-annotated 5% of the age mentions and 10% of the diagnosis and therapy mentions. Using a similar methodology, we identified mentions of comorbidities and substance use.
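As an illustration of the extraction step described in Methods, the following is a minimal sketch of how self-reported age and diagnosis mentions might be matched with regular expressions. The patterns and the names AGE_RE, DIAGNOSIS_RE, extract_age and mentions_diagnosis are hypothetical and written for this example; the expressions actually used in the study are not given in this abstract and are likely more extensive.

```python
import re

# Hypothetical pattern for first-person age statements such as
# "I'm 19 years old", "I am a 17 year old" or "I'm 21yo".
AGE_RE = re.compile(
    r"\bI(?:['’]m| am)\s+(?:an?\s+)?(\d{1,2})\s*(?:-?\s*years?\s*-?\s*old|yo|y/o)\b",
    re.IGNORECASE,
)

# Hypothetical pattern for first-person SAD diagnosis statements.
DIAGNOSIS_RE = re.compile(
    r"\bI\s+(?:was|got|have been|am)\s+(?:recently\s+|officially\s+)?"
    r"diagnosed\s+with\s+(?:social\s+anxiety(?:\s+disorder)?|SAD)\b",
    re.IGNORECASE,
)


def extract_age(text, lo=13, hi=25):
    """Return the first self-reported age within [lo, hi], or None."""
    for match in AGE_RE.finditer(text):
        age = int(match.group(1))
        if lo <= age <= hi:
            return age
    return None


def mentions_diagnosis(text):
    """Return True if the text contains a first-person SAD diagnosis mention."""
    return DIAGNOSIS_RE.search(text) is not None


post = "I'm 19 years old and I was diagnosed with social anxiety disorder last year."
print(extract_age(post), mentions_diagnosis(post))
```

Patterns of this kind still produce false positives (for example, quoted or hypothetical statements), which is why every extracted mention is manually annotated before a user enters the cohort.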
Results Our validated cohort includes 37,073 posts by 1,102 users that meet the inclusion criteria. The age, diagnosis, and therapy mention detection had a precision of 68%, 31%, and 44%, respectively, with an inter-annotator agreement of 0.96, 0.96, and 0.78. Sixty-one percent of the users in the cohort report having one or more comorbidities on top of their SAD diagnosis (Fleiss’s Kappa=0.79) and 13% report a concerning use of drugs or alcohol (Fleiss’s Kappa=0.87). We compared the characteristics of our social media cohort to the published literature on SAD.
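For readers unfamiliar with the metrics reported in Results, the sketch below shows how precision over manually annotated mentions and Fleiss's kappa are conventionally computed. The function names and the toy inputs are assumptions made for this example; they are not the study's data or code.

```python
def precision(labels):
    """Fraction of extracted mentions that annotators marked as correct.

    `labels` is a list of booleans, one per extracted mention.
    """
    return sum(labels) / len(labels)


def fleiss_kappa(counts):
    """Fleiss's kappa for a table of per-item category counts.

    `counts[i][j]` is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators
    (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement for each item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # All assignments in one category: kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy example: two annotators, two categories (correct / incorrect).
toy = [[2, 0], [0, 2], [2, 0], [1, 1]]
print(precision([True, True, False, True]), fleiss_kappa(toy))
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance.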
Conclusions Patients with SAD post actively on Reddit and their perspectives can be captured and studied directly from these data. Extracting age, therapy, substance abuse and comorbidities (and potentially other patient data) can address real-world data source biases. Thus, social media is a valuable source for creating cohorts of hard-to-reach patient populations that may not enter the healthcare system.
Competing Interest Statement
GGH is a consultant to F. Hoffmann-La Roche Ltd (Roche Pharmaceuticals). The authors declare that there is no conflict of interest regarding the publication of this paper.
Funding Statement
GGH and KO were supported in part by the National Library of Medicine (R01LM011176).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The dataset consists of Reddit social media data that is public for anyone and can also be downloaded at Academic Torrents: https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
lucia.schmidt{at}roche.com
karoc{at}pennmedicine.upenn.edu
graciela.gonzalezhernandez{at}csmc.edu
raul.rodriguez-esteban{at}roche.com
Data Availability
The data that support the findings of this study are available upon reasonable request. The Reddit data is public and can be found at Academic Torrents.