TY - JOUR
T1 - SEED: Symptom Extraction from English Social Media Posts using Deep Learning and Transfer Learning
JF - medRxiv
DO - 10.1101/2021.02.09.21251454
SP - 2021.02.09.21251454
AU - Arjun Magge
AU - Davy Weissenbacher
AU - Karen O’Connor
AU - Matthew Scotch
AU - Graciela Gonzalez-Hernandez
Y1 - 2022/01/01
UR - http://medrxiv.org/content/early/2022/03/21/2021.02.09.21251454.abstract
N2 - The increase of social media usage across the globe has fueled efforts in digital epidemiology for mining valuable information such as medication use, adverse drug effects and reports of viral infections that directly and indirectly affect population health. Such specific information can, however, be scarce, hard to find, and mostly expressed in very colloquial language. In this work, we focus on a fundamental problem that enables social media mining for disease monitoring. We present and make available SEED, a natural language processing approach to detect symptom and disease mentions from social media data obtained from platforms such as Twitter and DailyStrength and to normalize them into UMLS terminology. Using multi-corpus training and deep learning models, the tool achieves an overall F1 score of 0.86 and 0.72 on DailyStrength and balanced Twitter datasets, significantly improving over previous approaches on the same datasets. We apply the tool on Twitter posts that report COVID19 symptoms, particularly to quantify whether the SEED system can extract symptoms absent in the training data. 
The study results also draw attention to the potential of multi-corpus training for performance improvements and the need for continuous training on newly obtained data for consistent performance amidst the ever-changing nature of the social media vocabulary. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: The work at University of Pennsylvania was supported by the National Institutes of Health (NIH) National Library of Medicine (NLM) grant R01LM011176 awarded to GG. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The Institutional Review Board (IRB) of the University of Pennsylvania reviewed the studies for which this data was collected and deemed them exempt human subjects research under category (4) of paragraph (b) of the US Code of Federal Regulations Title 45 Section 46.101 for publicly available data sources (45 CFR 46.101(b)(4)). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. 
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Datasets and code are available from the Health Language Processing website: https://healthlanguageprocessing.org/pubs/seed/
ER -