An unsupervised and customizable misspelling generator for mining noisy health-related text sources

Abeed Sarker; Graciela Gonzalez-Hernandez

doi:10.1016/j.jbi.2018.11.007

An unsupervised and customizable misspelling generator for mining noisy health-related text sources

J Biomed Inform. 2018 Dec:88:98-107. doi: 10.1016/j.jbi.2018.11.007. Epub 2018 Nov 13.

Authors

Abeed Sarker¹, Graciela Gonzalez-Hernandez²

Affiliations

¹ Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, United States. Electronic address: abeed@pennmedicine.upenn.edu.
² Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, United States.

Abstract

Background: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources.

Materials and methods: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system.

Results: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with best F₁-score of 0.69 and F₁₄-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms demonstrated an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included.

Discussion: Our proposed spelling variant generator has several advantages over past spelling variant generators-(i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.

Conclusion: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.

Keywords: Clinical notes; Data science; Misspelling generation; Natural language processing; Social media; Spelling variant generation; Text mining.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Data Collection / methods
Data Mining / methods*
Electronic Health Records
Fuzzy Logic
Medical Informatics / methods*
Natural Language Processing*
Pattern Recognition, Automated
Reproducibility of Results
Social Media*

Abstract

Publication types

MeSH terms

Grants and funding