TY - JOUR
T1 - Fifty Ways to Tag your Pubtypes: Multi-Tagger, a Set of Probabilistic Publication Type and Study Design Taggers to Support Biomedical Indexing and Evidence-Based Medicine
JF - medRxiv
DO - 10.1101/2021.07.13.21260468
SP - 2021.07.13.21260468
AU - Aaron M. Cohen
AU - Jodi Schneider
AU - Yuanxi Fu
AU - Marian S. McDonagh
AU - Prerna Das
AU - Arthur W. Holt
AU - Neil R. Smalheiser
Y1 - 2021/01/01
UR - http://medrxiv.org/content/early/2021/07/16/2021.07.13.21260468.abstract
N2 - Objective: Indexing articles according to publication types (PTs) and study designs can be a great aid to filtering literature for information retrieval, especially for evidence syntheses. In this study, 50 automated, machine-learning-based probabilistic PT and study design taggers were built and applied to all articles in PubMed.
Materials and Methods: PubMed article metadata from 1987-2014 were used as training data, with 2015 used for recalibration. For each PT, the set of articles indexed with the corresponding study design MeSH term or PT tag was used as the positive training set, and the rest of the literature from the same time period was used as the negative training set. Multiple features based on each article's title, abstract, and metadata were used in training the models. Taggers were evaluated on PubMed articles from 2016 and 2019. A manual analysis was also performed.
Results: Of the 50 predictive models that we created, 44 achieved an AUC of ∼0.90 or greater, with many having performance above 0.95. Of the clinically related study designs, the best performing was SYSTEMATIC_REVIEW, with an AUC of 0.998; the lowest performing was RANDOM_ALLOCATION, with an AUC of 0.823.
Discussion: This work demonstrates that it is feasible to build a large set of probabilistic publication type and study design taggers with high accuracy and ranking performance.
Automated tagging permits users to identify qualifying articles as soon as they are published, and allows consistent criteria to be applied across different bibliographic databases. Probabilistic predictive scores are more flexible than binary yes/no predictions, since thresholds can be tailored for specific uses such as high-recall literature search, user-adjustable retrieval size, and quality improvement of manually annotated databases.
Conclusion: The PT predictive probability scores for all PubMed articles are freely downloadable at http://arrowsmith.psych.uic.edu/evidence_based_medicine/mt_download.html for incorporation into user tools and workflows. Users can also perform PubMed queries at our Anne O’Tate value-added PubMed search engine http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/AnneOTate.cgi and filter retrieved articles according to both NLM-annotated and model-predicted publication types and study designs.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This work was supported by National Institutes of Health (NIH)/National Library of Medicine grant number R01LM010817.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This is bibliographic publication research, and therefore not human subject research.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes.
The data underlying the Manual Review of Extreme Disagreements will be deposited and made available in the Dryad Digital Repository.
ER -