RT Journal Article
SR Electronic
T1 Evaluation of Patient-Level Retrieval from Electronic Health Record Data for a Cohort Discovery Task
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 19005280
DO 10.1101/19005280
A1 Steven D. Bedrick
A1 Aaron M. Cohen
A1 Yanshan Wang
A1 Andrew Wen
A1 Sijia Liu
A1 Hongfang Liu
A1 William R. Hersh
YR 2019
UL http://medrxiv.org/content/early/2019/11/12/19005280.abstract
AB Objective: Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well-understood. The objective of this research was to assess patient-level information retrieval (IR) methods using electronic health records (EHR) for different types of cohort definition retrieval. Materials and Methods: We developed a test collection consisting of about 100,000 patient records and 56 test topics that characterized patient cohort requests for various clinical studies. Automated IR tasks using word-based approaches were performed, varying four different parameters for a total of 48 permutations, with performance measured using B-Pref. We subsequently created structured Boolean queries for the 56 topics for performance comparisons. In addition, we performed a more detailed analysis of 10 topics. Results: The best-performing word-based automated query parameter settings achieved a mean B-Pref of 0.167 across all 56 topics. The way a topic was structured (topic representation) had the largest impact on performance. Performance not only varied widely across topics, but there was also a large variance in sensitivity to parameter settings across the topics.
Structured queries generally performed better than automated queries on measures of recall and precision, but were still not able to recall all relevant patients found by the automated queries. Conclusion: While word-based automated methods of cohort retrieval offer an attractive solution to the labor-intensive nature of this task currently used at many medical centers, we generally found suboptimal performance in those approaches, with better performance obtained from structured Boolean queries. Insights gained in this preliminary analysis will help guide future work to develop new methods for patient-level cohort discovery with EHR data. Competing Interest Statement: Steven Chamberlin, Aaron Cohen, and William Hersh have research funding from Alnylam Pharmaceuticals that is unrelated to the work described in this paper. Funding Statement: This work was supported by NIH Grant 1R01LM011934 from the National Library of Medicine. Author Declarations: All relevant ethical guidelines have been followed and any necessary IRB and/or ethics committee approvals have been obtained. Yes. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. Any clinical trials involved have been registered with an ICMJE-approved registry such as ClinicalTrials.gov and the trial ID is included in the manuscript. Not Applicable. I have followed all appropriate research reporting guidelines and uploaded the relevant Equator, ICMJE or other checklist(s) as supplementary files, if applicable. Not Applicable. The data used for this study is protected health information that came from the electronic health record system at Oregon Health & Science University, so cannot be made publicly available.