RT Journal Article SR Electronic T1 Comparative Effectiveness of Knowledge Graphs- and EHR Data-Based Medical Concept Embedding for Phenotyping JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2020.07.14.20151274 DO 10.1101/2020.07.14.20151274 A1 Lee, Junghwan A1 Liu, Cong A1 Kim, Jae Hyun A1 Butler, Alex A1 Shang, Ning A1 Pang, Chao A1 Natarajan, Karthik A1 Ryan, Patrick A1 Ta, Casey A1 Weng, Chunhua YR 2020 UL http://medrxiv.org/content/early/2020/07/17/2020.07.14.20151274.abstract AB Objective: Concept identification is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantic meaning of medical concepts and are thus useful for feature engineering in phenotyping tasks. The objective of this study is to compare the effectiveness of MCEs learned from knowledge graphs and from EHR data for facilitating high-throughput phenotyping. Materials and Methods: We investigated four MCEs learned from different data sources and methods. Knowledge graphs were obtained from the Observational Medical Outcomes Partnership (OMOP) common data model. Medical concept co-occurrence statistics were obtained from Columbia University Irving Medical Center’s (CUIMC) OMOP database. Two embedding methods, node2vec and GloVe, were used to learn embeddings for medical concepts. We used phenotypes with their corresponding concepts generated and validated by the Electronic Medical Records and Genomics (eMERGE) network to evaluate the performance of learned MCEs in identifying phenotype-relevant concepts. Results: Precision@k% and Recall@k% in identifying phenotype-relevant concepts based on a single seed concept and multiple seed concepts were used to evaluate MCEs. 
Recall@500% and Precision@500% based on a single seed concept for the MCE learned using the enriched knowledge graph were 0.64 and 0.13, compared with 0.61 and 0.12 for the MCE learned using the hierarchical knowledge graph, 0.51 and 0.10 for the 5-year windowed EHR, and 0.46 and 0.09 for the visit-windowed EHR. Conclusion: Medical concept embedding enables scalable identification of phenotype-relevant medical concepts, thereby facilitating high-throughput phenotyping. Knowledge graphs constructed from hierarchical relationships among medical concepts yield more effective MCEs, highlighting the need for more sophisticated use of big data to leverage MCEs for phenotyping. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This work was supported by National Library of Medicine grants R01LM009886 and 1R01LM012895-03, National Human Genome Research Institute grant U01HG008680, and National Center for Advancing Translational Science grant 1OT2TR003434-01. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This study received institutional review board approval with a waiver for informed consent. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. 
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. The source code is publicly available at https://github.com/WengLab-InformaticsResearch/mcephe