TY  - JOUR
T1  - Clinical Knowledge Extraction via Sparse Embedding Regression (KESER) with Multi-Center Large Scale Electronic Health Record Data
JF  - medRxiv
DO  - 10.1101/2021.03.13.21253486
SP  - 2021.03.13.21253486
AU  - Chuan Hong
AU  - Everett Rush
AU  - Molei Liu
AU  - Doudou Zhou
AU  - Jiehuan Sun
AU  - Aaron Sonabend
AU  - Victor M. Castro
AU  - Petra Schubert
AU  - Vidul A. Panickan
AU  - Tianrun Cai
AU  - Lauren Costa
AU  - Zeling He
AU  - Nicholas Link
AU  - Ronald Hauser
AU  - J. Michael Gaziano
AU  - Shawn N. Murphy
AU  - George Ostrouchov
AU  - Yuk-Lam Ho
AU  - Edmon Begoli
AU  - Junwei Lu
AU  - Kelly Cho
AU  - Katherine P. Liao
AU  - Tianxi Cai
AU  - with the VA Million Veteran Program
Y1  - 2021/01/01
UR  - http://medrxiv.org/content/early/2021/03/13/2021.03.13.21253486.abstract
N2  - Objective: The increasing availability of Electronic Health Record (EHR) systems has created enormous potential for translational research. Even with a working knowledge of EHR, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions to establish a cooperative and integrated knowledge network. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease or condition of interest. Method: We constructed large-scale code embeddings for a wide range of codified concepts, including diagnosis codes, medications, procedures, and laboratory tests from EHRs from two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis based on the trained code embeddings. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases.
We also developed an integrated clinical knowledge map combining embedding data from both institutions. Results: The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Additionally, phenotype algorithms developed with features identified automatically via KESER performed comparably to those built upon features selected manually or identified via existing feature selection methods that use patient-level data. The knowledge map created using an integrative analysis identified disease-disease and disease-drug pairs more accurately than those identified using single-institution data. Conclusion: Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among diseases, treatments, procedures, and laboratory measurements. This approach automates the grouping of clinical features, facilitating studies of the condition of interest. KESER bypasses the need for patient-level data in individual analyses, providing a significant advance in enabling multi-center studies using EHR data. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This research used resources of the Knowledge Discovery Infrastructure at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The study protocol was approved by the MGB Human Research Committee (IRB00010756). No patient contact occurred in this study, which relied on secondary use of data, allowing for waiver of informed consent as detailed by 45 CFR 46.116. These activities were approved through the VA Central IRB.
They were supported by the Million Veteran Program, VA Central IRB 10-02, and approved under VA Central IRB protocol 18-38. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Data Availability: The knowledge network is available at https://celehs.hms.harvard.edu/network/
ER  -