Learning probabilistic phenotypes from heterogeneous EHR data

Rimma Pivovarov; Adler J Perotte; Edouard Grave; John Angiolillo; Chris H Wiggins; Noémie Elhadad

doi:10.1016/j.jbi.2015.10.001

J Biomed Inform. 2015 Dec:58:156-165. doi: 10.1016/j.jbi.2015.10.001. Epub 2015 Oct 14.

Authors

Rimma Pivovarov¹, Adler J Perotte², Edouard Grave³, John Angiolillo⁴, Chris H Wiggins⁵, Noémie Elhadad⁶

Affiliations

¹ Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA. Electronic address: rip7002@nyp.org.
² Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA. Electronic address: ajp2120@cumc.columbia.edu.
³ Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA. Electronic address: eg2795@cumc.columbia.edu.
⁴ College of Physicians and Surgeons, Columbia University, New York, NY, USA. Electronic address: ja2686@cumc.columbia.edu.
⁵ Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY, USA. Electronic address: chw2@columbia.edu.
⁶ Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA. Electronic address: noemie.elhadad@columbia.edu.

Abstract

We present the Unsupervised Phenome Model (UPhenome), a probabilistic graphical model for large-scale discovery of computational models of disease, or phenotypes. We tackle this challenge through the joint modeling of a large set of diseases and a large set of clinical observations. The observations are drawn directly from heterogeneous patient record data (notes, laboratory tests, medications, and diagnosis codes), and the diseases are modeled in an unsupervised fashion. We apply UPhenome to two qualitatively different mixtures of patients and diseases: records of extremely sick patients in the intensive care unit with constant monitoring, and records of outpatients regularly followed by care providers over multiple years. We demonstrate that the UPhenome model can learn from these different care settings, without any additional adaptation. Our experiments show that (i) the learned phenotypes combine the heterogeneous data types more coherently than baseline LDA-based phenotypes; (ii) each learned phenotype represents a single disease, rather than a mix of diseases, more often than the baseline phenotypes do; and (iii) when applied to unseen patient records, they are correlated with the patients' ground-truth disorders. Code for training, inference, and quantitative evaluation is made available to the research community.

Keywords: Clinical phenotype modeling; Computational disease models; Electronic health record; Medical information systems; Phenotyping; Probabilistic modeling.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Electronic Health Records*
Humans
Learning*
Phenotype
Probability*

Grants and funding

T15 LM007079/LM/NLM NIH HHS/United States