RT Journal Article
SR Electronic
T1 Finding Rare Disease Patients in EHR Databases via Lightly-Supervised Learning
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2020.07.06.20147322
DO 10.1101/2020.07.06.20147322
A1 Rich Colbaugh
A1 Kristin Glass
YR 2020
UL http://medrxiv.org/content/early/2020/07/07/2020.07.06.20147322.abstract
AB There is considerable interest in developing computational models capable of detecting rare disease patients in population-scale databases such as electronic health records (EHRs). Deriving these models is challenging for several reasons, perhaps the most daunting being the limited number of already-diagnosed, ‘labeled’ patients from which to learn. We overcome this obstacle with a novel lightly-supervised algorithm that leverages unlabeled and/or unreliably-labeled patient data – which is typically plentiful – to facilitate model induction. Importantly, we prove the algorithm is safe: adding unlabeled/unreliably-labeled data to the learning procedure produces models which are usually more accurate, and guaranteed never to be less accurate, than models learned from reliably-labeled data alone. The proposed method is shown to substantially outperform state-of-the-art models in patient-finding experiments involving two different rare diseases and a country-scale EHR database.
Additionally, we demonstrate the feasibility of transforming high-performance models generated through light supervision into simpler models which, while still accurate, are readily interpretable by non-experts.
Competing Interest Statement: The authors are shareholders in Volv Global.
Funding Statement: Volv Global provided support for the research and studies reported in the submitted manuscript.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The use of the PHARMO data is controlled by the independent Compliance Committee STIZON/PHARMO Institute. All decisions of the Compliance Committee STIZON/PHARMO Institute are based on the applicable legislation in the Netherlands, e.g. the Personal Data Protection Act and the Medical Treatment Contract Act. Within this legal framework, the Code of Conduct 'Use of Data in Health Research' is an important document for the interpretation of the use of this kind of data for scientific research in the Netherlands, and is approved by the Dutch Data Protection Authority.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes.
The electronic health record data used in this work is protected by Dutch Privacy Regulations and cannot be made publicly available.