RT Journal Article
SR Electronic
T1 De-identifying Spanish medical texts - Named Entity Recognition applied to radiology reports
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2020.04.09.20058958
DO 10.1101/2020.04.09.20058958
A1 Irene Pérez-Díez
A1 Raúl Pérez-Moraga
A1 Adolfo López-Cerdán
A1 Jose-Maria Salinas-Serrano
A1 María de la Iglesia-Vayá
YR 2020
UL http://medrxiv.org/content/early/2020/04/14/2020.04.09.20058958.abstract
AB Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies currently exist for the English language, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts, translatable to other languages. We tested 4 neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. Alongside, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. The strategy proposed, combining named entity recognition tasks with randomization of entities, is suitable for Spanish radiology reports.
It does not require a large training corpus, so it can be easily extended to other languages and medical texts, such as electronic health records.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This article describes work undertaken in the context of the DeepHealth project, “Deep-Learning and HPC to Boost Biomedical Applications for Health” (https://deephealth-project.eu/), which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825111. The contents of this publication reflect only the authors’ view and can in no way be taken to reflect the views of the European Union; the Community is not liable for any use that may be made of the information contained therein.
Author Declarations:
All relevant ethical guidelines have been followed; any necessary IRB and/or ethics committee approvals have been obtained and details of the IRB/oversight body are included in the manuscript. Yes.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes.
The data that support the findings of this study are available from BIMCV, but restrictions apply to the availability of these data under a research use agreement.
Data access can be requested at http://bimcv.cipf.es/. Supplementary information and code are available online at https://github.com/BIMCV-CSUSP/DiSMed.