RT Journal Article
SR Electronic
T1 A clinical specific BERT developed with huge size of Japanese clinical narrative
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2020.07.07.20148585
DO 10.1101/2020.07.07.20148585
A1 Yoshimasa Kawazoe
A1 Daisaku Shibata
A1 Emiko Shinohara
A1 Eiji Aramaki
A1 Kazuhiko Ohe
YR 2020
UL http://medrxiv.org/content/early/2020/07/09/2020.07.07.20148585.abstract
AB Generalized language models that are pre-trained on a large corpus have achieved great performance on natural language tasks. While many pre-trained transformers for English have been published, few models are available for Japanese text, especially in clinical medicine. In this work, we demonstrate the development of a clinical-specific BERT model using a large volume of Japanese clinical narrative text and evaluate it on the NTCIR-13 MedWeb task, which consists of pseudo-Twitter messages about medical concerns annotated with eight labels. Approximately 120 million clinical texts stored at the University of Tokyo Hospital were used as the dataset. A BERT-base model was pre-trained on the entire dataset with a vocabulary of 25,000 tokens. Pre-training was almost saturated at about 4 epochs, and the accuracies of Masked LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT tended to show higher performance on the MedWeb task than the other, non-specific BERTs; however, no significant differences were found. The advantage of training on domain-specific text may become apparent in more complex tasks on actual clinical text, and an evaluation corpus for such tasks needs to be developed.
Competing Interest Statement: Y.K. and E.S. belong to 'Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo', which is an endowment department supported by an unrestricted grant from 'I&H Co., Ltd.'
and 'EM SYSTEMS company', but these sponsors had no control over the interpretation, writing, or publication of this work.
Funding Statement: This project was partly funded by the Japan Science and Technology Agency, Promoting Individual Research to Nurture the Seeds of Future Innovation and Organizing Unique, Innovative Network (JPMJPR1654). There were no other funders. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: All data collection followed a protocol approved by the Institutional Review Board at the University of Tokyo Hospital (2019276NI).
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
1. The clinical text used for pre-training UTH-BERT cannot be shared because it is not publicly available and restrictions apply to its use.
2. The UTH-BERT model is available for non-commercial use at the following URL: https://ai-health.m.u-tokyo.ac.jp/uth-bert
3. The NTCIR-13 dataset is available upon request at the following URL: https://www.nii.ac.jp/dsc/idr/ntcir/ntcir.html