PT - JOURNAL ARTICLE
AU - Cameron J Fairfield
AU - William A Cambridge
AU - Lydia Cullen
AU - Thomas M Drake
AU - Stephen R Knight
AU - Neil Masson
AU - Nicholas L Mills
AU - Riinu Pius
AU - Catherine A Shaw
AU - Honghan Wu
AU - Stephen J Wigmore
AU - Athina Spiliopoulou
AU - Ewen M Harrison
TI - ToKSA - Tokenized Key Sentence Annotation - a Novel Method for Rapid Approximation of Ground Truth for Natural Language Processing
AID - 10.1101/2021.10.06.21264629
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.10.06.21264629
4099 - http://medrxiv.org/content/early/2021/10/07/2021.10.06.21264629.short
4100 - http://medrxiv.org/content/early/2021/10/07/2021.10.06.21264629.full
AB - Objective: Identifying phenotypes and pathology from free text is an essential task for clinical work and research. Natural language processing (NLP) is a key tool for processing free text at scale. Developing and validating NLP models requires labelled data. Labels are generated through time-consuming and repetitive manual annotation and are hard to obtain for sensitive clinical data. The objective of this paper is to describe a novel approach for annotating radiology reports. Materials and Methods: We implemented tokenized key sentence-specific annotation (ToKSA) for annotating clinical data. We demonstrate ToKSA using 180,050 abdominal ultrasound reports with labels generated for symptom status, gallstone status and cholecystectomy status. Firstly, individual sentences are grouped together into a term-frequency matrix. Annotation of key (i.e. the most frequently occurring) sentences is then used to generate labels for multiple reports simultaneously. We compared ToKSA-derived labels to those generated by annotating full reports. We used ToKSA-derived labels to train a document classifier using convolutional neural networks.
We compared the performance of the classifier to a separate classifier trained on labels based on the full reports. Results: By annotating only 2,000 frequent sentences, we were able to generate labels for symptom status for 70,000 reports (accuracy 98.4%), gallstone status for 85,177 reports (accuracy 99.2%) and cholecystectomy status for 85,177 reports (accuracy 100%). The accuracy of the document classifier trained on ToKSA labels was similar (0.1-1.1% more accurate) to the document classifier trained on full report labels. Conclusion: ToKSA offers an accurate and efficient method for annotating free text clinical data.

Competing Interest Statement: The authors have declared no competing interest.

Funding Statement: This work was funded by a Medical Research Council Clinical Research Training Fellowship awarded to CJF (MR/T008008/1).

Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Ethical approval was granted by Lothian NHS Board South East Scotland Research Ethics Committee 01 (REC reference number 21/SS/0003). All data were deidentified and the need for individual consent was waived.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes

As the data comprise confidential free-text records from electronic healthcare records, they cannot be made publicly available. A script used to analyse the data has been made available through a URL in the manuscript: https://github.com/SurgicalInformatics/ToKSA
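
The sentence-level annotation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' script (see the repository above for that); it is an assumption-laden toy version. The function names (`split_sentences`, `sentence_frequencies`, `propagate_labels`), the naive punctuation-based sentence splitter, the example reports and the rule that conflicting sentence labels yield no report label are all hypothetical choices made for illustration, and a simple counter stands in for the term-frequency matrix mentioned in the abstract.

```python
import re
from collections import Counter


def split_sentences(report):
    """Naive splitter (an assumption): lowercase, split on ., ! or ?, normalise spaces."""
    parts = re.split(r"[.!?]+", report.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def sentence_frequencies(reports):
    """Count how many reports contain each distinct sentence."""
    counts = Counter()
    for report in reports:
        counts.update(set(split_sentences(report)))
    return counts


def propagate_labels(reports, sentence_labels):
    """Give each report the label of the annotated sentence(s) it contains.

    Reports with no annotated sentence, or with conflicting labels, get None
    (a hypothetical rule chosen for this sketch).
    """
    labels = []
    for report in reports:
        found = {sentence_labels[s] for s in split_sentences(report)
                 if s in sentence_labels}
        labels.append(found.pop() if len(found) == 1 else None)
    return labels


# Invented example reports, not taken from the study data.
reports = [
    "Gallstones are present. Normal liver.",
    "No gallstones seen. Normal liver.",
    "Gallstones are present. Kidneys normal.",
    "Previous cholecystectomy.",
]

freq = sentence_frequencies(reports)
# The most frequent sentences are the ones a human would annotate first.
key_sentences = [sentence for sentence, _ in freq.most_common(3)]

# Hypothetical manual annotations of two sentences for gallstone status.
sentence_labels = {
    "gallstones are present": "gallstones",
    "no gallstones seen": "no gallstones",
}
report_labels = propagate_labels(reports, sentence_labels)
```

In this toy example, annotating two sentences labels three of the four reports, which mirrors, at a very small scale, how annotating frequent sentences can label many reports at once.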