Predicting cause of death from free-text health summaries: development of an interpretable machine learning tool

Purpose: Accurately assigning cause of death is vital to understanding health outcomes in the population and improving health care provision. Cancer-specific cause of death is a key outcome in clinical trials, but assignment of cause of death from death certification is prone to misattribution and can therefore affect cancer-specific trial mortality outcome measures. Methods: We developed an interpretable machine learning classifier to predict prostate cancer death from free-text summaries of medical history for prostate cancer patients in the Cluster randomised trial of PSA testing for prostate cancer (CAP). We developed visualisations to highlight the predictive elements of the free-text summaries. These were used by the project analysts to gain insight into how the predictions were made. Results: Compared to independent human expert assignment, the classifier showed >90% accuracy in predicting prostate cancer death in a test subset of the CAP dataset. Informal feedback suggested that these visualisations would require adaptation to be useful to clinical experts when assessing the appropriateness of these ML predictions in a clinical setting. Notably, key features used by the classifier to predict prostate cancer death, and emphasised in the visualisations, were considered to be clinically important signs of progressing prostate cancer based on prior knowledge of the dataset. Conclusion: The results suggest that our interpretability approach improves analyst confidence in the tool, and reveal how the approach could be developed to produce a decision-support tool that would be useful to health care reviewers. As such, we have published the code on GitHub to allow others to apply our methodology to their data (https://zenodo.org/badge/latestdoi/294910364).


Introduction
Free-text electronic health records (hereafter, health records) contain information about a patient's medical history. In health care, it is common for human experts to review health records to inform decision making and clinical practice. This review process can be time consuming and prone to error, and there is considerable potential for algorithmic methods to support human decision-making in this context (1). One example of this process is retrospective auditing of health care practice, which often requires trained experts to manually review and assign outcomes or labels to the cases reviewed. The high cost in person-hours can make such studies prohibitively expensive, including clinical trials, which are essential for improving the delivery of patient care.

Prostate cancer is one of the most common cancers diagnosed in the UK, so it is not surprising that it is one of the cancers on which text mining of health records for research has focused (2). Much of this work has concentrated on predicting cancer detection (3), disease progression (4) and optimizing treatment (5). Knowledge of underlying cause of death is another key health outcome, and machine learning techniques have made it possible to extract cancer mortality directly from Medical Certificates of Cause of Death (MCCD) (6). However, these methods have not taken into account the inherent misattribution that exists within the death certificate. For example, prostate cancer deaths can be misattributed to other causes, which results in underestimates of prostate cancer as a cause of death. This is the case when deaths are attributed to complications from investigations or treatment of prostate cancer, rather than the disease itself (e.g. infection from biopsy, post-surgery complications). Deaths from other causes can also be attributed to prostate cancer, which would cause an over-estimate of prostate cancer as cause of death (7).
This was demonstrated in the Cluster randomised trial of PSA testing for prostate cancer (CAP), where an independent committee assigned cause of death, finding that death certification produced false positive prostate cancer deaths 9% of the time. This increased to 23% if the individual had another cancer (not prostate cancer) diagnosed during their lifetime (8). It led to a recommendation that assignment of prostate cancer death, especially as an outcome in trials research, should be confirmed by an independent expert committee (8). In the CAP trial, semi-structured free-text summaries of a patient's medical history from hospital records are created by trained fieldworkers and then reviewed by an independent committee of experts, who assign each case as either prostate cancer related death or not. The CAP dataset thus provides a binary classification task, the identification of prostate cancer death, for which machine learning algorithms can be trained on the annotation of human experts.

There have been significant advances in the field of text mining in recent years, with general purpose deep neural network models such as BERT (9) achieving state-of-the-art performance on multiple natural language processing (NLP) tasks, such as question answering and named entity recognition. Applying such NLP methodologies to clinical text data presents various challenges (10) and perhaps the most significant is the shift in word distributions as compared to the standard corpora on which models are trained. This challenge can be overcome by training on general biomedical corpora (11) and/or by developing task-specific models. For example, the state-of-the-art in the detection of medical concepts, such as ICD-10 codes, is to use task-specific recurrent neural network architectures (12,13) which are pre-trained on a medical corpus such as MIMIC (14).
However, on specific tasks, classical NLP approaches are still able to compete with deep learning methods (15-17).

This work addresses the task of document classification. An extensive review of this topic, in relation to clinical text data, is provided by Mujtaba et al (10). Specifically, we aim to train a binary classifier on human expert annotations in order to identify patients that died from prostate cancer using the CAP health records described above. Although optimal performance in such tasks is likely [...]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. medRxiv preprint, this version posted July 16, 2021; https://doi.org/10.1101/2021.07.15.21260082

[...] users of an ML algorithm must be able to engage with and understand its predictions. Crucially, the user must have the confidence to either accept or reject the predictions based on their clinical expertise and a clear understanding of how the predictions were made.

In this study we developed graphical methods to explain which textual elements contribute to the classification of a health record from CAP by a machine learning algorithm. We also used methods from the FAT Forensics toolbox (22) and the TreeInterpreter Python package (23) to quantify feature contributions. These contributions were then displayed to the user in an interpretable and visually engaging format. The classifiers and interpretability methods were developed for the CAP dataset, using expert committee assignment of prostate cancer-related death to train the classifiers. However, the intention is that these methods will be developed into a decision-support tool that would help human experts in the time-consuming task of classification of health records, both for this specific task and for similar tasks with different data sources.

Materials and Methods
The machine learning classifier was developed using the CAP dataset, following which we investigated a variety of visualization techniques to aid interpretation of the ML predictions. The CAP dataset is outlined below (2.1). All code was written in Python and has been released publicly at https://zenodo.org/badge/latestdoi/294910364 to facilitate re-use by other groups working in this area.

CAP medical history summaries
The CAP trial is a cluster-randomised controlled trial (RCT), running since 2001, that aims to investigate the effectiveness of screening for prostate cancer in the UK population. [...]

[...] Oncology and urology notes, inpatient notes, outpatient notes, radiology notes, multidisciplinary team meeting notes, lab reports, operative reports

Machine learning pipeline
We used a bag-of-words representation of the health records which was produced as follows. The semi-structured text fields were concatenated into a single string and then lemmatized using the WordNetLemmatizer from Python's Natural Language Toolkit (NLTK). The full details of the lemmatization are provided in Algorithm A1 in Online Resource 1. We then extracted features from the lemmatized text using CountVectorizer and TfidfTransformer from scikit-learn. The hyperparameters of CountVectorizer were optimized using GridSearchCV along with those of the classifier algorithm. The classifier was trained on 80% of the dataset with 20% held out for testing. Following optimisation of the hyperparameters with 5-fold cross validation, the best classifier was refitted to the full training data. Full details of the feature extraction and model training are provided in Algorithm A2 in Online Resource 1. During model development we tested three classifier algorithms from scikit-learn. However, the code can easily be adapted to use any off-the-shelf or bespoke classifier. We evaluated classifier performance using a suite of metrics, including accuracy and AUC (area under the curve).
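The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the published implementation: the toy documents stand in for already-lemmatized summaries (the NLTK lemmatization step is omitted), the parameter grid values and the `roc_auc` scoring choice are hypothetical, and the random forest is used simply because it is one of the scikit-learn classifiers the text mentions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy documents standing in for lemmatized summaries; NOT CAP data.
pc_docs = [
    "bone scan show metastasis hormone therapy start psa rise",
    "spine metastasis androgen deprivation psa rise",
]
other_docs = [
    "chest infection lung tumour ascites stomach pain",
    "heart failure admit thorax scan stomach bleed",
]
texts = (pc_docs + other_docs) * 10
labels = ([1] * len(pc_docs) + [0] * len(other_docs)) * 10  # 1 = prostate cancer death

# 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Bag-of-words counts, TF-IDF weighting, then the classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Assumed grid: vectorizer and classifier hyperparameters are tuned together.
param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],  # words only, or words and bigrams
    "clf__n_estimators": [10, 25],
}

# 5-fold cross validation; by default GridSearchCV refits the best
# configuration on the full training data.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

best = search.best_estimator_
proba = best.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, best.predict(X_test))
auc = roc_auc_score(y_test, proba)
print(search.best_params_, accuracy, auc)
```

Wrapping the vectorizer and classifier in one `Pipeline` means the vocabulary is learned only from the training folds during cross validation, which avoids leaking held-out text into the features.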

Interpretability methods
On top of the classification pipeline, we developed interpretable outputs that allow users to engage with the classifier's prediction and which are intended to act as a proof-of-concept for a future decision support tool based on this work. To achieve interpretability, we focused on the communication of feature importance. We intended to convey an intuition for how the classifier works and to allow users to consider whether they agree with individual predictions based on how that prediction was made. In other words, we introduced the key elements of accountability (20) and trust (18) to the classification system. We used word clouds to visually represent the relative importance of the features (words and bigrams) by scaling their size. The colours of the words were used to display the sign of the contribution of the feature towards the class prediction (where this information was available). We extended this visual representation by producing augmented versions of the original free-text summaries that displayed feature contributions in the context of the original language used.

To quantify feature importance, we used four different approaches. Two of these methods are specific to the tree-based classifiers, while two are generally applicable and can be used with any machine learning classifier. The four methods are as follows:

1. Gini importance: calculated as the normalised total reduction in the Gini impurity brought about by splits on a given feature across the ensemble of trees (24). This metric is part of the scikit-learn implementation of the Random Forest classifier and only provides an [...]

[...] theoretic methods that also provides a local measure of feature importance.
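The Gini importance defined in item 1 can be illustrated with a small worked example. The sketch below is a pure-Python illustration of the definition (impurity reduction per split, summed per feature and normalised); it is not how scikit-learn computes the metric internally, and the labels, splits and feature names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Weighted reduction in Gini impurity produced by one split."""
    n = len(parent)
    children = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / n
    return (n / n_total) * (gini_impurity(parent) - children)


# Hypothetical node: three prostate cancer deaths (1) and three other deaths (0).
parent = [1, 1, 1, 0, 0, 0]

# One (feature, left child, right child) entry per split; illustrative only.
splits = [
    ("bone scan", [1, 1, 1], [0, 0, 0]),  # separates the classes perfectly
    ("clinic", [1, 1, 0], [1, 0, 0]),     # barely separates the classes
]

totals = {}
for feature, left, right in splits:
    decrease = impurity_decrease(parent, left, right, n_total=len(parent))
    totals[feature] = totals.get(feature, 0.0) + decrease

# Normalise so that the importances sum to one.
total = sum(totals.values())
importance = {feature: value / total for feature, value in totals.items()}
print(importance)
```

Because the metric only accumulates impurity reductions, it is always non-negative and carries no information about which class a feature favours.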

Results
A random forest classifier proved to be the best performing algorithm for the task of predicting [...] achieved comparable levels of performance (see Table 2 and Figure 1). We attempted recalibration of the random forest classifier output using both isotonic and sigmoid (27) methods (see Figure S1 in Online Resource 1) but both methods made only minor improvements to the probabilities output. As such, the main results presented here are for the uncalibrated classifiers.

We hypothesized that the cases represented in the CAP dataset exhibited different degrees of classification difficulty, corresponding to the method with which the cases had been originally labelled with a cause of death. These different methods are collectively referred to as the "cause of death route" (COD) and are explained in Algorithm A3 in Online Resource 1. Using the COD route, we stratified the dataset into hard and easy cases. We found that the classifier performance was worse for the hard cases than the easy cases (Figure S2 in Online Resource 1) and that the probabilities were better calibrated for the easy cases (Figure S3 in Online Resource 1). For hard cases, the classifiers tended to underestimate the probability of prostate cancer death towards the lower end of the range. This implies the existence of patients who actually died of prostate cancer, but who look to the classifier like they did not. These results suggest that the stratification of cases based on COD route is meaningful and may have significant implications for how a decision support tool could be used and evaluated in the future. However, it should be noted that these groups are imbalanced, with 2340 easy and 270 hard cases appearing in the dataset.
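To make the calibration comparison between easy and hard cases concrete, the following sketch computes a simple reliability table per stratum: predictions are grouped into equal-width probability bins and the mean predicted probability is compared with the observed fraction of prostate cancer deaths. The function name, bin count and all values are hypothetical and chosen only to mimic the pattern described above; they are not CAP results.

```python
def reliability_bins(probabilities, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns (mean predicted probability, observed event fraction, count)
    for every non-empty bin, ordered from low to high probability.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, sum of y, count]
    for p, y in zip(probabilities, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[index][0] += p
        sums[index][1] += y
        sums[index][2] += 1
    return [(sp / n, sy / n, n) for sp, sy, n in sums if n > 0]


# Hypothetical predictions for two strata; illustrative values only.
strata = {
    "easy": ([0.05, 0.10, 0.90, 0.95], [0, 0, 1, 1]),
    "hard": ([0.10, 0.15, 0.20, 0.80], [0, 1, 1, 1]),
}
for name, (probs, outcomes) in strata.items():
    for mean_p, observed, count in reliability_bins(probs, outcomes, n_bins=2):
        print(f"{name}: mean predicted={mean_p:.2f}, observed={observed:.2f}, n={count}")
```

In the hypothetical "hard" stratum, the low-probability bin has an observed event fraction well above its mean predicted probability, which is the kind of underestimation the text reports for hard cases.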
We investigated the effect of authorship and found evidence of clustering based on language style (Figure S4 in Online Resource 1) but this did not significantly affect performance (see section SM1.1 in Online Resource 1).

The random forest was chosen to explore interpretability based on its performance. This choice also allowed us to explore feature importance metrics that are only applicable to tree-based algorithms. We compared feature rankings obtained using the four metrics for feature importance using Spearman's rank correlation coefficient (see Table ST2 in Online Resource 1). The rankings according to SHAP, LIME and Gini importance were all moderately to strongly correlated with the TreeInterpreter rankings (ρ = 0.54-0.65). Here we present interpretability results using the TreeInterpreter metric for feature importance, but the equivalent outputs can easily be produced using the alternative metrics (see for example, Figures S5 and S6 in Online Resource 1).

We produced word clouds to illustrate the most important features that contributed to the classifier predictions. These word clouds were shared with members of the CAP study team, who confirmed that the classifier was using clinically meaningful information to make predictions. Features with large contributions (such as "bone scan", "spine", "hormone", "androgen") tend to be associated with advanced stage prostate cancer. These features can contribute positively or negatively to the classification depending on the frequency of occurrence of the term in the health record. For example, both "hormone" and "bone scan" contribute positively to the classification of prostate cancer death when present in individual cases (see Figure 2(B)) but when averaged across the dataset they are indicative of non-prostate cancer death (see Figure 2(A)).
Therefore, we felt it was necessary to see these feature contributions in the context of the original text, to determine if the classifier is correctly identifying textual elements that indicate prostate cancer death. For this reason, we sought a format that would allow users to engage with the classifier output and could potentially be used for decision support. The solution, arrived at through dialogue with members of the CAP project, was to produce augmented versions of the original health records which we refer to as interpretable vignettes. A partial view of one of the interpretable vignettes is shown in Figure 3 and full examples are provided in Online Resources 2 and 3. The text uses the same formatting as the word clouds to show feature contributions. Here the reader can see the context in which the features appear and can therefore use their judgement to determine if the classifier is using the textual element in a meaningful way towards its prediction (28). A legend is provided so that readers can interpret the relative feature contribution sizes and summary information is provided in the header of the vignette.
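The idea behind the interpretable vignettes can be sketched as a function that re-renders the original text with each word sized by the magnitude of its contribution and coloured by its sign. This is a simplified illustration and not the published implementation: the function name, the HTML output format, the font-size range and the contribution values are all assumptions, only single words are handled (no bigrams), and the colours follow the Figure 2 caption (blue towards prostate cancer death, orange away from it), with grey for words that are not features.

```python
import html


def render_vignette(text, contributions, min_pt=10, max_pt=24):
    """Return HTML in which each word is sized and coloured by its contribution."""
    largest = max((abs(c) for c in contributions.values()), default=0.0) or 1.0
    spans = []
    for token in text.split():
        key = token.lower().strip(".,;:()\"'")  # match features case-insensitively
        c = contributions.get(key)
        if c is None or c == 0:
            colour, size = "grey", min_pt  # word not used by the classifier
        else:
            colour = "blue" if c > 0 else "orange"
            size = min_pt + (max_pt - min_pt) * abs(c) / largest
        spans.append(
            f'<span style="color:{colour};font-size:{size:.0f}pt">'
            f"{html.escape(token)}</span>"
        )
    return " ".join(spans)


# Hypothetical per-word contributions for one case; illustrative values only.
contributions = {"metastases": 0.08, "hormone": 0.03, "lungs": -0.04}
summary = "Bone metastases noted; hormone therapy started. Lungs clear."
vignette_html = render_vignette(summary, contributions)
print(vignette_html)
```

Keeping the original word order and punctuation is the point of the design: the reader judges each highlighted term in its sentence rather than in isolation.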

Discussion
For a machine learning tool to be useful, especially in the medical domain, it is essential for the user to be able to interpret its output. As such, it is common for studies about clinical decision support or prediction modelling to include a discussion of interpretability (29-31). However, as Lipton points out, "interpretability is not a monolithic concept" (32): it includes distinct yet intersecting ideas such as comprehension, transparency and trust. It is also subjective, in that what one user may find interpretable another user may not. Most importantly, what the data scientist may consider to be interpretable is not necessarily of use to a clinical decision maker. In this study we used four standard interpretability metrics to produce measures of feature importance from a machine learning classifier that was trained to identify prostate cancer deaths from medical summaries. We then developed visual representations of these feature importances and presented them to one of the trained CAP

reviewers, who would be the intended users of a future decision support system based on this work. Their feedback is summarised below and used to highlight the strengths and weaknesses of our approach, and to identify directions for future work.

The word clouds of feature contributions were intended to provide a high-level overview of what the classifier learned from the data. The reviewer showed a preference for an alternative presentation of this information, stating that "they may be a reasonable way to display the model weights, but I probably would better understand a sorted list in tabular form with associated weight magnitudes". The augmented vignettes (Online Resources 2 and 3) proved to be more useful. These allowed the reviewer to engage with individual predictions, by highlighting features in the original text, using font colour and size to indicate the direction and magnitude of the contribution to the prediction. Similar visualisations have been used in other ML studies to provide interpretability (28,33-35). Being able to see the feature contributions in the semantic context of the original text enabled the reviewer to determine where the classifier was correctly or incorrectly using features. In general, the reviewer felt that the highlighted feature contributions were consistent with their clinical assessment:

"Many of the colors make sense here (in Online Resource 2). Metastases would be consistent with prostate cancer death…Meanwhile, mentions of "lungs", "ascites", "stomach", and "thorax" in the vignette suggest the patient has some non-prostate-cancer condition that is worthy of attention - those are appropriately yellow."
The features appearing in Online Resources 2 and 3 which were associated with prostate cancer death aligned with the indicators of advancing disease (e.g. bony metastases, hormone treatment) that are used as clinical outcomes in prostate cancer trials (36). However, the feedback made it clear that the visualisations were less interpretable than intended:

"I am confused, again, that 'bone scan' and 'hormone' are blue here (in Online Resource 3) but were each yellow elsewhere (bone scan was yellow for Online Resource 2 and in Figure 2A; hormone was yellow in Figure 2A)."

Here the reviewer is referring to the ability of a feature (e.g. 'bone scan') to contribute positively to a classification of prostate cancer death when present in the text, but to contribute negatively to the classification when it is absent. This is an example of the potential for conflict, referred to by Lipton (32), between what is a transparent and faithful representation of the mechanism of a classifier and what is easily understandable by a human user. In this case, the problem might be overcome either by including some indication of the actual feature value or by providing some training to the user to resolve the apparent inconsistency.

The augmented vignettes let the user see if elements of text are missed or used incorrectly by the classifier, in which case they can exercise caution when considering the prediction. In this way the reviewer determined:

"that some of the descriptions I pay most attention to (the rising PSA values and, to a lesser extent, the high Gleason score) are gray - presumably because the algorithm is ignoring them."

Both PSA and Gleason score are numerical values which are often included in these medical summaries, but which are not captured by our current feature representation.
This feedback suggests that an avenue for improved performance would be to incorporate prior clinical knowledge such as the importance of these two scores. Interestingly, it may also improve trust in the system if users could see that the classifier was making use of the elements of the medical summaries that they consider to be most important. The interpretable vignettes also revealed that classification of prostate

cancer death was problematic when negation appeared in the text. Our bag-of-words feature representation would not be expected to handle negation, so the application of methods to detect negation in clinical text data (37,38) would likely boost performance. Off-the-shelf classifiers achieved good performance on the CAP dataset. For different health record datasets, additional effort may be required to achieve sufficient performance for a decision support tool to be useful. The clustering of the health records based on authorship suggested that methods such as multiple-source cross-validation (39) or domain adaptation (40) could be beneficial in dealing with differences in writing styles within other datasets. Other methods to boost performance would likely be task- or domain-specific and could include the addition of numeric clinical features extracted from structured data (41), or the use of state-of-the-art deep learning methods (12,37). Such methods would be compatible with our model-agnostic approach to interpretability.

The CAP dataset contains a proxy for 'difficulty' of the cause of death assignment. Although we were not able to train a model to reliably predict hard cases, our cause-of-death classifier did show worse performance and calibration on the hard cases than on the easy cases. This suggests that the stratification of cases according to difficulty is meaningful and is likely to have implications for the future development and evaluation of a decision support tool. A systematic investigation of what makes the hard cases more difficult to classify, and which features are most predictive for different types of cases, will help to inform more targeted data acquisition from hospital records.
Named-entity recognition approaches could also be adapted to assist with this information retrieval (12,15). Such knowledge could produce significant cost savings in data collection for CAP and similar projects. In practice, the predictions for hard cases are less trustworthy and one way to address this would be to produce reliable estimates of uncertainty (42). In an applied setting it would be important for the transparency of the system to communicate to users the relative risks of both false positives and false negatives.

The feedback of the CAP reviewer has given us confidence in the feasibility of these methods, and the next stage is to develop them into a usable decision support tool, following a user-centric design process with members of the intended user group (43). Key to this will be to adapt the visualisations to be appropriate for users in a clinical setting. We will need to test our classifiers on new CAP reviews to determine how well they generalise to unseen data. Our bag-of-words approach is limited by the size of the training data. There are 1360 words in the test data set that do not appear in our training data (Figure S7 in Online Resource 1), and the CAP dataset has only limited overlap (Figure S8 in Online Resource 1) with an example biomedical corpus (44). Optimising classifier performance in the future will likely require an adapted pre-trained deep learning model (11). We have identified benchmarking datasets (45) that would allow comparison of different classification approaches to ensure that the best model can be selected. It is clear from our results that the different approaches to quantifying feature importance produce distinct feature rankings. Choosing the best approaches to ensure user trust and system transparency will be achieved using A/B testing across a range of users.
The continued use of model-agnostic explainability methods will allow abstraction of the decision support interface from the underlying classifier and would allow the tool to be usable across a range of different tasks and datasets. For example, we plan to test our approach to interpretable document classification on an intensive care dataset (46) that contains free-text medical notes that are routinely used by hospital staff to audit clinical practice.
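The comparison of feature rankings mentioned in this discussion, reported in the Results as Spearman's rank correlation coefficient, can be sketched in a few lines. The version below is a pure-Python illustration (average ranks for ties, then the Pearson correlation of the ranks); the feature names and importance values are hypothetical and are not taken from the CAP analysis.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical importances of the same four features under two metrics.
features = ["bone scan", "hormone", "spine", "clinic"]
metric_a = [0.08, 0.03, 0.04, 0.01]
metric_b = [0.30, 0.10, 0.05, 0.02]
rho = spearman_rho(metric_a, metric_b)
print(f"Spearman rho across {len(features)} features: {rho:.2f}")
```

Because only ranks are compared, metrics on very different scales (for example, normalised Gini importance versus signed per-case contributions averaged in absolute value) can still be compared directly.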

Conclusion
Algorithmic classification of health records, such as the identification of prostate cancer death in the CAP dataset, could reduce the need for complex medical summaries to be reviewed by an

independent committee. We have demonstrated use of visual methods to explain classifier predictions to human users, which could be deployed in a future decision support tool to reduce the cognitive burden on individual reviewers. Knowledge of the predictive features could also be used to target data extraction from hospitals, reducing the workload and cost required in creating the free-text summaries. We encourage researchers to take a user-centric approach when developing interpretable machine learning tools, to ensure maximum trust and usability in the system.

Figure 2: [...] The size of the word or bigram indicates its relative importance. Blue words or bigrams are indicative of prostate cancer death, while orange is indicative of not prostate cancer death. Feature contributions determined using TreeInterpreter (see main text). (A) Average feature contributions over the CAP test set; (B) a single case that was correctly predicted prostate cancer death by the classifier (shown in [...]

Figure 3: Snapshot of an 'interpretable vignette' that allows users to engage with the prediction that is made by the classifier. This case was correctly predicted to be a prostate cancer death by the classifier (cause of death code = 2). As in Figure 2, the word (or bigram) size indicates the magnitude of the contribution of that feature to the prediction and the colour indicates the sign of the contribution. Here the original format of the vignettes is retained, which is the format in which the decision makers would normally engage with the document. Full interpretable vignette examples are provided in Online Resources 2 and 3.