Abstract
Deep neural networks (DNNs) have achieved physician-level accuracy on many imaging-based medical diagnostic tasks, for example classification of retinal images in ophthalmology. However, their decision mechanisms are often considered impenetrable leading to a lack of trust by clinicians and patients. To alleviate this issue, a range of explanation methods have been proposed to expose the inner workings of DNNs leading to their decisions. For imaging-based tasks, this is often achieved via saliency maps. The quality of these maps are typically evaluated via perturbation analysis without experts involved. To facilitate the adoption and success of such automated systems, however, it is crucial to validate saliency maps against clinicians. In this study, we used two different network architectures and developed ensembles of DNNs to detect diabetic retinopathy and neovascular age-related macular degeneration from retinal fundus images and optical coherence tomography scans, respectively. We used a variety of explanation methods and obtained a comprehensive set of saliency maps for explaining the ensemble-based diagnostic decisions. Then, we systematically validated saliency maps against clinicians through two main analyses — a direct comparison of saliency maps with the expert annotations of disease-specific pathologies and perturbation analyses using also expert annotations as saliency maps. We found the choice of DNN architecture and explanation method to significantly influence the quality of saliency maps. Guided Backprop showed consistently good performance across disease scenarios and DNN architectures, suggesting that it provides a suitable starting point for explaining the decisions of DNNs on retinal images.
Introduction
Deep neural networks (DNNs) have become increasingly popular in medical image analysis [40, 20, 75, 61, 21]. Trained on various diagnostic tasks in imaging-based specialties of medicine, they have been shown to achieve physician-level accuracy [28, 19, 15, 30, 7, 79]. However, DNNs are often referred to as black boxes since their decision mechanisms are not transparent enough for clinicians to interpret and trust them [11, 42, 61]. From a practical and ethical point of view, this is one of the major roadblocks in translating cutting-edge machine learning research into meaningful clinical tools [22, 26, 27]. To tackle this challenge, a number of explanation methods have been proposed to expose the inner workings of a DNN underlying its decisions. In the case of image analysis, this is frequently done via saliency maps, where input pixels are associated with saliency scores according to their contribution to network outputs [4, 49]. The efficacy of a saliency map is typically evaluated via perturbation or sensitivity analysis [9, 64, 4, 35, 49, 51], without involving a human in the process. For medical imaging, we thus lack an understanding of how good different explanation methods are in providing saliency maps with clinical relevance.
To fill this gap, we systematically evaluated saliency maps for the decisions of DNNs trained to detect two prevalent eye diseases, diabetic retinopathy (DR) and neovascular age-related macular degeneration (nAMD), with respect to the expert opinions of clinical ophthalmologists. First, we compared saliency maps with disease-specific annotations of pathologies. Second, we performed perturbation analyses and compared the outcome to that obtained when using expert annotations as saliency maps. This allowed us to use perturbation analysis also as a tool to validate DNN explanations against clinicians.
We also introduced two technical novelties. First, we developed a post-processing method for saliency maps to improve the visualization of salient regions and standardize saliency maps for benchmarks. In addition, we computed saliency for Deep Ensembles [38, 23] to obtain saliency maps which were more informed than those obtained from individual networks in isolation.
Methods
DNNs are often trained for diagnostic classification of medical images. To introduce notation, we first review the basics of DNN-based image classification. Then, we describe our datasets and disease detection tasks as well as our methodology including model development and evaluation. Also, we discuss attribution methods for generating saliency maps and introduce our post-processing method within this classification framework.
In medical image analysis, a DNN achieves a diagnostic classification by learning a function that map inputs to outputs: y = fθ(x), where y is a class label (e.g. disease severity or presence/absence of disease) assigned by experts to an input image x and θ represents the DNN’s weights, which are tuned w.r.t. an objective on a finite dataset . The objective is usually to minimize the cross-entropy between labels and predictions, which can be expressed in the following form: , where is a hard label in multinomial (1-hot) representation, pn is a list of predicted class probabilities and k is an index into K classes. A DNN estimates the class probabilities typically via a softmax function in its final layer: , where ωk ⊂ θ represents the weights and bias for the k-th class in the softmax layer, is the feature representation by the network’s penultimate layer, and outputs are multinomial distributions: ∑ k pn,k = 1.
Diseases and Datasets
DR and nAMD are two prevalent and progressive eye diseases [13, 3, 80], both of which can be automatically graded using state-of-the-art DNNs [28, 15, 65, 37, 79].
In the case of DR, we used multiple publicly available collections of fundus images (Fig. 1a): Kaggle DR [33], Asia Pacific Tele-Ophthalmology Society (APTOS) DR [34], Messidor 2 [1, 16, 36], and Indian Diabetic Retinopathy Image Dataset (IDRiD) [59]. These images are graded by medical experts according to the International Clinical Diabetic Retinopathy Severity Scale (Table 1). In addition to the image-level DR grades, 81 of the IDRiD images are annotated at the pixel level with regards to pathologies associated with DR, i.e., microaneurysms, soft exudates, hard exudates and hemorrhages as well as the optic disc [59] (Fig. 1b).
In the case ofnAMD, we used 70 3D optical coherence to-mography (OCT) volume scans from patients at the University Eye Hospital Tübingen collected with Heidelberg Spectralis OCT (Heidelberg Engineering, Heidelberg, Germany). Depending on the settings used during clinical examinations, each volume consisted of 19,25,37,49 or 73 2D slices, namely B-scans (Fig. 1c). In total, there were 3762 B-scans (1751 left eye, 2011 right eye) and each was graded by a retina specialist according to the presence or absence of active nAMD (Table 2), which is characterized by intraretinal or subretinal fluid (Fig. 1d). Furthermore, we selected 73 B-scans from the validation (19) or test (54) sets to be annotated by a board-certified ophthalmologist at the pixel level w.r.t. nAMD activity. We excluded two of the annotated B-scans from our analyses due to the mismatch between their image-level grading and pixel-level annotations carried out by our clinicians (WI and LK, respectively). The use of this data set was permitted by the Institutional Ethics Committee of the University of Tübingen and was performed in line with all relevant laws and regulations.
Diagnostic Tasks, Network Architectures and Model Development
In the DR case, the task for the network was to detect DR from fundus images. Considering Mild DR as the disease onset (Table 1), we grouped the fundus images into healthy and diseased, according to the DR stages: {0} vs. {1,2,3,4}, respectively. For the nAMD case, the task was to recognize the nAMD activity from individual B-scans of the retina (Table 2). For both tasks, we used two well-established DNN architectures, ResNet50 [31] and InceptionV3 [74]. We obtained their implementations from Keras [14], pretrained on ImageNet [63]. We modified and fine-tuned them to our tasks. Also, we used a 2-way softmax encoding for the classification outcome for the sake of compatibility with the saliency methods (see Explanations in the Visual Domain).
DR detection networks
We modified the pretrained ResNet50 and InceptionV3 networks by replacing their dense layers with single layers containing 512 fully connected units followed by Batch Normalization [32] and ReLU [52] as well as a softmax layer with two outputs. We applied L2 and L1 regularizers to convolutional and penultimate layers, respectively. Also, we modified the objective functions to handle the class imbalance in the datasets (Table 1): , where nk is the number of images from class k in a minibatch. Using Stochastic Gradient Descent (SGD) with Nesterov’s Accelerated Gradients (NAG) [53, 73] and a momentum coefficient of 0.9, we trained the networks for 150 epochs on random partitions of all labeled images from Kaggle DR and APTOS DR combined (92,364 images, Table 1). More specifically, we performed 5-fold cross-validation within these images and used 80% of them for training. For each cross-validation run, we followed a stepwise learning rate schedule with rates 0.005, 0.001, 0.0005, 0.0001 after epochs 0, 25, 50, 85 respectively, on top of a decay rate of 0.00001. Also, during the first 10 epochs, only the dense layers were updated and convolutional layers were frozen. For the remaining epochs, all layers were fine-tuned to the task. The model performance was validated after each epoch on the remaining 20% of the images and the best configuration was saved for inference. In this scheme, each DNN instance was evaluated on a disjoint internal validation set. In order to get a better picture of our DNNs’ generalization performance, we finally evaluated them on an external validation set that comprised of both Messidor 2 and IDRiD images (Table 1).
nAMD activity detection networks
We modified the pretrained ResNet50 and InceptionV3 networks by concatenating max pooling to average pooling, adding two dense layers with 1024 and 512 units, which were also followed by Batch Normalization [32] and ReLU activation [52], and using a 2-way softmax classifier. All weight layers except the penultimate one were equipped with L2 regularization. We used L1 regularization to promote sparsity in the penultimate layer. Ultimately, both ResNet50 and InceptionV3 networks achieved classification based on 512 features obtained from their penultimate layers. In this case, we countered the class imbalance (Table 2) with random oversampling. Using SGD with NAG [53, 73], a momentum coefficient of 0.9, initial learning rate of 0.001, a decay rate of 0.0001 and a regularization constant of 0.00001, we trained networks for 100 epochs. During the first 10 epochs, the convolutional stacks were frozen and only the dense layers were trained. For the remaining epochs, all layers were fine-tuned to the task. The best models based on validation accuracy were saved after each epoch and used for inference on the test set.
Data augmentation and image preprocessing
Fundus images were first cropped to center such that the fundus circle touched the image borders. Namely, the longer axis of image height or width was cropped on both sides equally to the same length as the shorter axis. Then, images were resized to 512 × 512. During training, data augmentation was applied to the images. The augmentation pipeline included vertical and horizontal flips, rotation by ±180 degrees (pixels that have no image information due to rotation were set to black pixels), horizontal and vertical translation by ±20 pixels, brightness range ±30% and zoom range −20% -0%. After the first preprocessing and data augmentation, the specific preprocessing functions of ResNet50 or InceptionV3 from the Keras API [14] were applied.
B-scans contained 440 × 512 pixels (Fig. 1c). We performed data augmentation before feeding images to networks during training. The augmentation pipeline included random rotation within ±45 degrees, horizontal and vertical translations within ±30 pixels, brightness adjustments with ±10%, zoom with ±10%, and horizontal and vertical flips. Once images went through the pipeline, theywere locally color-normalized for contrast enhancement with background subtraction via a median filter of size 31. Then, appropriate preprocessing functions from the Keras API [14] were applied.
Overconfidence and calibration of predictive probabilities via Deep Ensembles
DNNs are overconfident about their predictions [29, 76, 44]. To obtain well-calibrated predictions and improve the performance of our networks, we used Deep Ensembles [38]. A Deep Ensemble simply consists of multiple DNNs, each of which is randomly initialized, follows a different optimization trajectory and explores a different mode in function space [38, 23]. Thus, the ensemble, even a small one with 3-5 DNNs, samples diverse and accurate predictors from a function space, exploits their diversity in decision-making and ultimately improves upon the single network performance both in accuracy and calibration [38, 23, 57], also in a DR detection scenario [8]. Using the network architectures, hyperparameters and training procedures described above, we constructed ensembles of 5 DNNs for our diagnostic tasks (Table 3, Fig. 2). In the DR case, we used the DNNs trained during cross-validation. For the nAMD task, we trained 5 DNNs per architecture. All DNNs were diversified by the randomness in the initialization of dense layers, shuffling of training examples as well as data augmentation.
Explanations in the Visual Domain
Saliency maps are frequently used to obtain explanations for a DNN’s decisions. We focused on saliency methods with implicit access to model structure and its internal state. These methods generate saliency maps via forward and backward passes [60, 4, 49, 51, 61]. They typically use backpropagation-based algorithms or relevance propagation rules. As a result, a DNN’s decision is unravelled by attributing its predictive values all the way back to the input domain [4, 49, 51, 61]. In this sense, an attribution is a mapping h from an RGB image x to its raw saliency map through a trained network. In order to compute saliency maps conveniently, we used the open-source library iNNvestigate [2]. We only considered common gradient or relevance-based methods, which included a variety of methods commonly used in ophthalmology and neuroimaging [62, 6, 5, 10, 65, 48, 78].
Gradient-based methods
A saliency map R for an image can be obtained by simply using backpropagation to compute the gradient of the predictive function w.r.t. inputs indexed by i, given a class of interest [68, 49].However, gradients are sensitive to pixel-based variation and yield scattered saliency maps [49, 55]. To reduces this sensitivity, Simple Taylor decomposition [9], which is also called input × gradient, emphasizes an input only if it is present and the network responds to it [49]: . Also, Deep Taylor decomposition (DTD) [50] computes the relevance scores in a layer-wise fashion: , where j is an index into connections, a represents activations and â is a root point used in decomposition. Here, the i-th neuron in a given layer receives relevance scores from its connections to the next layer w.r.t. derivative evaluations at DTD also ensures the positivity of relevance scores at each layer through local decompositions and constraints [50].
SmoothGrad [70] reduces the pixel-sensitivity of gradients by sampling inputs with additive noise and averaging over multiple maps. Its goal is to generate more informed and focused maps. Similarly, Integrated Gradients [72] assumes a baseline (blank) image and follows a path between the baseline and input . The gradients are integrated along the path. In practice, this means an approximation with a number of steps (e.g., 20-300 [72]) between x and . These sampling-based methods induce high computational costs, when large samples are needed for accurate explanations. Despite the cost, we used 256 samples (or steps) for the sake of accuracy.
Apart from using the model structure as is, DeConvNet [82, 81] reverses the network components, e.g., pooling layers, filters and activations, and maps high-level features to inputs. In addition to deconvolution, Guided Backprop [71] resorts to a combination of both forward and backward ReLUs during backpropagation for sharper visualization [55, 62, 10]. However, it is restricted to ReLU networks.
Layer-wise Relevance Propagation (LRP)
LRP [9] also relies on backward propagation but its conservation principle sets it apart from gradient-based methods. Within the LRP framework, each neuron distributes to its predecessors exactly the sum of relevance scores it receives from its successors [50, 49, 51]. As a result, an unnormalized network output (, namely logit) reaches the input layer and disseminates into saliency scores. In this regard, LRP explains the actual predictive outputs, instead of their local variation. It supports both positive and negative relevance, corresponding to the excitation or inhibition characteristics of neurons, respectively [50, 49, 51].
A simple propagation rule is the z-rule (LRP-Z or LRP-0):, where ω ⊂ θ between two layers. LRP-ε introduces an additional hyperparameter ε to suppress the impact of weak or noisy contributions from successors [51]: . We defaulted to ε = 0.05. A general rule is the α β-rule [50, 49, 51]:, where α − β =1 β > 0. + and − denote the excitatory and inhibitory parts. The hyperparameters α and β set the balance between the positive and negative relevance and modulate the behaviour of saliency maps. Thanks to the conservation principle, more sophisticated rules can also be composed of simple ones. For instance, LRP-αβ can modulate the flow of relevance through the convolutional layers, while LRP-ε emphasizes the most salient scores through the dense layers [51]. We considered two such rules designated as LRP-PresetA and LRP-PresetB with α = 1, β = 0 and α = 2, β = 1, respectively [2]. These can also be coupled with a flat rule that assumes uniform weights, i.e., ω = 1, in the very first layer during the propagation of relevance. As a result, the sensitivity to the first layer convolutional filters is reduced and the effect of higher layers is emphasized.
Post-processing and ensembling of saliency maps
Saliency maps essentially highlight regions in images based on which DNNs make their decisions. Thus, we summarized the raw saliency maps (see Eq. 1) into 2D, by aggregating the saliency scores along channels. Then, we dispatched the positive and negative scores back into the red and blue channels, respectively, for visualization of excitatory or inhibitory features (Fig. 3). As the saliency scores exhibited stark differences due to the underlying assumptions and objectives of attribution methods, we mapped the absolute values of aggregate scores into [0, 1] within channels. However, a näive mapping via min-max normalization led to extremely sparse maps, even with ensembling (Fig. 3, last column) and various attribution methods (Fig. 4, top rows in (a) and (b)). We proposed a non-linear transformation to improve the visualization of salient regions. Our procedure is a drop-in replacement for the min-max normalization and its detailed description is given in Appendix A.1. Briefly, the crucial parameter of our method is fν ∈ [0, 1] and it allows us to grow salient regions for better visualization (Fig. 4).
In principle, our method could be used on saliency maps with both excitatory and inhibitory features. However, we focused on the excitatory ones since our evaluation was concerned with the efficacy of saliency maps for explaining the DNN decisions in presence of lesions and their annotations by clinicians. In addition, we leveraged the local sensitivity of gradient-based methods in order to enhance their visualizations of salient regions. Namely, we took the absolute values of raw saliency scores beforehand, which was a handy trick used for Guided Backprop in recent applications [62, 10]. Given the similarities between gradient-based saliency maps and those from LRP-Z and LRP-Epsilon (Fig. 4), we used the same trick for these simple LRP configurations, as well. As other LRP rules were already good at disentangling the excitatory and inhibitory regions, we excluded them from this treatment.
Evaluation of Saliency Maps
We assessed the correspondence between saliency maps and expert annotations via Dice loss [47]: , where R was a saliency map and S the expert annotation. Intuitively, D ∈ [0, 1] is a normalized distance between R and S. When a saliency map perfectly matches the expert annotation, D decreases to 0. Otherwise, it indicates the degree of mismatch. It is also robust to imbalance between the numbers of foreground and background pixels, which is typically severe due to the relative size of annotations in medical images [47]. However, our post-processing influences D. Thus, given a triplet of disease scenario, DNN architecture and attribution method (Fig. 6a,b,d, and e), we searched for the optimal fν among 20 values spaced evenly within [0.0005, 1] on a log scale with a geometric1 progression. Our criterion was based on the overall (dis)agreement between saliency maps and expert annotations. The optimal values can be found in Table 4 in Appendix A.2. We also show examples of optimally processed saliency maps in Fig. 8 and Fig. 9.
We also performed perturbation analyses [9, 64, 35] and compared the perturbation trajectory ofsaliency maps to those of clinicians in order to obtain an alternative perspective on the clinical relevance of saliency maps. Our perturbation scheme involved a two-dimensional grid specified over a given image and we regarded each cell as a patch to be perturbed (Fig. 5). Then, given a saliency map, we ranked the patches based on the total patch saliency, perturbed the top-ranked patches with uniform random noise and measured the drop in the ensemble output for the class of interest, diseased. We followed the ranking and repeated the measurement until there was no more patch to perturb. In addition, we used random maps to facilitate random perturbation as our baseline. As expected, a saliency-based ranking led to faster decline than a random selection of patches, since the saliency map indicated the informative regions in an image more accurately than chance. Analogously, we used the rate of drop as a performance metric for saliency maps. However, when the total perturbation grew and disease evidence was lost, all methods converged to random (Fig. 7a and Fig. 7b). After all, we treated also expert annotations as saliency maps within this perturbation-based framework. This allowed us to validate saliency maps against clinicians by monitoring DNNs’ sensitivity to the removal of salient information determined by explanation methods as well as the clinicians themselves.
For fundus images, which were accompanied with widespread annotations, we used the settings described in Fig. 5 but we perturbed 4 patches per step. Thus, a fundus image was fully perturbed in 16 steps (Fig. 7a and Fig. 7b). Considering the local annotations of retinal fluid on B-scans, we increased the granularity of perturbations in order to precisely monitor the changes in the DNN outputs for nAMD activity. We used patches of 4 × 4 on a grid of 110 × 128 and perturbed 4 patches per step. To sidestep the formidable computation required to run the full-fledged analyses for this task, we stopped early after the 880th step out of 3520 (Fig. 7d and Fig. 7e). After all, we plotted the average relative differences in the ensemble outputs for being diseased against the steps (Fig. 7), by subtracting the drop observed via a random perturbation from those of ranked perturbations. As a performance metric, we used the values induced by attribution methods at steps 10 and 200 for the DR and nAMD scenarios, respectively.
Results
We developed DNNs to detect DR and active nAMD from retinal fundus images (Fig. 1a) and slices of OCT volume scans (Fig. 1c), respectively. For each disease, we used two well-known network architectures: ResNet50 [31] and InceptionV3 [74]. Then, we constructed two Deep Ensembles [38, 23] for each diagnostic task, which each consisted of five DNNs from a given architecture, trained with different random initializations and data augmentation. Thus, we used 20 DNNs in this study. While individual DNNs were accurate for their respective tasks, their ensembles further improved upon single network performance in both disease scenarios and across network architectures (Table 3 and Fig. 2). We also assessed the calibration of our ensembles via reliability diagrams and the Adaptive Expected Calibration Error (AECE) [17] and found them to be well calibrated (Fig. 2).
Interestingly, the diversity of DNNs in decision-making showed clearly in saliency maps. For example, the first two DR detection networks paid more attention to the hemorrhages, microaneurysm (indicated by a dotted arrow) and soft exudates (bottom right, Fig. 3a), while the soft exudate was completely unattended by the last three DNN instances. The fifth one also ignored hemorrhages and detected only microaneurysm in this area. In addition to the annotated lesions, the DNNs also detected two hemorrhages (indicated by solid arrows) at the bottom left (for more examples, see the first two rows of Fig. 8 in Appendix A.2). Similarly, the nAMD activity detection networks used the presence of intraretinal or subretinal fluid as revealed by saliency maps (Fig. 3b). However, the first DNN did not pay much attention to the subretinal fluid, while the fifth one highlighted it along with additional intraretinal cues. Despite the differences, DNNs also agreed on the saliency of the top end of the large intraretinal lesion. After all, the ensembles of DNNs led to well-informed and comprehensive saliency maps, thanks to the aggregation of different views from individual DNNs (Fig. 3). However, even the ensemble-based saliency maps were not immediately amenable to human interpretation, as they were extremely sparse (Fig. 4, top rows in (a) and (b)). We used a custom-developed post-processing method (see Methods and Appendix A.1) to improve the visualization of salient regions (Fig. 4). It also normalized the saliency scores that varied wildly due to the differences between attribution methods and network architectures.
We used such enhanced ensemble-based saliency maps to systematically evaluate the clinical relevance of DNNs with a focus on explainability. We first compared the saliency maps with expert annotations (Fig. 6), which were presented as segmentation maps (Fig. 1b and Fig. 1d), and assessed their (dis)similarities directly via Dice loss [47]. To exclude potentially misleading saliency maps due to misclassification from the analysis, we considered only the images that were correctly classified by all members of respective ensembles. Interestingly, all annotated fundus images from the IDRiD collection were correctly classified by all DR detection networks. This is likely due to the severity and spread of lesions in these images. For nAMD activity detection, DNNs with ResNet50 and InceptionV3 architectures classified 62 and 55 B-scans (out of 71) correctly, respectively. In order to obtain balanced groups for our analysis, we considered the intersection of these two sets containing 52 B-scans.
We used the optimally post-processed saliency maps for each combination of disease scenario, DNN architecture and attribution method (see Methods) and asked whether the match of the saliency maps to the clinical annotation was significantly influenced by DNN architecture or the attribution method (2-way repeated measures ANOVA, see Appendix A.4 for details). In the DR detection task, DNN architecture (F(1,80) = 41.340, p = 8.6 × 10−9) and attribution method (F(13,1040) = 43.764, p = 3.0 × 10−89) as well as their interaction (F(13,1040) = 106.684, p = 6.2 × 10−181) had a significant influence. We obtained similar results for the nAMD activity detection task (F(1,51) = 65.573, p = 1.0 × 10−10 and F(13,663) = 29.354, p = 3.8 × 10−57 for the main effects and F(13,663) = 44.823, p = 6.6 × 10−82 for their interaction). Using post-hoc testing, we found significant pairwise differences between the mean Dice loss of different attribution methods: For DR detection (Fig. 6a and b), Guided Backprop and SmoothGrad were competitive with each other and significantly outperformed all other methods (Fig. 6c). Guided Backprop also performed well in the nAMD activity detection task (Fig. 6d and e). It outperformed seven methods including SmoothGrad (Fig. 6f). However, five LRP configurations along with Deep Taylor were as good as Guided Backprop on average in this task. After all, DeConvNet yielded the worst saliency maps in terms of the match to clinical annotations.
We next studied which kind of lesions were most strongly highlighted in saliency maps, indicating that they play a key role in the diagnostic decisions of DNNs. For DR, we found that DNNs relied more on small lesions, such as microaneurysms (green) and hard exudates (dark blue), but they typically captured them incompletely (Fig. 8 in Appendix A.2). In contrast, large instances of soft exudates (cyan) and hemorrhages (magenta) were less taken into account by the DNNs. Even when such large lesions were attended by DNNs, they were only partially covered in saliency maps. As a result, the Dice loss for individual lesion types was larger on average for soft exudates than hard exudates, for example, but that strongly differed between methods (Fig. 10e-h in Appendix A.3). Likewise, substantially large hemorrhages were almost completely ignored by DNNs (Fig. 8, 4th row). Also, different saliency methods highlighted different lesions or anatomical structures in the retina, even for the same network architecture (Fig. 8). For instance, Guided Backprop almost always pointed at DR lesions, whereas SmoothGrad often focused on vessels (in and out of the optic disc) and captured fewer lesions. While Guided Backprop’s top preferences were microaneurysms and hemorrhages (Fig. 10a-d), hard and soft exudates as well as the optic disc were typical formations highlighted by SmoothGrad (Fig. 10e-j). Integrated Gradients also behaved similar to Guided Backprop but it performed worse than the two overall. Finally, we observed that our post-processing method emphasized not only the lesions themselves but also their surroundings. In particular, tiny lesions such as microaneurysms and hard exudates were subject to overgrowing in saliency maps, since we tuned fv with respect to Dice loss on the complete set of annotations, including those for large lesions. As a result, the average Dice loss values for microaneurysms and hard exudates were increased on top of these lesions being captured incompletely in the first place (Fig. 10a,b,e and f in Appendix A.3). This combined with the errors made on different parts reduced the overall gap between Guided Backprop and Smooth-Grad (Fig. 6a-c), even though their saliency maps looked quite different. On the bright side, our method was effective at detecting tiny relevance scores in the vicinity of DR lesions and bringing them up to human attention. In the nAMD activity detection task, small retinal fluid were the go-to pathology for DNNs (Fig. 9 in Appendix A.2). However, the large ones were not ignored, either. DNNs typically responded to the boundaries of large retinal fluid and saliency maps showed a cavity in the interior (Fig. 9, last three rows). Thus, the Dice loss for intraretinal fluid was larger than for subretinal fluid on average (Fig. 11 in Appendix A.3), since the former was usually larger in size than the latter. Interestingly, saliency methods were more consistent about their preferences for salient regions in this case. We attribute this to the small variety of pathologies. However, in addition to retinal fluid, DNNs used features from the fovea to discern nAMD activity (Fig. 9), even though it was not annotated by experts as key for the task. On the other hand, the effect of our post-processing was again apparent in saliency maps (Fig. 9). The retinal fluid and their surroundings were highlighted together and the Dice loss for small subretinal fluid was high on average (Fig. 11).
Next, we used perturbation analysis to validate the optimal saliency maps with respect to expert annotations. To this end, we used the expert annotations of clinically relevant lesions also as saliency maps. We performed 2-way repeated measures ANOVA based on the average differences between the ensemble outputs induced by ranked and random perturbations using the aforementioned design. In the DR detection task, we found that DNN architecture did not significantly influence our measure (F(1,80) = 1.901, p = × 10−1), whereas the choice of attribution method had a significant effect (F(15,1200) = 113.691, p = 7.8 × 10−218) as had interaction of these two factors (F(15,1200) = 5.466, p = 7.8 × 10−11). The effects followed a similar trend in the nAMD activity detection task (main effects: F(1,51) = 0.189, p = 6.7 × 10−1; F(15,765) = 116.869, p = 4.2 × 10−186); interaction: F(15,765) = 6.004, p = 5.8 × 10−12). Using post-hoc testing, we again found significant pairwise differences between the means of attribution methods. In the DR detection task (Fig. 7a and Fig. 7b), Guided Backprop was the best method on average, competitive with seven methods, including the expert annotation, and significantly outperforming eight methods (Fig. 7c). Also, the expert annotation performed not significantly different than a number of saliency methods and better than SmoothGrad and DeConvNet on average. In the nAMD activity detection task (Fig. 7d and Fig. 7e), saliency methods and expert annotation closely followed in the early stages of perturbations. However, the expert curves quickly stabilized into almost flat lines. The flat lines indicated that the perturbation order essentially followed random selection of patches once the most important pathologies annotated by clinicians were removed. Perturbations with respect to saliency maps led to further reduction beyond the expert curves, indicating the use of additional features by DNNs. After all, Integrated Gradients outperformed five methods, one of which was the expert annotation (Fig. 7f). Guided Backprop, Deep Taylor, Input × Gradient, SmoothGrad and six LRP configurations were as good as Integrated Gradients on average. Surprisingly, DeConvNet achieved a better performance in comparison with the earlier scenarios.
Our two analyses — direct comparisons of lesions using Dice loss and perturbation analysis — provided complementary information about the factors influencing the quality of saliency maps: The first analysis indicated that the DNN architecture can be a role for explainability, interacting with the attribution method. Across tasks and network architectures, Guided Backprop emerged as the most useful method for generating clinically relevant saliency maps (Fig. 6). Also, the methods, e.g., Guided Backprop and SmoothGrad, differed in their preferences for salient lesions and anatomical structures in the retina, even for a given architecture. For the perturbation analysis, we did not find an effect of DNN architecture and we observed similarities between the perturbation trajectories of many saliency methods and expert annotations (Fig. 7). The use of large patches combined with the spread and severity of DR lesions probably suppressed the differences between DNNs and clinicians in DR detection (Fig. 7a-b). But, in the nAMD scenario, the trajectories of saliency methods and expert annotation diverged after an initial period of collective descent (Fig. 7d-e). Interestingly, the curves based on the saliency methods continued to descent past the expert curves, suggesting that a few key instances of retinal fluid were mostly enough for a clinician to make a diagnosis, while DNNs also used fovea characteristics for detecting nAMD activity.
Discussion
DR and nAMD are two progressive eye diseases and major causes of blindness in the developed world [13, 3, 80]. Timely intervention is the key to avoiding them or preventing vision loss in both cases. Thus, clinicians need cost-effective, accurate and trustworthy solutions to support the early diagnosis at scale [80, 28, 39, 15, 75]. Here, we developed accurate and well-calibrated ensembles of DNNs to detect DR and nAMD from retinal fundus images and slices of 3D OCT volume scans, respectively, and evaluated a comprehensive set of saliency maps for explaining the ensemble-based diagnostic decisions using a variety of published methods.
Interestingly, even the ensemble-based saliency maps were not readily interpretable by humans due to their sparsity. To improve the visualization of salient regions, we introduced a new post-processing method. Then, we systematically validated saliency maps against clinicians through two main analysis routes, including (1) a direct comparison of saliency maps with expert annotations of disease-specific pathologies and (2) perturbation analyses using also expert annotations as saliency maps. We found that the choice of DNN architecture and explanation method significantly influenced the quality of saliency maps. Moreover, DNNs used features both inside and outside the regions-of-interest (ROIs) annotated by clinicians. In particular, DNNs found additional instances of DR lesions that had not been explicitly annotated by clinicians. This could be because the heavily diseased images in the IDRiD dataset had not been completely annotated. In the nAMD case, extra cues were found in the fovea, which was never annotated by ophthalmologists in our study, as they only focused on signs of AMD activity they would typically use for diagnosis.
Saliency map generation to explain a classifier’s decision is superficially related to another popular task called semantic segmentation. However, segmentation is a causal task, while classification is anti-causal [12]. Also, DNNs are opportunistic classifiers in the sense that they exploit statistical regularities and image features to reach their objectives [24, 25]. Therefore, saliency maps for explaining the decisions of DNNs trained to achieve classification may differ from the segmentation maps typically used to train DNNs for segmentation in the first place. However, we gained insights into the diagnostic decisions of DNNs through the comparisons of saliency maps with expert annotations presented as segmentation maps. For instance, our DR detection networks mostly used a subset of small but sufficiently informative lesions, such as microaneurysms and hard exudates as well as small instances of hemorrhages. They also exploited soft exudates and large hemorrhages, albeit less frequently and only partially. Overall, they used efficient decision rules [25] mostly based on the characteristics of Mild and Moderate DR, as the task was to detect only the presence of DR. The opportunistic nature of DNNs also showed in nAMD activity detection. For instance, they detected large retinal fluid simply by its boundaries. Also, they exploited the fovea along with retinal fluid. Given that retinal fluid caused changes in the foveal contour during nAMD [46, 67], DNNs probably associated these changes with disease activity. Even though such associationist characteristics would not lead to causal explanations in principle [58], saliency maps showed that the DNN decisions were medically plausible. In this respect, DNNs, provided that they are also coupled with well-calibrated uncertainty estimation [8], can be deployed to facilitate the cooperation of clinicians and algorithms in the form of assisted reading [65].
In addition, our analyses indicated key practical limitations of the saliency methods in question. First, DR lesions such as microaneurysms and hard exudates as well as small bodies of retinal fluid in the case of nAMD indicate early-onset cases. As DNNs exploit retinal images opportunistically and the resulting saliency maps may include sparse regions even after our post-processing, the pitfall is that such minuscule but critical pathologies can be overlooked while screening for timely intervention. To alleviate this, alternative saliency methods designed for coarse maps can be used. Grad-CAM [66] and its combinations with Guided Backprop, or saliency bounding boxes [41] are good candidates to that end. Another important factor, which is somewhat neglected in our study, is the inter-grader variability in human readings of medical images. The inter-grader variability is high [18, 36], especially in segmentation tasks due to technical challenges and anatomical variability across patients [45]. Clinician performance is also subject to internal biases and experience levels. Thus, a more refined assessment of saliency maps could be achieved through multiple readers, also by estimating the ground truth segmentation from their annotations [83].
The decision mechanisms of DNNs and clinicians have also been recently compared via a perturbation-based reader study in the context of breast cancer screening [43]. The study included two groups of patients with either microcalcifications or soft tissue lesions, and indicated the bias of DNNs towards high-frequency features in both groups. While sharp and local peaks in mammogram images were salient features of microcalcifications, DNNs recognized soft tissue lesions typically from their boundaries without focusing on interiors. This is in line with our finding that the networks for DR and active nAMD detection used rather microaneurysms and the boundaries of large retinal fluid in the eye to make decisions. Also in line with our results, cancer screening networks found additional information outside the ROIs determined by radiologists [43].
In another recent study [69], an instance of InceptionV3 [74] was trained to predict the presence of choroidal neovascularization (CNV), diabetic macular edema (DME) or drusen from OCT images. Then, three experts graded saliency maps for its decisions on a scale between 0 and 5 according to their clinical relevance. In total, 13 saliency methods were used (9 of which are also used by our study). According to the subjective expert rating, Deep Taylor decomposition [50] and Guided Backprop [71] produced the most relevant saliency maps. Deep Taylor decomposition provided slightly better visualizations than Guided Backprop “due to clinically coherent explanations, better coverage of pathology, and lack of high-frequency noise” [69, p.7]. Thus, their study provides further evidence that Guided Backprop is a useful technique for obtaining clinically relevant saliency maps, especially considering that they did not use any special post-processing of the saliency maps for Guided Backprop (see Methods), which could have improved its saliency maps. Deep Taylor decomposition, however, performed less well in our study, hinting at a disagreement between their rating-based evaluation and our segmentation-based evaluation.
In contrast to this evidence by us and others [5, 69] in favor of Guided Backprob in a clinical setting, Guided Backprob has been shown to be insensitive to the object classes in ImageNet [63, 55]. This likely happens because the algorithm exploits local connections in convolutional layers, which extract a series of hierarchical feature representations from a given image, and the final dense layers, where class label assignments are made, have less impact on saliency maps [55]. Nevertheleess, as Guided Backprop was consistently among the best methods for generating saliency maps to explain the decisions of DNNs trained to detect retinal diseases in our and other studies, we believe that it should be further studied to understand its distinct behaviors when explaining DNN decisions on natural or medical images. Moreover, its restriction by design to ReLU networks (see Methods) should be relaxed to extend its applicability to new architectures beyond ReLU-based designs.
Conclusion
We studied the clinical relevance of saliency maps extracted from DNNs trained to detect DR and nAMD from retinal images. We used different network architectures, well-calibrated ensembles of DNNs and a variety of explanation methods to obtain a comprehensive set of saliency maps for explaining the ensemble-based diagnostic decisions. Then, we validated the saliency maps against ophthalmologist’s expert annotations. Overall, Guided Backprop emerged as the method of choice for generating saliency maps to explain the diagnostic decisions of DNNs on retinal images. In addition, a combination of multiple methods may reveal complementary characteristics in order to obtain well-rounded explanations.
Data Availability
Retinal fundus images used in this study were obtained from publicly available repositories indicated in the manuscript. The optical coherence tomography scans were obtained from the University Eye Clinic and their use was permitted by the Institutional Ethics Committee of the University of Tuebingen.
Author Contributions Statement
MSA and PB designed research; LBK devised the method for saliency map processing; MSA and LBK performed research, the AMD and DR experiments, respectively; GA gathered the OCT volumes and graded B-scans; WI also graded B-scans; LK annotated B-scans and provided medical advice together with GA, WI and FZ; MSA and PB supervised research; MSA, PB and LBK wrote the paper with input from all authors.
Acknowledgements
This research was supported by the German Ministry of Science and Education (BMBF, 01GQ1601 and 01IS18039A) and the German Science Foundation (BE5601/4-1 and EXC 2064, project number 390727645). Additional funding was provided by Novartis AG through a research grant. The funders did not have any influence in the study planning and design. The Messidor 2 collection [16] was kindly provided by the Messidor program partners. More information can be found at http://www.adcis.net/en/third-party/messidor/.
Appendix
A.1 Details of Post-processing
Given a 2D map for excitatory or inhibitory features, we rescaled its values w.r.t. the maximum possible sum of scores the map could have had after processing, i.e., D = H × W, : Then, we achieved a non-linear transformation by thresholding and another rescaling: We determined the threshold τ by solving the following problem: where ν was our target for total relevance and ν(τ) was a monotonically decreasing and implicit function of (see Appendix A.1.1) with upper and lower bounds: limτ →0 ν(τ) ≤D and , respectively. We performed a binary search to find a suitable τ. We also introduced a hyperparameter fν so that ν was easily adjusted: ν = fν D, where fν ∈ [0, 1] was the fraction of D. Intuitively, fν was a growth factor for salient regions. Thus, larger fν, larger the salient region (Fig. 4). However, the size of salient regions also depended on disease status and total class evidence carried over to logits. To update our initial choice in the light of evidence, we introduced a scaling parameter: where the evidence for class k, given an input image xn, was rescaled into [0, 1] w.r.t minimum and maximum evidence over all images and across classes. Then, ν = γfν D, which allowed for fine-tuning the ratios of salient regions with disease patterns and regions without over the image size. If the search interval was somehow violated after these adjustments, then we set as a precaution. Also, in order to avoid τ = 0, we heuristically set the minimum possible τ to , where d = 0.00001.
Appendix
A.1.1 Properties of ν(τ)
ν(τ) is a monotonically decreasing and implicit function, where .
Let τ1 ≤ τ2, then Given τ1 ≤ τ2, A.1.1 holds true in the following cases: Then, A.1.1 ⇒ A.1.2 ⇔ A.1.3.
Appendix
A.2 Optimum fν values for attribution methods and exemplary saliency maps
Appendix
A.3 Evaluation of saliency maps w.r.t. lesion types and their annotations
DR detection
AMD activity detection
Appendix
A.4 ANOVA and Post-hoc Tests
Direct comparison of saliency maps with expert annotations, DR detection
Direct comparison of saliency maps with expert annotations, AMD activity detection
Perturbation analysis, DR detection
Perturbation analysis, AMD activity detection
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵