Soft-Attention Improves Skin Cancer Classification Performance

In clinical applications, neural networks must focus on and highlight the most important parts of an input image. The Soft-Attention mechanism enables a neural network to achieve this goal. This paper investigates the effectiveness of Soft-Attention in deep neural architectures. The central aim of Soft-Attention is to boost the value of important features and suppress noise-inducing features. We compare the performance of the VGG, ResNet, Inception ResNet v2 and DenseNet architectures with and without the Soft-Attention mechanism on skin lesion classification. The original network, when coupled with Soft-Attention, outperforms the baseline [16] by 4.7% while achieving a precision of 93.7% on the HAM10000 dataset [25]. Additionally, Soft-Attention coupling improves the sensitivity score by 3.8% compared to the baseline [31] and achieves 91.6% on the ISIC-2017 dataset [2]. The code is publicly available on GitHub.


Introduction
Skin cancer is the most common cancer and one of the leading causes of death worldwide. Every day, more than 9500 people in the United States are diagnosed with skin cancer, with 3.6 million people diagnosed with basal cell skin cancer each year. Early diagnosis of the illness has a significant effect on patients' survival rates. As a result, detecting and classifying skin cancer is important.
It is difficult to distinguish between malignant and benign skin diseases because they look so similar. Although a dermatologist's visual examination is the first step in detecting and diagnosing a suspicious skin lesion, it is usually followed by dermoscopy imaging for further analysis [32]. Dermoscopy images provide a high-resolution magnified image of the infected skin region, but they are not without drawbacks. Because the images are large, it becomes difficult for feature extractors to extract the features relevant for classification. Various methods, such as segmentation and detection, transfer learning, and Generative Adversarial Networks, have been used to detect and classify skin cancer. Despite significant progress, skin cancer classification is still a difficult task. This is due to the lack of annotated data and low inter-class variation. Furthermore, the task is complicated by variations in the contrast, color, shape, and size of the skin lesion, as well as the presence of various artifacts such as hair and veins.
Inspired by the work done in [18], this paper studies the effect of the soft attention mechanism in deep neural networks. Deep learning architectures identify the image class by learning the salient features and nonlinear interactions. The soft-attention mechanism improves performance by focusing primarily on relevant areas of the input. Moreover, the soft-attention mechanism makes the image classification process transparent to medical personnel, as it maps the parts of the input that the network uses to classify the image, thereby increasing trust in the classification model.

Related Work
Following Krizhevsky [12], large-scale image classification tasks using deep convolutional neural networks have become common. As reported in [3], skin cancer classification from images has improved rapidly since the adoption of deep neural networks. To make further progress, we suggest that soft attention be used to identify fine-grained variability in the visual features of skin lesions.
Earlier work in skin cancer classification used streamlined pipelines based on contemporary computer vision techniques [4]. Masood et al. [13] proposed a general framework from the viewpoint of computer vision, in which methods such as calibration, preprocessing, segmentation, balancing of classes and cross validation are used for automated melanoma screening. In 2018, Valle et al. [26] investigated ten different methodologies to evaluate deep learning models for skin lesion classification. Data augmentation, model architecture, image resolution, input normalization, train dataset, use of segmentation, test data augmentation, additional use of support vector machines, and use of transfer learning are among the ten methodologies they evaluated. They stated that data augmentation had the greatest impact on model performance. The same observation is confirmed by Perez's 2018 paper "Data Augmentation for Skin Lesion Analysis" [15].
Nonetheless, the problems of low inter-class variance and class imbalance in skin lesion image datasets remain, seriously limiting the capabilities of deep learning models [30]. To address the lack of annotated data, Zunair et al. [32] proposed the use of adversarial training and Bissoto et al. [1] proposed the use of Generative Adversarial Networks to produce realistic synthetic skin lesion images.

Experiment Settings And Method
In this paper, five deep neural networks, namely ResNet34, ResNet50 [6], Inception ResNet v2 [22], DenseNet201 [8] and VGG16 [20], are implemented with the soft attention mechanism to classify skin cancer images. All five are state-of-the-art feature extractors trained on the ImageNet dataset. The main components and architecture of the proposed approach are described below.

Dataset
The experiment is performed on two datasets separately: the HAM10000 dataset [25] and the ISIC 2017 dataset [2].
The HAM10000 dataset [25] consists of 10015 dermatoscopic images of size 450 × 600. It covers 7 diagnostic categories: Melanoma (MEL), Melanocytic Nevi (NV), Basal Cell Carcinoma (BCC), Actinic Keratosis and Intra-Epithelial Carcinoma (AKIEC), Benign Keratosis (BKL), Dermatofibroma (DF), and Vascular lesions (VASC). All the images are resized to 299 x 299 for Inception ResNet v2 [22].
Figure 2. Example of skin lesions in the HAM10000 dataset [25].
The ISIC 2017 dataset [2] has 3 categories: benign nevi, seborrheic keratosis, and melanoma. The test dataset consists of 600 images. In this experiment, we train our model to classify only benign nevi and seborrheic keratosis. All the images are resized to 224 x 224.
The data in both datasets is then cleaned to remove class imbalances. This is done by over-sampling and under-sampling the data so that there is an equal number of images per class. The images are then normalized by dividing each pixel value by 255 to keep the pixel values in the range 0 to 1.
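The balancing and normalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not give the target number of images per class or say how over-sampled copies are produced, so the function names, the `target_per_class` parameter, and the choice of duplicating random samples are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_classes(samples, target_per_class, seed=0):
    """Return a list with exactly target_per_class samples for every label.

    samples: list of (item, label) pairs.
    Classes larger than the target are under-sampled without replacement;
    smaller classes are over-sampled by drawing extra copies with replacement.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    balanced = []
    for label, items in by_label.items():
        if len(items) >= target_per_class:
            chosen = rng.sample(items, target_per_class)
        else:
            extra = rng.choices(items, k=target_per_class - len(items))
            chosen = items + extra
        balanced.extend((item, label) for item in chosen)
    rng.shuffle(balanced)
    return balanced


def normalize_pixels(image):
    """Scale 8-bit pixel values (0..255) into the range 0..1."""
    return [[value / 255.0 for value in row] for row in image]


# Toy data: file names stand in for images; labels follow the HAM10000 codes.
toy = [(f"nv_{i}.jpg", "NV") for i in range(10)]
toy += [(f"df_{i}.jpg", "DF") for i in range(2)]

balanced = balance_classes(toy, target_per_class=5)
counts = Counter(label for _, label in balanced)
scaled = normalize_pixels([[0, 128, 255]])
```

After balancing, `counts` holds five samples for each of the two toy classes, and `scaled` contains values between 0 and 1. In practice, over-sampled copies are often augmented rather than duplicated verbatim, but the text does not say which was done here.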

Soft Attention
In skin lesion images, only a small percentage of pixels are relevant, as the rest of the image is filled with irrelevant artifacts such as veins and hair. Soft attention is implemented to focus on the relevant features of the image. It is inspired by the work of Xu et al. [28] on image caption generation and by the work of Shaikh et al. [18], who applied an attention mechanism to images for handwriting verification; in this paper, soft attention is used to classify skin cancer. In Figure 3, we can see that areas with higher attention are red in color. This is because soft attention suppresses irrelevant areas of the image by multiplying the corresponding feature maps with low weights, so the low-attention areas have weights closer to 0. With more focused information, the model performs better.
In the soft attention module, as discussed in [18] and [23], the feature tensor $t$ that flows down the deep neural network is used as input.
This feature tensor $t \in \mathbb{R}^{h \times w \times d}$ is input to a 3D convolution layer [24] with weights $W_k \in \mathbb{R}^{h \times w \times d \times K}$, where $K$ is the number of 3D weights. The output of this convolution is normalized using the softmax function to generate $K = 16$ attention maps. As shown in Figure 1, these attention maps are aggregated to produce a unified attention map that acts as a weighting function $\alpha$. This $\alpha$ is then multiplied with $t$ to attentively scale the salient feature values, and the result is further scaled by $\gamma$, a learnable scalar. Finally, the attentively scaled features ($f_{sa}$) are concatenated with the original feature $t$ in the form of a residual branch. During training, we initialize $\gamma$ at 0.01 so that the network can slowly learn to regulate the amount of attention it requires.
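To make the data flow of the module concrete, the following pure-Python sketch walks through the normalization, aggregation, scaling and concatenation steps on a tiny tensor. It rests on several assumptions that the text does not spell out: the 3D convolution is not implemented (its output is passed in as `attention_logits`), the softmax is taken over the spatial positions of each map, and the K maps are aggregated by summation. All function names are hypothetical.

```python
import math


def softmax_2d(logits):
    """Softmax over all spatial positions of one h x w map (entries sum to 1)."""
    peak = max(max(row) for row in logits)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def soft_attention(t, attention_logits, gamma=0.01):
    """Apply soft-attention scaling to a feature tensor.

    t: feature tensor as nested lists with shape h x w x d.
    attention_logits: K maps of shape h x w, standing in for the output of
        the 3D convolution (the convolution itself is not implemented here).
    Returns (f_sa, alpha): the scaled features and the unified attention map.
    """
    h, w = len(t), len(t[0])
    maps = [softmax_2d(m) for m in attention_logits]
    # Aggregate the K normalized maps into one weighting function alpha
    # (summation is an assumption; the text only says "aggregated").
    alpha = [[sum(m[i][j] for m in maps) for j in range(w)] for i in range(h)]
    # Scale every channel at position (i, j) by gamma * alpha[i][j].
    f_sa = [[[gamma * alpha[i][j] * c for c in t[i][j]] for j in range(w)]
            for i in range(h)]
    return f_sa, alpha


def concat_channels(t, f_sa):
    """Concatenate t and f_sa along the channel axis (h x w x 2d)."""
    return [[t[i][j] + f_sa[i][j] for j in range(len(t[0]))]
            for i in range(len(t))]


# Toy example: a 2 x 2 feature map with d = 2 channels and K = 2 attention maps.
t = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
logits = [[[0.0, 0.0], [0.0, 4.0]],   # map 1 favours the bottom-right position
          [[0.0, 0.0], [0.0, 0.0]]]   # map 2 is uniform
f_sa, alpha = soft_attention(t, logits, gamma=0.01)
out = concat_channels(t, f_sa)
```

In this toy case the bottom-right position receives the largest weight in `alpha`, and `out` has twice as many channels as `t` because the scaled features are concatenated rather than added. With $\gamma$ starting at 0.01, the attention branch contributes very little at first, which matches the stated intent of letting the network learn how much attention to use.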

Model Setup
This section describes the detailed architecture of all the models. For all experiments, the networks are trained with the Adam optimizer [11], using a learning rate of 0.01 and an epsilon of 0.1. A batch normalization [10] layer is added after each layer in all the networks to introduce some regularization. For the HAM10000 dataset [25], since there are 7 classes of skin cancer, an output layer with 7 units is implemented, followed by a softmax activation unit. All the experiments were executed on the Keras framework.

Inception ResNet v2
In Inception ResNet v2 [22], the soft attention layer is added to the Inception ResNet C block of the model, where the feature size of the image is 8 x 8, as shown in Figure 5a. In this case, the soft attention layer is followed by a maxpool layer with a pool size of 2x2, which is then concatenated with the filter concatenate layer of the inception block. The concatenate layer is then followed by a relu activation unit. To regularize the output of the attention layer, the activation unit is followed by a 0.5 dropout layer [21], as in Figure 4. The network is trained for 150 epochs with an early stopping patience of 30. The overall network is shown in Figure 5a.
Figure 5. 5a. End-to-end architecture of Inception ResNet v2 [22] with Soft Attention block. 5b. End-to-end architecture of ResNet34 [6] with Soft Attention block. conv x indicates convolution blocks, where x is the block number.

DenseNet201
In DenseNet201 [8], the soft attention layer is added to the 4th dense block, where the size of the feature map is 7 x 7, as shown in Figure 6. As in the previous model, the soft attention layer is integrated using the same procedure as for the Inception ResNet v2 [22] architecture (see Figure 4). The network is trained for 150 epochs with an early stopping patience of 35.
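The interaction between the epoch budget and the early stopping patience can be illustrated with a short sketch. It assumes that the monitored quantity is a validation loss that should decrease and that training stops once it has failed to improve for `patience` consecutive epochs; the paper states neither the monitored metric nor any minimum-improvement threshold, and the loss values below are invented.

```python
def run_with_early_stopping(val_losses, max_epochs, patience):
    """Return (epochs_run, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch


# Hypothetical loss curve: improves for 3 epochs, then plateaus.
plateau_losses = [0.9, 0.7, 0.6] + [0.65] * 200
stopped = run_with_early_stopping(plateau_losses, max_epochs=150, patience=35)

# Hypothetical loss curve that keeps improving: the epoch budget is the limit.
improving_losses = [1.0 / (i + 1) for i in range(200)]
full_run = run_with_early_stopping(improving_losses, max_epochs=150, patience=35)
```

With a patience of 35, the plateauing run stops 35 epochs after its best epoch, while the steadily improving run uses all 150 epochs.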

ResNet34 and ResNet50
In ResNet34 [6], a soft attention layer is added after the 3rd convolution block, where the size of the feature map is 28 x 28, as shown in Figure 5b, whereas in ResNet50 [6] the soft attention layer is added after the 5th convolution block, where the size of the feature map is 7 x 7. In both cases, the soft attention layer is followed by a maxpool layer with a pool size of 2x2, which is then concatenated with the standard maxpool layer of the architecture, as shown in Figure 4. The concatenate layer is then followed by a relu activation unit.
To regularize the output of the attention layer, the activation unit is followed by a 0.5 dropout [21] layer. This is the same approach used to integrate the soft attention module with the Inception ResNet v2 [22] architecture. The overall architecture of the ResNet34 model is shown in Figure 5b.

VGG16
In VGG16 [20], the soft attention layer is added after Conv layer 4 of the VGG16 architecture, where the size of the feature map is 28 x 28. As in the previous models, the soft attention layer is integrated using the same procedure as for the ResNet [6] and Inception ResNet v2 [22] architectures (see Figure 4). The network is trained for 300 epochs with an early stopping patience of 65. The overall architecture of the model is shown in Figure 8.
In Figure 8, a Conv layer block consists of two to three convolution layers, with the number of filters ranging from 64 to 512, followed by a maxpool layer. Conv layer 1 and Conv layer 2 consist of two convolution layers each, with 64 and 128 filters respectively, and Conv layer 3, Conv layer 4 and Conv layer 5 consist of three convolution layers each, with 256, 512 and 512 filters respectively.

Loss Function
In this experiment, there are seven different classes of skin cancer. Hence, the categorical cross-entropy loss ($L_{CCE}$) is used to optimize the neural network:
$$L_{CCE} = -\sum_{i \in C} t_i \log\big(f(s)_i\big), \qquad f(s)_i = \frac{e^{s_i}}{\sum_{j \in C} e^{s_j}}$$
Here, as there are seven classes, $C = \{0, \dots, 6\}$, $t_i$ is the ground truth and $s_i$ is the CNN score for each class $i$ in $C$. $f(s)_i$ is the softmax activation function applied to the scores.
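As a small worked example, the sketch below computes the categorical cross-entropy for a single sample in its standard form, $-\sum_i t_i \log f(s)_i$ with a softmax over the scores. The scores and the one-hot target are invented for illustration, and the function names are not taken from the paper's code.

```python
import math


def softmax(scores):
    """f(s)_i = exp(s_i) / sum_j exp(s_j), computed in a numerically stable way."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(targets, scores):
    """L_CCE = -sum_i t_i * log(f(s)_i) for one sample."""
    probs = softmax(scores)
    # Terms with t_i = 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)


# Seven classes; the ground truth is class index 4 (one-hot encoded).
targets = [0, 0, 0, 0, 1, 0, 0]
uniform_scores = [0.0] * 7
confident_scores = [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0]

# With uniform scores every class has probability 1/7, so the loss is log(7).
loss_uniform = categorical_cross_entropy(targets, uniform_scores)
# A high score on the correct class yields a smaller loss.
loss_confident = categorical_cross_entropy(targets, confident_scores)
```

The uniform case gives a useful reference point: a seven-class model that has learned nothing starts near a loss of $\log 7 \approx 1.95$.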

Evaluation Metrics
In this paper, the model is evaluated using Precision, Accuracy, Sensitivity, Specificity and AUC scores [9], where
$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Accuracy} = \frac{TP + TN}{T}, \quad \text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}$$
Here TN, TP, FP, FN and T denote True Negatives, True Positives, False Positives, False Negatives and Total Number respectively.
Table 1 lists the performance of all the models in terms of precision and AUC score on the HAM10000 dataset [25]. In this table, (+SA) stands for models with soft attention, IRv2 stands for Inception ResNet v2 [22], and ResNet34 and ResNet50 stand for the two ResNet [6] variants; [22] refers to the IRv2 architecture, [8] to the DenseNet201 architecture, [20] to the VGG16 architecture, and [6] to the ResNet architecture.
From the table, it can be observed that IRv2 coupled with SA (IRv2+SA) shows significant improvements, with a precision and AUC score of 93.7% and 98.4% respectively, which are also the highest scores among all models. Furthermore, Soft Attention (SA) boosts the precision of IRv2 by 3.2% compared to the original IRv2 model. The same holds for VGG16, ResNet34, ResNet50 and DenseNet201: SA boosts the precision of DenseNet201 [8], ResNet34 [6], ResNet50 [6], and VGG16 [20] by 0.5%, 0.8%, 1.2% and 2% respectively. We see similar behaviour for the AUC scores when the SA block is integrated into the networks: the AUC of ResNet50 [6] and ResNet34 [6] increases by 0.6% and 1.5% respectively, while the AUC of DenseNet201 [8] and VGG16 [20] is on par with the original models.
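The count-based metrics defined in this section can be computed directly from the confusion-matrix entries, as in the sketch below. It assumes that the total number T equals TP + TN + FP + FN; the counts are made up for illustration and are not results from the paper. AUC is omitted because it depends on ranked scores rather than on these four counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four count-based metrics from confusion-matrix entries.

    Assumes every denominator is non-zero, i.e. there is at least one
    predicted positive, one actual positive and one actual negative.
    """
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up counts for one binary decision (not results from the paper).
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

For these counts the sensitivity is 0.8 and the specificity is 0.9, which shows how the two can diverge: 20 of the 100 actual positives are missed, while only 10 of the 100 actual negatives are flagged.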

Ablation Analysis
IRv2+SA performs best in terms of the weighted average (W.Avg). Looking at its class-wise performance, Soft Attention improves the precision of the original IRv2 on AKIEC, BCC, DF and NV by 17%, 3%, 33% and 4% respectively. Moreover, when comparing AUC scores, IRv2+SA performs better on BKL and MEL by 1.2% and 0.9% respectively, while on BCC, NV and VASC it performs as well as the original model.
We therefore select IRv2 coupled with SA (IRv2+SA) for our experiments. Because the SA block consistently boosts the performance of its original counterpart, the integration of Soft Attention into the networks is justified.

Quantitative Analysis
When we tested the model with different train-test splits on the HAM10000 dataset [25], we found that the model with 85% training data outperforms the models with 80% and 70% training data by 2.2% and 2.6% respectively, as shown in Table 2. Furthermore, the proposed approach is compared with state-of-the-art models for skin cancer classification on the HAM10000 dataset [25] in Table 3. Our Soft Attention-based approach outperforms the baseline [16] by 4.7% in terms of precision. In terms of AUC scores, it outperforms all of the compared models by 0.5% to 4.3%.

Figure 9. Comparison of Grad-CAM [17] heatmaps with our Soft Attention (SA) maps on the HAM10000 dataset [25]. Panels show SA map and Grad-CAM heatmap pairs.
Table 4. Comparison with state-of-the-art models in terms of AUC, accuracy, sensitivity and specificity scores on the ISIC-2017 dataset [2].
In Table 4, IRv2 5x5 +SA and IRv2 12x12 +SA denote models in which the attention layer was added where the feature map size is 5x5 and 12x12 respectively. Of the two models with soft attention, IRv2 5x5 +SA outperforms IRv2 12x12 +SA in terms of AUC, accuracy, and specificity by 2.4%, 0.6%, and 12.2% respectively, whereas IRv2 12x12 +SA outperforms IRv2 5x5 +SA in terms of sensitivity by 2.9%. When IRv2 5x5 +SA is compared with ARL-CNN50 [31] (the baseline model), it performs on par in terms of AUC score, and it outperforms the baseline in accuracy and sensitivity by 3.6% and 3.8% respectively. ARL-CNN50 [31], however, has the higher specificity, by 3.4%. Sensitivity measures the proportion of correctly identified positives and specificity measures the proportion of correctly identified negatives. We prioritize sensitivity because classifying a person with cancer as not having cancer is riskier than the reverse.
In Figure 9, we compare pairs of Soft Attention maps and Grad-CAM [17] heatmaps. In the first pair, the SA map focuses on the main part of the lesion area, whereas the Grad-CAM heatmap is slightly shifted towards the top left and also spreads over the uninfected area of skin. We have similar observations for the second and third pairs. These observations suggest that the Soft Attention maps are focused more on the relevant locations of the image than the Grad-CAM [17] heatmaps.

Conclusion
In this paper, we present the implementation and utility of the Soft Attention mechanism applied during image encoding to tackle the problem of high-resolution skin cancer image classification. The model outperformed the current state-of-the-art approaches on the HAM10000 dataset [25] and the ISIC-2017 dataset [2]. This demonstrates the potential and effectiveness of the Soft Attention-based deep learning architecture in image analysis. The Soft Attention mechanism also eliminates the need for external mechanisms like Grad-CAM [17]: it internally provides the locations on which the model focuses while categorizing a disease, while also boosting the performance of the main network. Soft Attention has the added advantage of dealing with image noise internally. In the future, this model can be implemented in dermoscopy systems to assist dermatologists. The mechanism can also be readily applied to classify data from other medical databases.