PT - JOURNAL ARTICLE
AU - Nishanth Arun
AU - Nathan Gaw
AU - Praveer Singh
AU - Ken Chang
AU - Mehak Aggarwal
AU - Bryan Chen
AU - Katharina Hoebel
AU - Sharut Gupta
AU - Jay Patel
AU - Mishka Gidwani
AU - Julius Adebayo
AU - Matthew D. Li
AU - Jayashree Kalpathy-Cramer
TI - Assessing the (Un)Trustworthiness of Saliency Maps for Localizing Abnormalities in Medical Imaging
AID - 10.1101/2020.07.28.20163899
DP - 2021 Jan 01
TA - medRxiv
PG - 2020.07.28.20163899
4099 - http://medrxiv.org/content/early/2021/07/15/2020.07.28.20163899.short
4100 - http://medrxiv.org/content/early/2021/07/15/2020.07.28.20163899.full
AB - Purpose: To evaluate the trustworthiness of saliency maps for abnormality localization in medical imaging.
Materials and Methods: Using two large publicly available radiology datasets (SIIM-ACR Pneumothorax Segmentation and RSNA Pneumonia Detection), we quantified the performance of eight commonly used saliency map techniques with regard to their 1) localization utility (segmentation and detection), 2) sensitivity to model weight randomization, 3) repeatability, and 4) reproducibility. We compared their performance against baseline methods and localization network architectures, using area under the precision-recall curve (AUPRC) and structural similarity index (SSIM) as metrics.
Results: All eight saliency map techniques failed at least one of the criteria and were inferior in performance compared to localization networks. For pneumothorax segmentation, the AUPRC ranged from 0.024-0.224, while a U-Net achieved a significantly superior AUPRC of 0.404 (p<0.005). For pneumonia detection, the AUPRC ranged from 0.160-0.519, while a RetinaNet achieved a significantly superior AUPRC of 0.596 (p<0.005). Five and two saliency methods (out of eight) failed the model randomization test on the segmentation and detection datasets, respectively, suggesting that these methods are not sensitive to changes in model parameters.
The repeatability and reproducibility of the majority of the saliency methods were worse than those of localization networks for both the segmentation and detection datasets.
Conclusion: We suggest that the use of saliency maps in the high-risk domain of medical imaging warrants additional scrutiny and recommend that detection or segmentation models be used if localization is the desired output of the network.
Supplemental material is available for this article.
Summary: The use of saliency maps to interpret deep neural networks trained on medical imaging fails several key criteria for utility and robustness, highlighting the need for scrutiny before clinical application.
Key Points:
Eight popular saliency map techniques were evaluated for their utility and robustness in interpreting deep neural networks trained on chest radiographs.
All the saliency map techniques fail at least one of the criteria defined in the paper, which makes their use in high-risk medical applications problematic.
Instead, the use of detection or segmentation models is recommended if localization is the ultimate goal of interpretation.
Competing Interest Statement: J. Kalpathy-Cramer has research funding from Genentech and GE.
Funding Statement: Research reported in this publication was supported by a training grant from the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under award number 5T32EB1680 to K. Chang and J. B. Patel and by the National Cancer Institute (NCI) of the National Institutes of Health under Award Number F30CA239407 to K. Chang. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This publication was supported by the Martinos Scholars fund to K. Hoebel. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Martinos Scholars fund.
This study was supported by National Institutes of Health (NIH) grants U01CA154601, U24CA180927, U24CA180918, and U01CA242879, and National Science Foundation (NSF) grant NSF1622542 to J. Kalpathy-Cramer. This research was carried out in whole or in part at the Athinoula A. Martinos Center for Biomedical Imaging at the Massachusetts General Hospital, using resources provided by the Center for Functional Neuroimaging Technologies, P41EB015896, a P41 Biotechnology Resource Grant supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The datasets were obtained from online Kaggle competitions and were already anonymized.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
We train our models and generate saliency maps using publicly available chest x-ray (CXR) images from the SIIM-ACR Pneumothorax Segmentation and RSNA Pneumonia Detection datasets, which are openly available online at the links below.
https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation
https://www.kaggle.com/c/rsna-pneumonia-detection-challenge
2D: Two-dimensional
ACR: American College of Radiology
AUC: Area Under the Curve
AUPRC: Area Under the Precision-Recall Curve
AUROC: Area Under the Receiver Operating Characteristic Curve
AVG: Average of all masks (bounding boxes/segmentations) across the training and validation datasets
CNN: Convolutional Neural Network
FN: False Negatives
FP: False Positives
GBP: Guided-backprop
GGCAM: Guided Gradient-weighted Class Activation Mapping
GRAD: Gradient Explanation
GradCAM: Gradient-weighted Class Activation Mapping
IG: Integrated Gradients
LOW: Low baseline
PR: Precision-Recall
ReLU: Rectified Linear Unit
RNET: RetinaNet
ROC: Receiver Operating Characteristic
RSNA: Radiological Society of North America
SG: Smoothgrad
SIG: Smooth IG
SIIM: Society for Imaging Informatics in Medicine
SSIM: Structural Similarity Index Measure
TP: True Positives
UNET: U-Net
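
The abstract names two metrics: AUPRC for localization utility and SSIM for comparing saliency maps (for example, before and after model weight randomization). The following is a minimal sketch of both, not the authors' implementation. The function names `average_precision` and `global_ssim` are hypothetical. Maps and masks are assumed to be flattened to one value per pixel. The SSIM shown is a simplified single-window (global) variant with the standard constants K1 = 0.01 and K2 = 0.03, whereas library implementations normally average SSIM over local sliding windows.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    scores: flattened saliency values, one per pixel (assumed layout).
    labels: flattened ground-truth mask, 1 = abnormal pixel, 0 = background.
    Tied scores are ranked in their original order (an arbitrary choice).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("mask contains no positive pixels")
    # Rank pixels from most to least salient.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            true_pos += 1
            ap += true_pos / rank  # precision at each new recall level
    return ap / total_pos


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 1.0) -> float:
    """Single-window SSIM between two flattened maps of equal length.

    Simplification: statistics are computed over the whole map rather
    than over local windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("maps must be non-empty and of equal length")
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    saliency = [0.9, 0.8, 0.7, 0.1]   # toy saliency map, flattened
    mask = [1, 0, 1, 0]               # toy ground-truth mask, flattened
    print(average_precision(saliency, mask))
    print(global_ssim(saliency, saliency))
```

Under this sketch, a saliency method that is sensitive to model parameters would be expected to give a low SSIM between the map from the trained model and the map from a weight-randomized model. The cutoff that counts as failing the randomization test is defined in the paper itself and is not reproduced here.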