Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
- 91 Downloads
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our approach—Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g.VGG), (2) CNNs used for structured outputs (e.g.captioning), (3) CNNs used in tasks with multi-modal inputs (e.g.visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models learn to localize discriminative regions of input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al. in Computer vision and pattern recognition, 2017) to provide textual explanations for model decisions. Finally, we design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at https://github.com/ramprs/grad-cam/, along with a demo on CloudCV (Agrawal et al., in: Mobile cloud visual media computing, pp 265–290. Springer, 2015) (http://gradcam.cloudcv.org) and a video at http://youtu.be/COjUB9Izk6E.
KeywordsGrad-CAM Visual explanations Visualizations Explanations Interpretability Transparency
This work was funded in part by NSF CAREER awards to DB and DP, DARPA XAI Grant to DB and DP, ONR YIP awards to DP and DB, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to DB and DP, Google Faculty Research Awards to DP and DB, Amazon Academic Research Awards to DP and DB, AWS in Education Research Grant to DB, and NVIDIA GPU donations to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor. Funding was provided by Virginia Polytechnic Institute and State University.
- Agrawal, A., Batra, D., & Parikh, D. (2016). Analyzing the behavior of visual question answering models. In EMNLP.Google Scholar
- Agrawal, H., Mathialagan, C. S., Goyal, Y., Chavali, N., Banik, P., Mohapatra, A., Osman, A., & Batra, D. (2015). CloudCV: Large scale distributed computer vision as a cloud service. In Mobile cloud visual media computing (pp. 265–290). Springer.Google Scholar
- Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In ICCV.Google Scholar
- Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations. In computer vision and pattern recognition.Google Scholar
- Bazzani, L., Bergamo, A., Anguelov, D., Torresani, L. (2016). Self-taught object localization with deep networks. In WACV.Google Scholar
- Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- Cinbis, R. G., Verbeek, J., & Schmid, C. (2016). Weakly supervised object localization with multi-fold multiple instance learning. IEEE transactions on pattern analysis and machine intelligence.Google Scholar
- Das, A., Agrawal, H., Zitnick, C. L., Parikh, D., & Batra, D. (2016). Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP.Google Scholar
- Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., & Batra, D. (2018). Embodied question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).Google Scholar
- Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., & Batra, D. (2017a). Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).Google Scholar
- Das, A., Kottur, S., Moura, J. M., Lee, S., & Batra, D. (2017b). Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE international conference on computer vision (ICCV).Google Scholar
- de Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., & Courville, A. C. (2017). Guesswhat?! visual object discovery through multi-modal dialogue. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).Google Scholar
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.Google Scholar
- Dosovitskiy, A., & Brox, T. (2015). Inverting convolutional networks with convolutional networks. In CVPR.Google Scholar
- Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal, 1341.Google Scholar
- Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2009). The PASCAL visual object classes challenge 2007 (VOC2007) results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
- Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., et al. (2015). From captions to visual concepts and back. In CVPR.Google Scholar
- Gan, C., Wang, N., Yang, Y., Yeung, D.-Y., & Hauptmann, A. G. (2015). Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR.Google Scholar
- Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? In NIPS: Dataset and methods for multilingual image question answering.Google Scholar
- Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.Google Scholar
- Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. stat.Google Scholar
- Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., & Farhadi, A. (2017). Iqa: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.Google Scholar
- Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV.Google Scholar
- Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM MM.Google Scholar
- Johns, E., Mac Aodha, O., & Brostow, G. J. (2015). Becoming the expert—interactive multi-class machine teaching. In CVPR.Google Scholar
- Johnson, J., Karpathy, A., & Fei-Fei, L. (2016). DenseCap: Fully convolutional localization networks for dense captioning. In CVPR.Google Scholar
- Karpathy, A. (2014). What I learned from competing against a ConvNet on ImageNet. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/.
- Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR.Google Scholar
- Kolesnikov, A., & Lampert, C. H. (2016). Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV.Google Scholar
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.Google Scholar
- Lin, M., Chen, Q., & Yan, S. (2014a). Network in network. In ICLR.Google Scholar
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014b). Microsoft coco: Common objects in context. In ECCV.Google Scholar
- Lipton, Z. C. (2016). The mythos of model interpretability. arXiv preprint arXiv:1606.03490v3.
- Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.Google Scholar
- Lu, J., Lin, X., Batra, D., & Parikh, D. (2015). Deeper LSTM and normalized CNN visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN.
- Lu, J., Yang, J., Batra, D., & Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In NIPS.Google Scholar
- Mahendran, A., & Vedaldi, A. (2016a). Salient deconvolutional networks. In European conference on computer vision.Google Scholar
- Mahendran, A., & Vedaldi, A. (2016b). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 1–23.Google Scholar
- Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In ICCV.Google Scholar
- Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In CVPR.Google Scholar
- Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2015). Is object localization for free?—weakly-supervised learning with convolutional neural networks. In CVPR.Google Scholar
- Pinheiro, P. O., & Collobert, R. (2015). From image-level to pixel-level labeling with convolutional networks. In CVPR.Google Scholar
- Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In NIPS.Google Scholar
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should i trust you?” Explaining the predictions of any classifier. In SIGKDD.Google Scholar
- Selvaraju, R. R., Chattopadhyay, P., Elhoseiny, M., Sharma, T., Batra, D., Parikh, D., & Lee, S. (2018). Choose your neuron: Incorporating domain knowledge through neuron-importance. In Proceedings of the European conference on computer vision (ECCV) (pp. 526–541).Google Scholar
- Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR. arXiv:1610.02391
- Selvaraju, R. R., Lee, S., Shen, Y., Jin, H., Ghosh, S., Heck, L., Batra, D., & Parikh, D. (2019) Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the international conference on computer vision (ICCV).Google Scholar
- Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.Google Scholar
- Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR. arXiv:1312.6034
- Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. A. (2014). Striving for simplicity: The all convolutional net. CoRR. arXiv:1412.6806
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).Google Scholar
- Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In CVPR.Google Scholar
- Vondrick, C., Khosla, A., Malisiewicz, T., & Torralba, A. (2013). HOGgles: Visualizing object detection features. ICCV.Google Scholar
- Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV.Google Scholar
- Zhang, J., Lin, Z., Brandt, J., Shen, X., & Sclaroff, S. (2016). Top-down neural attention by excitation backprop. In ECCV.Google Scholar
- Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., & Torralba, A. (2014). Object detectors emerge in deep scene cnns. CoRR. arXiv:1412.6856
- Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In CVPR.Google Scholar
- Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence.Google Scholar