Abstract
Many machine learning systems deployed for real-world applications such as recommender systems, image captioning, and object detection are ensembles of multiple models, and the top-ranked systems in many data-mining and computer-vision competitions also use ensembles. Although ensembles are popular, they are opaque and hard to interpret. Explanations make AI systems more transparent and justify their predictions; however, there has been little work on generating explanations for ensembles. In this chapter, we propose two new methods for ensembling visual explanations for VQA using the localization maps of the component systems. Our approach scales with the number of component models in the ensemble. Evaluating explanations is itself a challenging research problem, so we introduce two new evaluation approaches: the comparison metric and the uncovering metric. A crowd-sourced human evaluation indicates that our ensemble visual explanations qualitatively outperform each individual system's explanations by a significant margin. Overall, our ensemble explanation is judged better than any individual system's explanation 61% of the time, and it is sufficient for humans to arrive at the correct answer, based on the explanation alone, at least 64% of the time.
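The abstract describes combining the localization maps of component VQA models into a single ensemble explanation. The chapter's exact methods are not reproduced here; the following is a minimal illustrative sketch of one plausible scheme, a weighted average of per-model heat maps (the function name, normalization step, and weighting are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def ensemble_localization_maps(maps, weights=None):
    """Combine per-model localization maps into one ensemble map.

    maps    : list of 2-D arrays, one heat map per component model.
    weights : optional per-model weights (e.g., validation accuracy);
              defaults to a uniform average.
    """
    maps = [np.asarray(m, dtype=float) for m in maps]
    # Min-max normalize each map to [0, 1] so that no single
    # model dominates merely because of its output scale.
    normed = []
    for m in maps:
        rng = m.max() - m.min()
        normed.append((m - m.min()) / rng if rng > 0 else np.zeros_like(m))
    if weights is None:
        weights = np.ones(len(normed))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # weights sum to 1
    # Weighted average of the normalized maps.
    return sum(w * m for w, m in zip(weights, normed))
```

The resulting map stays in [0, 1] and can be overlaid on the input image; non-uniform weights let stronger component models contribute more to the ensemble explanation.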
Notes
1. Based on the performance reported on the CodaLab leaderboard and human performance reported on the task in Antol et al. (2015).
References
Agrawal A, Batra D, Parikh D (2016) Analyzing the behavior of visual question answering models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP2016)
Aha DW, Darrell T, Pazzani M, Reid D, Sammut C, Stone P (eds) (2017) Explainable Artificial Intelligence (XAI) Workshop at IJCAI. URL http://home.earthlink.net/~dwaha/research/meetings/ijcai17-xai/
Andreas J, Rohrbach M, Darrell T, Klein D (2016a) Learning to compose neural networks for question answering. In: Proceedings of NAACL2016
Andreas J, Rohrbach M, Darrell T, Klein D (2016b) Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 39–48
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) VQA: Visual Question Answering. In: Proceedings of ICCV2015
Bau D, Zhou B, Khosla A, Oliva A, Torralba A (2017) Network dissection: Quantifying interpretability of deep visual representations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3319–3327
Berg T, Belhumeur PN (2013) How do you tell a blackbird from a crow? In: Proceedings of ICCV2013
Chen K, Wang J, Chen LC, Gao H, Xu W, Nevatia R (2015) ABC-CNN: An attention based convolutional neural network for Visual Question Answering. arXiv preprint arXiv:151105960
Das A, Agrawal H, Zitnick L, Parikh D, Batra D (2017) Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding 163:90–100
Dietterich T (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Springer-Verlag, pp 1–15
Fridman L, Jenik B, Terwilliger J (2018) DeepTraffic: Driving Fast through Dense Traffic with Deep Reinforcement Learning. arXiv preprint arXiv:180102805
Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal Compact Bilinear pooling for Visual Question Answering and Visual Grounding. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP2016)
Gao Y, Beijbom O, Zhang N, Darrell T (2016) Compact bilinear pooling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 317–326
Goyal Y, Mohapatra A, Parikh D, Batra D (2016) Towards Transparent AI Systems: Interpreting Visual Question Answering Models. In: International Conference on Machine Learning (ICML) Workshop on Visualization for Deep Learning
Gunning D (2016) Explainable Artificial Intelligence (XAI), DARPA Broad Agency Announcement, URL https://www.darpa.mil/attachments/DARPA-BAA-16-53.pdf
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hendricks LA, Akata Z, Rohrbach M, Donahue J, Schiele B, Darrell T (2016) Generating Visual Explanations. In: Proceedings of the European Conference on Computer Vision (ECCV2016)
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural computation 9(8):1735–1780
Johns E, Mac Aodha O, Brostow GJ (2015) Becoming the expert-interactive multi-class machine teaching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2015)
Kafle K, Kanan C (2016) Answer-type prediction for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016)
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: Common Objects in Context. In: European Conference on Computer Vision (ECCV2014), Springer, pp 740–755
Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answering. In: Advances In Neural Information Processing Systems (NIPS2016), pp 289–297
Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In: Advances in Neural Information Processing Systems (NIPS2014), pp 1682–1690
Noh H, Hongsuck Seo P, Han B (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 30–38
Park DH, Hendricks LA, Akata Z, Schiele B, Darrell T, Rohrbach M (2016) Attentive explanations: Justifying decisions and pointing to the evidence. arXiv preprint arXiv:161204757
Rajani NF, Mooney RJ (2016) Combining Supervised and Unsupervised Ensembles for Knowledge Base Population. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP2016), URL http://www.cs.utexas.edu/users/ai-lab/pub-view.php?PubID=127566
Rajani NF, Mooney RJ (2017) Stacking With Auxiliary Features. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI2017), Melbourne, Australia
Rajani NF, Mooney RJ (2018) Stacking With Auxiliary Features for Visual Question Answering. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2016)
Ross AS, Hughes MC, Doshi-Velez F (2017) Right for the right reasons: Training differentiable models by constraining their explanations. In: Proceedings of IJCAI2017
Samek W, Binder A, Montavon G, Lapuschkin S, Müller KR (2017) Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: The IEEE International Conference on Computer Vision (ICCV2017)
Simonyan K, Zisserman A (2015) Very Deep Convolutional Networks for Large-scale Image Recognition. In: Proceedings of ICLR2015
Viswanathan V, Rajani NF, Bentor Y, Mooney RJ (2015) Stacked Ensembles of Information Extractors for Knowledge-Base Population. In: Association for Computational Linguistics (ACL2015), Beijing, China, pp 177–187
Wolpert DH (1992) Stacked Generalization. Neural Networks 5:241–259
Xu H, Saenko K (2016) Ask, Attend and Answer: Exploring question-guided spatial attention for visual question answering. In: Proceedings of ECCV2016
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015a) Object detectors emerge in deep scene CNNs. In: Proceedings of the International Conference on Learning Representations (ICLR2015)
Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015b) Simple baseline for visual question answering. arXiv preprint arXiv:151202167
© 2018 Springer Nature Switzerland AG
Cite this chapter
Rajani, N.F., Mooney, R.J. (2018). Ensembling Visual Explanations. In: Escalante, H., et al. Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-98131-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98130-7
Online ISBN: 978-3-319-98131-4
eBook Packages: Computer Science (R0)