Ensembling Visual Explanations

  • Chapter

Part of the book series: The Springer Series on Challenges in Machine Learning (SSCML)

Abstract

Many machine learning systems deployed for real-world applications, such as recommender systems, image captioning, and object detection, are ensembles of multiple models, and the top-ranked systems in many data-mining and computer vision competitions are also ensembles. Although ensembles are popular, they are opaque and hard to interpret. Explanations make AI systems more transparent and also justify their predictions; however, there has been little work on generating explanations for ensembles. In this chapter, we propose two new methods for ensembling visual explanations for visual question answering (VQA) using the localization maps of the component systems. Our approach scales with the number of component models in the ensemble. Evaluating explanations is itself a challenging research problem, and we introduce two new approaches to evaluating explanations: the comparison metric and the uncovering metric. Our crowd-sourced human evaluation indicates that our ensemble visual explanations significantly outperform each individual system's visual explanations in quality. Overall, our ensemble explanation is judged better 61% of the time when compared to any individual system's explanation, and it is sufficient for humans to arrive at the correct answer, based on the explanation alone, at least 64% of the time.
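
The abstract describes combining the localization maps produced by the component VQA models into a single ensemble explanation map. As a rough illustration only, and not the chapter's actual method, the sketch below merges per-model heat maps with a simple weighted average (weights standing in for, say, each model's validation accuracy) and rescales the result for overlay on the image; the function name, the weighting scheme, and the normalization step are all assumptions.

    import numpy as np

    def ensemble_localization_maps(maps, weights=None):
        """Illustrative (assumed) way to merge per-model localization maps.

        maps    : list of 2-D numpy arrays, one heat map per component VQA model,
                  all assumed to share the same spatial resolution.
        weights : optional per-model weights (e.g. validation accuracies);
                  defaults to uniform weighting.
        """
        maps = [np.asarray(m, dtype=float) for m in maps]
        if weights is None:
            weights = np.ones(len(maps))
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()          # normalize weights to sum to 1

        combined = sum(w * m for w, m in zip(weights, maps))

        # Rescale to [0, 1] so the result can be overlaid on the image as a heat map.
        combined = combined - combined.min()
        if combined.max() > 0:
            combined = combined / combined.max()
        return combined

    # Hypothetical usage: three component models, weighted by assumed accuracies.
    m1, m2, m3 = (np.random.rand(14, 14) for _ in range(3))
    ensemble_map = ensemble_localization_maps([m1, m2, m3], weights=[0.64, 0.61, 0.59])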


Notes

  1. Based on the performance reported on the CodaLab leaderboard and the human performance reported on the task in Antol et al. (2015).

References

  • Agrawal A, Batra D, Parikh D (2016) Analyzing the behavior of visual question answering models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP2016)

  • Aha DW, Darrell T, Pazzani M, Reid D, Sammut C, Stone P (eds) (2017) Explainable Artificial Intelligence (XAI) Workshop at IJCAI. URL http://home.earthlink.net/~dwaha/research/meetings/ijcai17-xai/

  • Andreas J, Rohrbach M, Darrell T, Klein D (2016a) Learning to compose neural networks for question answering. In: Proceedings of NAACL2016

  • Andreas J, Rohrbach M, Darrell T, Klein D (2016b) Neural module networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 39–48

  • Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) VQA: Visual Question Answering. In: Proceedings of ICCV2015

  • Bau D, Zhou B, Khosla A, Oliva A, Torralba A (2017) Network dissection: Quantifying interpretability of deep visual representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), pp 3319–3327

  • Berg T, Belhumeur PN (2013) How do you tell a blackbird from a crow? In: Proceedings of ICCV2013

  • Chen K, Wang J, Chen LC, Gao H, Xu W, Nevatia R (2015) ABC-CNN: An attention based convolutional neural network for Visual Question Answering. arXiv preprint arXiv:1511.05960

  • Das A, Agrawal H, Zitnick L, Parikh D, Batra D (2017) Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding 163:90–100

  • Dietterich T (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Springer-Verlag, pp 1–15

  • Fridman L, Jenik B, Terwilliger J (2018) DeepTraffic: Driving Fast through Dense Traffic with Deep Reinforcement Learning. arXiv preprint arXiv:1801.02805

  • Fukui A, Park DH, Yang D, Rohrbach A, Darrell T, Rohrbach M (2016) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP2016)

  • Gao Y, Beijbom O, Zhang N, Darrell T (2016) Compact bilinear pooling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 317–326

  • Goyal Y, Mohapatra A, Parikh D, Batra D (2016) Towards Transparent AI Systems: Interpreting Visual Question Answering Models. In: International Conference on Machine Learning (ICML) Workshop on Visualization for Deep Learning

  • Gunning D (2016) Explainable Artificial Intelligence (XAI), DARPA Broad Agency Announcement. URL https://www.darpa.mil/attachments/DARPA-BAA-16-53.pdf

  • He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 770–778

  • Hendricks LA, Akata Z, Rohrbach M, Donahue J, Schiele B, Darrell T (2016) Generating Visual Explanations. In: Proceedings of the European Conference on Computer Vision (ECCV2016)

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780

  • Johns E, Mac Aodha O, Brostow GJ (2015) Becoming the expert - interactive multi-class machine teaching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2015)

  • Kafle K, Kanan C (2016) Answer-type prediction for visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016)

  • Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV2014), Springer, pp 740–755

  • Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answering. In: Advances in Neural Information Processing Systems (NIPS2016), pp 289–297

  • Malinowski M, Fritz M (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In: Advances in Neural Information Processing Systems (NIPS2014), pp 1682–1690

  • Noh H, Hongsuck Seo P, Han B (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), pp 30–38

  • Park DH, Hendricks LA, Akata Z, Schiele B, Darrell T, Rohrbach M (2016) Attentive explanations: Justifying decisions and pointing to the evidence. arXiv preprint arXiv:1612.04757

  • Rajani NF, Mooney RJ (2016) Combining Supervised and Unsupervised Ensembles for Knowledge Base Population. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP2016). URL http://www.cs.utexas.edu/users/ai-lab/pub-view.php?PubID=127566

  • Rajani NF, Mooney RJ (2017) Stacking With Auxiliary Features. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI2017), Melbourne, Australia

  • Rajani NF, Mooney RJ (2018) Stacking With Auxiliary Features for Visual Question Answering. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

  • Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2016)

  • Ross AS, Hughes MC, Doshi-Velez F (2017) Right for the right reasons: Training differentiable models by constraining their explanations. In: Proceedings of IJCAI2017

  • Samek W, Binder A, Montavon G, Lapuschkin S, Müller KR (2017) Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems

  • Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV2017)

  • Simonyan K, Zisserman A (2015) Very Deep Convolutional Networks for Large-scale Image Recognition. In: Proceedings of ICLR2015

  • Viswanathan V, Rajani NF, Bentor Y, Mooney RJ (2015) Stacked Ensembles of Information Extractors for Knowledge-Base Population. In: Proceedings of the Association for Computational Linguistics (ACL2015), Beijing, China, pp 177–187

  • Wolpert DH (1992) Stacked Generalization. Neural Networks 5:241–259

  • Xu H, Saenko K (2016) Ask, Attend and Answer: Exploring question-guided spatial attention for visual question answering. In: Proceedings of ECCV2016

  • Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015a) Object detectors emerge in deep scene CNNs. In: Proceedings of the International Conference on Learning Representations (ICLR2015)

  • Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015b) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167


Author information


Correspondence to Nazneen Fatema Rajani.



Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Rajani, N.F., Mooney, R.J. (2018). Ensembling Visual Explanations. In: Escalante, H., et al. Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-319-98131-4_7


  • DOI: https://doi.org/10.1007/978-3-319-98131-4_7


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98130-7

  • Online ISBN: 978-3-319-98131-4

  • eBook Packages: Computer Science, Computer Science (R0)
