Quantifying the Amount of Visual Information Used by Neural Caption Generators

  • Marc Tanti
  • Albert Gatt
  • Kenneth P. Camilleri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


This paper addresses the sensitivity of neural image caption generators to their visual input. A sensitivity analysis and an omission analysis based on image foils are reported, showing that the extent to which image captioning architectures retain, and are sensitive to, visual information varies with the type of word being generated and with its position in the caption as a whole. We motivate this work in the context of the field's broader goal of achieving more explainable AI.
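The omission analysis described above can be illustrated with a toy sketch: replace the true image features with foil features and measure how much the probability of a target caption word drops. The caption step below (a linear projection over concatenated image and word features followed by a softmax) is a hypothetical stand-in, not the paper's actual architecture; `omission_score` is likewise an illustrative name.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a logit vector
    e = np.exp(x - x.max())
    return e / e.sum()

def word_probs(image_feat, word_emb, W):
    # Toy caption-generation step: concatenate image features with the
    # previous word's embedding and project to vocabulary probabilities.
    return softmax(W @ np.concatenate([image_feat, word_emb]))

def omission_score(image_feat, foil_feat, word_emb, W, word_idx):
    # Drop in the target word's probability when the true image features
    # are replaced by foil features; larger values mean the generator
    # relied more on the visual input for this word.
    p_true = word_probs(image_feat, word_emb, W)[word_idx]
    p_foil = word_probs(foil_feat, word_emb, W)[word_idx]
    return p_true - p_foil
```

Under this sketch, a word whose logit depends only on linguistic context yields a score near zero, while a visually grounded word yields a positive score when its supporting image features are omitted.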


Keywords: Image captioning · Sensitivity analysis · Explainable AI



The research in this paper is partially funded by the Endeavour Scholarship Scheme (Malta). Scholarships are part-financed by the European Union - European Social Fund (ESF) - Operational Programme II Cohesion Policy 2014–2020 Investing in human capital to create more opportunities and promote the well-being of society.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Marc Tanti (1)
  • Albert Gatt (1)
  • Kenneth P. Camilleri (1)
  1. University of Malta, Msida, Malta
