Multimedia Tools and Applications, Volume 78, Issue 3, pp 2921–2935

Question action relevance and editing for visual question answering

  • Andeep S. Toor
  • Harry Wechsler
  • Michele Nappi


Visual Question Answering (VQA) expands on the Turing Test, as it involves the ability to answer questions about visual content. Current efforts in VQA, however, still do not fully consider whether a question about visual content is relevant and, if it is not, how best to edit it to make it answerable. Question relevance has so far been considered only at the level of a whole question, using binary classification and without the capability to edit a question to make it grounded and intelligible. The only exception is our prior research on question part relevance, which allows for relevance determination and editing based on object nouns. This paper extends that work on object relevance to determine the relevance of a question's action and leverages this capability to edit an irrelevant question to make it relevant. Practical applications of such a capability include answering biometric-related queries across a set of images, including queries about people and their actions (behavioral biometrics). The feasibility of our approach is shown using Context-Collaborative VQA (C2VQA) Action/Relevance/Edit (ARE). Our results show that the proposed approach outperforms all other models on the novel tasks of question action relevance (QAR) and question action editing (QAE) by a significant margin. The ultimate goal for future research is to address full-fledged W5+ inquiries (What, Where, When, Why, Who, and How) that are grounded in and reference video, using both nouns and verbs in a collaborative, context-aware fashion.


Computer vision; Visual question answering; Deep learning; Action recognition; Image understanding; Question relevance



We appreciate assistance from George Mason University, which provided access to GPU-based servers. These experiments were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA. (URL:



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Andeep S. Toor (1)
  • Harry Wechsler (1)
  • Michele Nappi (2)
  1. Department of Computer Science, George Mason University, Fairfax, USA
  2. Dipartimento di Informatica, Università di Salerno, Fisciano, Italy
