Improving Visual Reasoning with Attention Alignment

  • Komal SharanEmail author
  • Ashwinkumar Ganesan
  • Tim Oates
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11844)


Since attention mechanisms were introduced, they have become an important component of neural network architectures. This is because they mimic how humans reason about visual stimuli by focusing on important parts of the input. In visual tasks like image captioning and visual question answering (VQA), networks can generate the correct answer or a comprehensible caption despite attending to wrong part of an image or text. This lack of synchronization between human and network attention hinders the model’s ability to generalize. To improve human-like reasoning capabilities of the model, it is necessary to align what the network and a human will focus on, given the same input. We propose a mechanism to correct visual attention in the network by explicitly training the model to learn the salient parts of an image available in the VQA-HAT dataset. The results show an improvement in the visual question answering task across different types of questions.


Neural attention Neural networks Visual Question Answering (VQA) Supervised learning Artificial intelligence 


  1. 1.
    Antol, S., et al.: Vqa: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)Google Scholar
  2. 2.
    Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention (2014)Google Scholar
  3. 3.
    Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2014)Google Scholar
  4. 4.
    Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  5. 5.
    Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)CrossRefGoogle Scholar
  6. 6.
    Cerf, M., Frady, E.P., Koch, C.: Faces and text attract gaze independent of the task: experimental data and computer model. J. Vis. 9(12), 10 (2009)CrossRefGoogle Scholar
  7. 7.
    Chaudhari, S., Polatkan, G., Ramanath, R., Mithal, V.: An attentive survey of attention models (2019)Google Scholar
  8. 8.
    Das, A., Agrawal, H., Zitnick, L., Parikh, D., Batra, D.: Human attention in visual question answering: do humans and deep networks look at the same regions? Comput. Vis. Image Underst. 163, 90–100 (2017)CrossRefGoogle Scholar
  9. 9.
    Kim, J.H., On, K.W., Lim, W., Kim, J., Ha, J.W., Zhang, B.T.: Hadamard product for low-rank bilinear pooling (2016)Google Scholar
  10. 10.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  11. 11.
    Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances In Neural Information Processing Systems, pp. 289–297 (2016)Google Scholar
  12. 12.
    Papadopoulos, D.P., Clarke, A.D.F., Keller, F., Ferrari, V.: Training object class detectors from eye tracking data. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 361–376. Springer, Cham (2014). Scholar
  13. 13.
    Park, D.H., Hendricks, L.A., Akata, Z., Schiele, B., Darrell, T., Rohrbach, M.: Attentive explanations: Justifying decisions and pointing to the evidence (2016)Google Scholar
  14. 14.
    Qiao, T., Dong, J., Xu, D.: Exploring human-like attention supervision in visual question answering. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)Google Scholar
  15. 15.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  16. 16.
    TingAnChien: san-vqa-tensorflow. (2016)
  17. 17.
    Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)Google Scholar
  18. 18.
    Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)Google Scholar
  19. 19.
    Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29 (2016)Google Scholar
  20. 20.
    Yu, Y., Choi, J., Kim, Y., Yoo, K., Lee, S.H., Kim, G.: Supervising neural attention models for video captioning by human gaze data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 490–498 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.University Of Maryland Baltimore County (UMBC)BaltimoreUSA

Personalised recommendations