MoQA – A Multi-modal Question Answering Architecture

  • Monica Haurilet (Email author)
  • Ziad Al-Halah
  • Rainer Stiefelhagen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)

Abstract

Multi-Modal Machine Comprehension (M3C) deals with extracting knowledge from multiple modalities such as figures, diagrams, and text. In particular, Textbook Question Answering (TQA) focuses on questions drawn from school curricula, where the text and diagrams are extracted from textbooks. A subset of these questions cannot be answered from diagrams alone, but requires external knowledge from the surrounding text. In this work, we propose a novel deep model that handles different knowledge modalities in the context of the question answering task. We compare three information representations encountered in TQA: a visual representation learned from images, a graph representation of diagrams, and a language-based representation learned from the accompanying text. We evaluate our model on the TQA dataset, which contains text and diagrams from sixth-grade material. Although our model achieves results competitive with the state of the art, a significant gap to human performance remains. We discuss the shortcomings of the model and explain the reasons behind this gap by analyzing the distribution of the different classes of mistakes the model makes.
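The pipeline the abstract describes, encoding a question together with a modality-specific context and then scoring candidate answers against the fused representation, can be sketched as a toy in Python. The mean-pooled bag-of-words encoder, the additive fusion, and the dot-product scorer below are illustrative assumptions for exposition only, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # shared embedding size (illustrative assumption)


def encode(tokens, table):
    """Mean-pool per-token embeddings from a lookup table (toy encoder)."""
    vecs = [table.setdefault(t, rng.standard_normal(DIM)) for t in tokens]
    return np.mean(vecs, axis=0)


def score_answers(question, answers, context, table):
    """Fuse question and context vectors, then score each candidate answer.

    The context stands in for any one of the three modality representations
    (visual, diagram graph, or surrounding text) after its own encoder.
    """
    q = encode(question.split(), table)
    c = encode(context.split(), table)
    fused = (q + c) / 2.0  # simple additive fusion (assumption)
    scores = [float(fused @ encode(a.split(), table)) for a in answers]
    return int(np.argmax(scores)), scores


table = {}
idx, scores = score_answers(
    "what does a leaf produce",
    ["a leaf produces sugar", "rocks erode slowly"],
    "the leaf produces sugar during photosynthesis",
    table,
)
```

In the real model, each modality would have a dedicated encoder (e.g., a CNN for images, a graph network for diagrams, a sentence encoder for text) projecting into a shared space; this sketch collapses all of that into one shared embedding table to keep the scoring logic visible.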

Supplementary material

Supplementary material 1: 478824_1_En_9_MOESM1_ESM.pdf (903 KB)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Monica Haurilet¹ (Email author)
  • Ziad Al-Halah¹
  • Rainer Stiefelhagen¹

  1. Karlsruhe Institute of Technology, Karlsruhe, Germany