
International Journal of Computer Vision, Volume 127, Issue 1, pp 38–60

Combining Multiple Cues for Visual Madlibs Question Answering

  • Tatiana Tommasi
  • Arun Mallya
  • Bryan Plummer
  • Svetlana Lazebnik
  • Alexander C. Berg
  • Tamara L. Berg

Abstract

This paper presents an approach for answering fill-in-the-blank multiple-choice questions from the Visual Madlibs dataset. Instead of the generic, commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with the candidate answers, into a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn how to combine the scores of nCCA models trained on multiple cues and select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering a wide range of question types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.
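
To make the pipeline concrete, the sketch below illustrates the two scoring steps described above: ranking candidate answers in a joint embedding space learned with normalized CCA, and blending the per-cue scores with learned weights. This is a minimal illustration under our own assumptions, not the authors' implementation; the function names, the eigenvalue-scaling power p, and the pre-computed projection matrices and feature vectors are all hypothetical.

    import numpy as np

    def ncca_scores(img_feat, cand_feats, Wx, Wy, corrs, p=4):
        """Rank candidate answers against an image in a joint nCCA space.

        Both views are projected with CCA matrices (Wx for the image cue,
        Wy for the answer text), each joint dimension is scaled by its
        canonical correlation raised to the power p, and candidates are
        scored by cosine similarity. All inputs are assumed pre-computed.
        """
        scale = corrs ** p
        u = (img_feat @ Wx) * scale                  # image in joint space
        V = (cand_feats @ Wy) * scale                # one row per candidate
        u /= np.linalg.norm(u) + 1e-12
        V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
        return V @ u                                 # cosine score per candidate

    def pick_answer(per_cue_scores, weights):
        """Combine per-cue score vectors with learned nonnegative weights.

        per_cue_scores: (num_cues, num_candidates) array, one row per cue
        (e.g. scene, activity, attributes, localized region); weights:
        (num_cues,) vector learned by the optimization mentioned above.
        Returns the index of the highest-scoring candidate answer.
        """
        return int(np.argmax(weights @ per_cue_scores))

In this reading, each specialized network contributes one row of per_cue_scores through its own nCCA model, and the multiple-choice answer is simply the argmax of the weighted sum of cue scores.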

Keywords

Visual question answering · Cue integration · Region-phrase correspondence · Computer vision · Language

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grants 1302438, 1563727, 1405822, 1444234, 1562098, 1633295, 1452851, Xerox UAC, Microsoft Research Faculty Fellowship, and the Sloan Foundation Fellowship. T.T. was partially supported by the ERC Grant 637076 - RoboExNovo.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Italian Institute of Technology, Milan, Italy
  2. University of Illinois at Urbana-Champaign, Urbana, USA
  3. University of North Carolina at Chapel Hill, Chapel Hill, USA
