VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering

Bolaños, Marc; Peris, Álvaro; Casacuberta, Francisco; Radeva, Petia

doi:10.1007/978-3-319-58838-4_41

Marc Bolaños^16,17,
Álvaro Peris¹⁸,
Francisco Casacuberta¹⁸ &
…
Petia Radeva^16,17

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 10255))

Included in the following conference series:

Iberian Conference on Pattern Recognition and Image Analysis

1834 Accesses
2 Citations

Abstract

In this paper, we address the problem of visual question answering by proposing a novel model, called VIBIKNet. Our model is based on integrating Kernelized Convolutional Neural Networks and Long-Short Term Memory units to generate an answer given a question about an image. We prove that VIBIKNet is an optimal trade-off between accuracy and computational load, in terms of memory and time consumption. We validate our method on the VQA challenge dataset and compare it to the top performing methods in order to illustrate its performance and speed.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
https://github.com/MarcBS/VIBIKNet.
2.
The VQA Challenge leaderboard is available at http://visualqa.org/roe.html.

References

Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh., D.: VQA: visual question answering. In: ICCV, pp. 2425–2433 (2015)
Google Scholar
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: data collection and evaluation server. arXiv:1504.00325 (2015)
Cheng, G., Zhou, P., Han, J.: RIFD-CNN: rotation-invariant and fisher discriminative convolutional neural networks for object detection. In: CVPR, pp. 2884–2893 (2016)
Google Scholar
Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv:1606.01847 (2016)
Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
Article Google Scholar
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Article Google Scholar
Kim, J.-H., Lee, S.-W., Kwak, D.-H., Heo, M.-O., Kim, J., Ha, J.-W., Zhang, B.-T.: Multimodal residual learning for visual QA. arXiv:1606.01455 (2016)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
Liu, Z.: Kernelized deep convolutional neural network for describing complex images. arXiv:1509.04581 (2015)
Nam, H., Ha, J.-W., Kim, J.: Dual attention networks for multimodal reasoning and matching. arXiv:1611.00471 (2016)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
Google Scholar
Peris, Á., Bolaños, M., Radeva, P., Casacuberta, F.: Video description using bidirectional recurrent neural networks. In: Villa, A.E.P., Masulli, P., Pons Rivero, A.J. (eds.) ICANN 2016. LNCS, vol. 9887, pp. 3–11. Springer, Cham (2016). doi:10.1007/978-3-319-44781-0_1
Chapter Google Scholar
Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15561-1_11
Chapter Google Scholar
Sivic, J., Zisserman, A.: Efficient visual search of videos cast as text retrieval. PAMI 31(4), 591–606 (2009)
Article Google Scholar
Specia, L., Frank, S., Sima’an, K., Elliott, D.: A shared task on multimodal machine translation and crosslingual image description. In: Proceedings of the First Conference on Machine Translation, pp. 543–553. ACL (2016)
Google Scholar
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS, vol. 27, pp. 3104–3112 (2014)
Google Scholar
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)
Google Scholar
Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. arXiv:1502.03044 (2015)
Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). doi:10.1007/978-3-319-10602-1_26
Google Scholar

Download references

Acknowledgments

This work was partially funded by TIN2015-66951-C2-1-R, SGR 1219, CERCA Programme/Generalitat de Catalunya, CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER), PrometeoII/2014/030 and R-MIPRCV. P. Radeva is partially supported by ICREA Academia2014. We acknowledge NVIDIA Corporation for the donation of a GPU used in this work.

Author information

Authors and Affiliations

Universitat de Barcelona, Barcelona, Spain
Marc Bolaños & Petia Radeva
Computer Vision Center, Bellaterra, Spain
Marc Bolaños & Petia Radeva
PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
Álvaro Peris & Francisco Casacuberta

Authors

Marc Bolaños
View author publications
You can also search for this author in PubMed Google Scholar
Álvaro Peris
View author publications
You can also search for this author in PubMed Google Scholar
Francisco Casacuberta
View author publications
You can also search for this author in PubMed Google Scholar
Petia Radeva
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Marc Bolaños .

Editor information

Editors and Affiliations

Universidade da Beira Interior , Covilhã, Portugal
Luís A. Alexandre
University Jaume I , Castellón, Spain
José Salvador Sánchez
University of the Algarve , Faro, Portugal
João M. F. Rodrigues

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bolaños, M., Peris, Á., Casacuberta, F., Radeva, P. (2017). VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering. In: Alexandre, L., Salvador Sánchez, J., Rodrigues, J. (eds) Pattern Recognition and Image Analysis. IbPRIA 2017. Lecture Notes in Computer Science(), vol 10255. Springer, Cham. https://doi.org/10.1007/978-3-319-58838-4_41

Download citation

DOI: https://doi.org/10.1007/978-3-319-58838-4_41
Published: 12 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58837-7
Online ISBN: 978-3-319-58838-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics