Sequential image encoding for vision-to-language problems

  • Jicheng Wang
  • Yuanen Zhou
  • Zhenzhen Hu
  • Xu Zhang
  • Meng Wang

Abstract

The combination of visual recognition and language understanding aims to build a shared space between heterogeneous visual and textual data, as in the tasks of image captioning and visual question answering (VQA). Most existing approaches convert an image into a semantic visual feature vector via deep convolutional neural networks (CNNs), while preserving the sequential property of text data and representing it with recurrent neural networks (RNNs). The key to analysing multi-source heterogeneous data is to construct the inherent correlations between the data. To reduce the heterogeneity gap between vision and language, in this work we represent the image sequentially, just like the text. We exploit the objects in the visual scene and convert the image into a sequence of detected objects and their locations, analogizing a sequence of objects (a visual language) to a sequence of words (natural language). We take the order of objects into account and evaluate different permutations and combinations of objects. Experimental results on image captioning and VQA benchmarks support our hypothesis that appropriately arranging the object sequence is beneficial for vision-to-language (V2L) problems.
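To make the idea of sequential image encoding concrete, the following is a minimal sketch, not the authors' exact model: detected objects (e.g., from Faster R-CNN) are ordered by one illustrative criterion, left-to-right by bounding-box x-coordinate, and their labels and boxes are encoded with an LSTM, treating objects as "visual words". The class ids, the `ObjectSequenceEncoder` module, and its dimensions are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical detector output: (class_id, bounding box (x, y, w, h)),
# e.g. as produced by Faster R-CNN on one image.
detections = [
    (3, (120.0, 40.0, 60.0, 90.0)),   # e.g. "person"
    (7, (10.0, 55.0, 45.0, 30.0)),    # e.g. "dog"
    (1, (200.0, 20.0, 80.0, 80.0)),   # e.g. "tree"
]

# One possible ordering of the object sequence: left-to-right by box x-coordinate.
# The paper evaluates different permutations; this is just one illustrative choice.
ordered = sorted(detections, key=lambda d: d[1][0])


class ObjectSequenceEncoder(nn.Module):
    """Encode a sequence of detected objects with an LSTM, analogous to a
    sentence encoder over words (a sketch under the assumptions stated above)."""

    def __init__(self, num_classes=100, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        # Each step sees the label embedding concatenated with the 4-d box.
        self.lstm = nn.LSTM(embed_dim + 4, hidden_dim, batch_first=True)

    def forward(self, labels, boxes):
        # labels: (1, T) long tensor, boxes: (1, T, 4) float tensor
        x = torch.cat([self.embed(labels), boxes], dim=-1)
        _, (h, _) = self.lstm(x)
        return h[-1]  # (1, hidden_dim) sequence-level image representation


labels = torch.tensor([[d[0] for d in ordered]])
boxes = torch.tensor([[d[1] for d in ordered]])
encoder = ObjectSequenceEncoder()
image_vec = encoder(labels, boxes)  # would feed a captioning/VQA decoder downstream
print(image_vec.shape)              # torch.Size([1, 128])
```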

Keywords

Image captioning · Visual question answering · Object detection


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Hefei University of Technology, Hefei, China
  2. Suzhou Vocational University, Suzhou, China
