Abstract
Using a natural language sentence to describe the content of an image is a challenging but important task. It is challenging because a description must not only capture the objects contained in the image and the relationships among them, but also be relevant and grammatically correct. In this paper, we propose a multi-modal embedding model based on gated recurrent units (GRU) that can generate variable-length descriptions for a given image. In the training step, we apply a convolutional neural network (CNN) to extract the image feature. The feature is then fed into the multi-modal GRU together with the corresponding sentence representations, and the multi-modal GRU learns the inter-modal relations between image and sentence. In the testing step, when an image is fed into our multi-modal GRU model, a sentence describing the image content is generated. Experimental results demonstrate that our multi-modal GRU model achieves state-of-the-art performance on the Flickr8K, Flickr30K and MS COCO datasets.
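To make the pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the CNN-encoder / GRU-decoder design the abstract describes, written in PyTorch. The VGG-16 backbone, the use of the image feature to initialize the GRU hidden state, and all names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class MultiModalGRUCaptioner(nn.Module):
    """CNN image encoder feeding a GRU sentence decoder (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CNN feature extractor (a pretrained VGG-16 here, as one plausible choice).
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(vgg.features), nn.Flatten())
        self.img_proj = nn.Linear(512 * 7 * 7, hidden_dim)   # image feature -> GRU state
        self.word_emb = nn.Embedding(vocab_size, embed_dim)  # sentence representation
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Training step: the image feature initializes the GRU hidden state,
        # and the GRU is driven by the (teacher-forced) caption words.
        h0 = self.img_proj(self.cnn(images)).unsqueeze(0)    # (1, B, hidden_dim)
        out, _ = self.gru(self.word_emb(captions), h0)       # (B, T, hidden_dim)
        return self.out(out)                                 # per-step vocabulary logits

    @torch.no_grad()
    def generate(self, image, start_id, end_id, max_len=20):
        # Testing step: feed the image, then emit words greedily until the end token.
        h = self.img_proj(self.cnn(image)).unsqueeze(0)
        word = torch.tensor([[start_id]], device=image.device)
        caption = []
        for _ in range(max_len):
            out, h = self.gru(self.word_emb(word), h)
            word = self.out(out[:, -1]).argmax(-1, keepdim=True)
            if word.item() == end_id:
                break
            caption.append(word.item())
        return caption

Training would minimize cross-entropy between the per-step logits and the next ground-truth word; at test time, generate() realizes the variable-length description the abstract mentions.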
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 61761130079, 61472413, and 61772510, in part by the Key Research Program of Frontier Sciences, CAS, under Grant QYZDY-SSW-JSC044, and in part by the Young Top-notch Talent Program of the Chinese Academy of Sciences under Grant QYZDB-SSW-JSC015.
Cite this article
Li, X., Yuan, A. & Lu, X. Multi-modal gated recurrent units for image description. Multimed Tools Appl 77, 29847–29869 (2018). https://doi.org/10.1007/s11042-018-5856-1