Abstract
Using a natural language sentence to describe the content of an image is a challenging but important task. It is challenging because a description must not only capture the objects contained in the image and the relationships among them, but also be relevant and grammatically correct. In this paper, we propose a multi-modal embedding model based on gated recurrent units (GRU) that can generate variable-length descriptions for a given image. In the training step, we apply a convolutional neural network (CNN) to extract the image feature. The feature is then fed into the multi-modal GRU together with the corresponding sentence representations, and the multi-modal GRU learns the inter-modal relations between image and sentence. In the testing step, when an image is fed into our multi-modal GRU model, a sentence describing the image content is generated. Experimental results demonstrate that our multi-modal GRU model achieves state-of-the-art performance on the Flickr8K, Flickr30K and MS COCO datasets.
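To make the pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the CNN-encoder / GRU-decoder design the abstract describes, written in PyTorch. The VGG-16 backbone, the use of the image feature to initialize the GRU hidden state, and all names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class MultiModalGRUCaptioner(nn.Module):
    """CNN image encoder feeding a GRU sentence decoder (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CNN feature extractor (a pretrained VGG-16 here, as one plausible choice).
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(vgg.features), nn.Flatten())
        self.img_proj = nn.Linear(512 * 7 * 7, hidden_dim)   # image feature -> GRU state
        self.word_emb = nn.Embedding(vocab_size, embed_dim)  # sentence representation
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Training step: the image feature initializes the GRU hidden state,
        # and the GRU is driven by the (teacher-forced) caption words.
        h0 = self.img_proj(self.cnn(images)).unsqueeze(0)    # (1, B, hidden_dim)
        out, _ = self.gru(self.word_emb(captions), h0)       # (B, T, hidden_dim)
        return self.out(out)                                 # per-step vocabulary logits

    @torch.no_grad()
    def generate(self, image, start_id, end_id, max_len=20):
        # Testing step: feed the image, then emit words greedily until the end token.
        h = self.img_proj(self.cnn(image)).unsqueeze(0)
        word = torch.tensor([[start_id]], device=image.device)
        caption = []
        for _ in range(max_len):
            out, h = self.gru(self.word_emb(word), h)
            word = self.out(out[:, -1]).argmax(-1, keepdim=True)
            if word.item() == end_id:
                break
            caption.append(word.item())
        return caption

Training would minimize cross-entropy between the per-step logits and the next ground-truth word; at test time, generate() realizes the variable-length description the abstract mentions.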
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 61761130079, 61472413, and 61772510, in part by the Key Research Program of Frontier Sciences, CAS, under Grant QYZDY-SSW-JSC044, and in part by the Young Top-notch Talent Program of the Chinese Academy of Sciences under Grant QYZDB-SSW-JSC015.
Cite this article
Li, X., Yuan, A. & Lu, X. Multi-modal gated recurrent units for image description. Multimed Tools Appl 77, 29847–29869 (2018). https://doi.org/10.1007/s11042-018-5856-1