Abstract
Despite recent progress, generating natural-language descriptions for images remains a challenging task. Most state-of-the-art methods apply an existing deep convolutional neural network (CNN) to extract a visual representation of the entire image, and then exploit the parallel structure between images and sentences with a recurrent neural network. An inherent drawback of this approach is that the model may attend only to a partial view of a visual element, or to a conglomeration of several concepts. In this paper, we present a fine-grained attention model built on a deep recurrent architecture that combines recent advances in computer vision and machine translation. The model contains three sub-networks: a deep recurrent neural network for sentences, a deep convolutional network for images, and a region proposal network that yields region proposals at nearly no extra cost. The model automatically learns to fix its gaze on salient region proposals, and the process of generating the next word, given the previously generated ones, is aligned with this visual perception. We validate the proposed model on three benchmark datasets (Flickr8k, Flickr30k, and MS COCO), and the experimental results confirm its effectiveness.
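To make the attention mechanism concrete, the following is a minimal sketch (not the authors' code) of soft attention over region-proposal features: at each decoding step, the language model's hidden state scores each region feature, and the resulting context vector conditions next-word prediction. All shapes, weight matrices, and parameter names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, H = 8, 512, 256                   # regions, region-feature dim, hidden dim
regions = rng.standard_normal((K, D))   # CNN features of K region proposals
h = rng.standard_normal(H)              # current RNN hidden state

# Learned projections (random here, purely for illustration)
W_r = rng.standard_normal((D, H)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
v = rng.standard_normal(H) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Additive (Bahdanau-style) attention: one score per region proposal
scores = np.tanh(regions @ W_r + h @ W_h) @ v   # shape (K,)
alpha = softmax(scores)                          # attention weights over regions

# Context vector: expected region feature under the attention distribution;
# this vector would condition the RNN's next-word prediction.
context = alpha @ regions                        # shape (D,)
```

In a full model, `context` is concatenated with (or added to) the decoder input at each time step, so each generated word is grounded in the region proposals the model is currently attending to.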
Acknowledgments
This research is supported by the Science Foundation of The China (Xi'an) Institute for Silk Road Research (2016SY10, 2016SY18), the Scientific Research Program of the Shaanxi Provincial Department of Education (2013JK1141), and the Research Foundation of XAUFE (15XCK14).
Cite this article
Chang, YS. Fine-grained attention for image caption generation. Multimed Tools Appl 77, 2959–2971 (2018). https://doi.org/10.1007/s11042-017-4593-1