Abstract
Despite recent progress, generating natural-language descriptions for images remains a challenging task. Most state-of-the-art methods apply an existing deep convolutional neural network (CNN) to extract a visual representation of the entire image, and then exploit the parallel structure between images and sentences with a recurrent neural network. An inherent drawback of this approach is that the model may attend only to a partial view of a visual element, or to a conglomeration of several concepts. In this paper, we present a fine-grained attention model built on a deep recurrent architecture that combines recent advances in computer vision and machine translation. The model contains three sub-networks: a deep recurrent neural network for sentences, a deep convolutional network for images, and a region proposal network that yields region proposals at nearly no extra cost. The model automatically learns to fix its gaze on salient region proposals, and the process of generating the next word, given the previously generated ones, is aligned with this visual perception. We validate the proposed model on three benchmark datasets (Flickr8k, Flickr30k, and MS COCO), and the experimental results confirm its effectiveness.
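To make the attention mechanism concrete, the following is a minimal sketch (not the authors' code) of soft attention over region-proposal features: at each decoding step, the language model's hidden state scores each region feature, and the resulting context vector conditions next-word prediction. All shapes, weight matrices, and parameter names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, H = 8, 512, 256                   # regions, region-feature dim, hidden dim
regions = rng.standard_normal((K, D))   # CNN features of K region proposals
h = rng.standard_normal(H)              # current RNN hidden state

# Learned projections (random here, purely for illustration)
W_r = rng.standard_normal((D, H)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
v = rng.standard_normal(H) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Additive (Bahdanau-style) attention: one score per region proposal
scores = np.tanh(regions @ W_r + h @ W_h) @ v   # shape (K,)
alpha = softmax(scores)                          # attention weights over regions

# Context vector: expected region feature under the attention distribution;
# this vector would condition the RNN's next-word prediction.
context = alpha @ regions                        # shape (D,)
```

In a full model, `context` is concatenated with (or added to) the decoder input at each time step, so each generated word is grounded in the region proposals the model is currently attending to.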
Acknowledgments
This research is supported by the Science Foundation of The China (Xi'an) Institute for Silk Road Research (2016SY10, 2016SY18), the Scientific Research Program of the Shaanxi Provincial Department of Education (2013JK1141), and the Research Foundation of XAUFE (15XCK14).
Cite this article
Chang, YS. Fine-grained attention for image caption generation. Multimed Tools Appl 77, 2959–2971 (2018). https://doi.org/10.1007/s11042-017-4593-1