
Fine-grained attention for image caption generation

Multimedia Tools and Applications

Abstract

Despite recent progress, generating natural language descriptions for images remains a challenging task. Most state-of-the-art methods apply an existing deep convolutional neural network (CNN) to extract a visual representation of the entire image and then exploit the parallel structure between images and sentences with a recurrent neural network. An inherent drawback of this design is that the model may attend to only a partial view of a visual element, or to a conglomeration of several concepts. In this paper, we present a fine-grained attention model built on a deep recurrent architecture that combines recent advances in computer vision and machine translation. The model contains three sub-networks: a deep recurrent neural network for sentences, a deep convolutional network for images, and a region proposal network that supplies nearly cost-free region proposals. The model automatically learns to fix its gaze on salient region proposals, and the process of generating the next word, given the previously generated ones, is aligned with this visual perception. Experiments on three benchmark datasets (Flickr 8K, Flickr 30K and MS COCO) confirm the effectiveness of the proposed approach.
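
The abstract describes a decoder that attends over region proposals while generating each word. The sketch below, in PyTorch, shows one way such an attention-over-regions decoder could be wired up. It is not the authors' implementation: region features are assumed to be precomputed (e.g. CNN features pooled over RPN proposals, Faster R-CNN style), and all dimensions, names, and the additive-attention formulation are illustrative assumptions.

```python
# Illustrative sketch only: attention over precomputed region-proposal features
# feeding an LSTM caption decoder. Dimensions and the attention form are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionAttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, region_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM input at each step: previous word embedding + attended region feature.
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        # Additive (Bahdanau-style) attention over region proposals.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, regions, h):
        # regions: (B, R, region_dim), h: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)          # attention weights over proposals
        return (alpha * regions).sum(dim=1)       # (B, region_dim) context vector

    def forward(self, regions, captions):
        # regions: (B, R, region_dim); captions: (B, T) token ids (teacher forcing)
        B, T = captions.shape
        h = regions.new_zeros(B, self.lstm.hidden_size)
        c = regions.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T - 1):
            context = self.attend(regions, h)     # "fix gaze" on salient proposals
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))            # predict the next word
        return torch.stack(logits, dim=1)         # (B, T-1, vocab_size)
```

At inference time, the same `attend` step would be run once per generated word, so the distribution over proposals can shift as the sentence unfolds.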


Notes

  1. http://cs.stanford.edu/people/karpathy/deepimagesent/.

  2. https://github.com/tylin/coco-caption/ (a usage sketch of this evaluation toolkit follows below).
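
Note 2 points to the coco-caption evaluation code used for benchmark scoring. As a rough illustration, the snippet below shows how scorers from that repository are typically invoked; the package layout (`pycocoevalcap`) and the dict-of-lists input format come from the repository itself, and the captions are invented placeholders.

```python
# Illustrative use of the coco-caption scorers (tylin/coco-caption).
# The example captions below are made up; image-id keys must match in both dicts.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Ground-truth references and generated results, keyed by image id.
gts = {0: ["a brown dog is running on the grass", "a dog runs across a lawn"]}
res = {0: ["a dog is running on the grass"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(gts, res)
print("BLEU:", bleu_scores, "CIDEr:", cider_score)
```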


Acknowledgments

This research is supported by the Science Foundation of the China (Xi’an) Institute for Silk Road Research (2016SY10, 2016SY18), the scientific research program of the Shaanxi Provincial Department of Education (2013JK1141), and the Research Foundation of XAUFE (15XCK14).

Author information

Corresponding author

Correspondence to Yan-Shuo Chang.


About this article

Cite this article

Chang, YS. Fine-grained attention for image caption generation. Multimed Tools Appl 77, 2959–2971 (2018). https://doi.org/10.1007/s11042-017-4593-1

