Multimedia Tools and Applications

Volume 77, Issue 3, pp 2959–2971

Fine-grained attention for image caption generation



Despite recent progress, generating natural language descriptions for images remains a challenging task. Most state-of-the-art methods apply existing deep convolutional neural network (CNN) models to extract a visual representation of the entire image, on top of which the parallel structure between images and sentences is exploited using recurrent neural networks. These approaches have an inherent drawback: the model may attend to a partial view of a visual element or to a conglomeration of several concepts. In this paper, we present a fine-grained attention model built on a deep recurrent architecture that combines recent advances in computer vision and machine translation. The model contains three sub-networks: a deep recurrent neural network for sentences, a deep convolutional network for images, and a region proposal network that supplies nearly cost-free region proposals. Our model automatically learns to fix its gaze on salient region proposals, and the process of generating the next word, given the previously generated ones, is aligned with this visual perception experience. We validate the proposed model on three benchmark datasets (Flickr 8K, Flickr 30K and MS COCO); the experimental results confirm its effectiveness.
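The abstract describes attention over region proposals rather than over a single whole-image feature: at each decoding step, the RNN state is scored against per-region CNN features, and the resulting weighted visual context guides the next word. The sketch below illustrates one common way such region-level (additive) attention can be computed; the function names, weight matrices (`W_r`, `W_h`, `w_a`) and shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(region_feats, hidden, W_r, W_h, w_a):
    """Score each region proposal against the decoder state and
    return an attention-weighted visual context vector.

    region_feats: (R, D) CNN features of R region proposals
    hidden:       (H,)   current RNN hidden state
    W_r, W_h:     projection matrices into a shared attention space
    w_a:          scoring vector producing one score per region
    """
    # Additive attention: project regions and hidden state, combine, score.
    scores = np.tanh(region_feats @ W_r + hidden @ W_h) @ w_a  # (R,)
    alpha = softmax(scores)            # attention weights over regions
    context = alpha @ region_feats     # (D,) attended visual context
    return context, alpha

# Toy example with random features and weights.
rng = np.random.default_rng(0)
R, D, H, A = 5, 8, 6, 4                # regions, feature, hidden, attention dims
feats = rng.normal(size=(R, D))
h = rng.normal(size=H)
context, alpha = attend(feats, h,
                        rng.normal(size=(D, A)),
                        rng.normal(size=(H, A)),
                        rng.normal(size=A))
```

The weights `alpha` form a distribution over proposals, so the decoder receives a context vector dominated by the regions most relevant to the word being generated, rather than a fixed whole-image summary.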


Keywords: Fine-grained attention · Image caption generation · Attention generation



The research is supported by the Science Foundation of the China (Xi'an) Institute for Silk Road Research (2016SY10, 2016SY18), the Scientific Research Program of the Shaanxi Provincial Department of Education (2013JK1141), and the Research Foundation of XAUFE (15XCK14).


Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. China (Xi'an) Institute for Silk Road Research, Xi'an, China
  2. School of Information, Xi'an University of Finance and Economics, Xi'an, China