Multimedia Tools and Applications, Volume 77, Issue 23, pp 31159–31175

Looking deeper and transferring attention for image captioning

  • Fang Fang
  • Hanli Wang (Email author)
  • Yihao Chen
  • Pengjie Tang


Image captioning is a challenging task that requires not only extracting semantic information from an image but also generating grammatically correct descriptions. Most previous work employs a one-layer or two-layer Recurrent Neural Network (RNN) as the language model to predict the words of a sentence. Such a language model may readily handle the word information for a noun or an object, but it may fail to learn a verb or an adjective. To address this issue, a deep attention-based language model is proposed to learn more abstract word information, and three stacked approaches are designed to process attention. The proposed model makes full use of the Long Short-Term Memory (LSTM) network and employs the transferred current attention to enhance extra spatial information. Experimental results on the benchmark MSCOCO and Flickr30K datasets verify the effectiveness of the proposed model.
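
The abstract does not spell out how attention is computed. As a rough illustration only, the sketch below shows generic soft attention of the kind commonly used in attention-based captioning: each image-region feature is scored against the language model's hidden state, the scores are normalized with a softmax, and the weighted sum of the regions gives a context vector. The dot-product scoring, the function names `softmax` and `soft_attention`, and the toy vectors are assumptions made for this example; it is not the stacked or transferred attention design proposed in the paper.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(
    regions: Sequence[Sequence[float]],
    hidden: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoding step.

    Assumption: relevance is a plain dot product between each region
    feature and the hidden state; real models usually learn this score.
    """
    scores = [sum(r * h for r, h in zip(region, hidden)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    # Context vector: attention-weighted sum of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, context


# Toy usage: two hypothetical 2-d region features and one hidden state.
toy_regions = [[1.0, 0.0], [0.0, 1.0]]
toy_hidden = [1.0, 0.0]
toy_weights, toy_context = soft_attention(toy_regions, toy_hidden)
print(toy_weights, toy_context)
```

In this toy case the first region aligns with the hidden state, so it receives the larger weight and dominates the context vector.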


Keywords: Image captioning, Attention, LSTM, Stacked attention



This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), the Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and the IBM Shared University Research Awards Program.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Fang Fang (1, 2, 3)
  • Hanli Wang (1, 2, 3), Email author
  • Yihao Chen (1, 2, 3)
  • Pengjie Tang (1, 2, 3)

  1. Department of Computer Science and Technology, Tongji University, Shanghai, China
  2. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, China
  3. Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing, Shanghai, China
