Multimedia Tools and Applications, Volume 78, Issue 24, pp 35329–35350

An image caption method based on object detection

  • Danyang Cao
  • Menggui Zhu
  • Lei Gao


How to represent image information effectively is key to the image caption task. A large number of image caption methods have been proposed, and most of them use the global information of the image, so information that is irrelevant to caption generation also participates in the computation, wasting resources. To solve this problem, this paper proposes a method for generating image captions based on object detection. First, an object detection algorithm extracts image features, so that only the features of meaningful regions in the image are used; an image caption is then generated by combining a spatial attention mechanism with the caption generation network. Experiments show that the image features of the object regions and salient regions are sufficient to represent the information of the entire image in the image caption task. For better convergence of the model, this paper also uses a new training strategy. The experimental results show that the proposed model performs well on the image caption test dataset and, to a large extent, sets a precedent for this new technique.
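The combination of detected region features with spatial attention described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the additive (Bahdanau-style) attention form, the dimensions, and the randomly initialised weights are all assumptions made for the example; in the paper these weights would be learned jointly with the caption generation network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: k detected regions, each a d-dim feature vector
k, d, h_dim, att_dim = 5, 8, 8, 6

regions = rng.standard_normal((k, d))   # region features from an object detector
h = rng.standard_normal(h_dim)          # decoder (caption network) hidden state

# Illustrative, randomly initialised projection weights
W_v = rng.standard_normal((att_dim, d))
W_h = rng.standard_normal((att_dim, h_dim))
w = rng.standard_normal(att_dim)

# Additive spatial attention: score each region against the hidden state
scores = np.tanh(regions @ W_v.T + h @ W_h.T) @ w   # shape (k,)

# Softmax over regions gives the spatial attention weights
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                                 # weights sum to 1

# Context vector: attention-weighted sum of region features,
# fed to the caption generator at each decoding step
context = alpha @ regions                            # shape (d,)

print(alpha.round(3), context.shape)
```

Because attention is computed only over the k detected regions rather than a dense grid over the whole image, features irrelevant to the caption never enter the weighted sum.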


Keywords: Image caption · Attention mechanism · Object detection · Deep learning



This work was supported by the Yuyou Talent Support Plan of North China University of Technology (107051360019XN132/017), the Fundamental Research Funds for Beijing Universities (110052971803/037), and the Special Research Foundation of North China University of Technology (PXM2017_014212_000014).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Information Science and Technology, North China University of Technology, Beijing, China
  2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing, China
