Generation of Image Caption Using CNN-LSTM Based Approach

  • S. Aravindkumar
  • P. Varalakshmi
  • M. Hemalatha
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 940)


Image captioning is gaining attention due to recent developments in deep neural architectures. However, the gap between semantic concepts and visual features remains a major challenge in image caption generation. In this paper, we develop a method that uses both visual features and semantic features for caption generation. We briefly discuss the various architectures used for visual feature extraction and the use of Long Short-Term Memory (LSTM) for caption generation. An object recognition model has been developed to identify semantic tags in the images. These tags are encoded along with the visual features for the captioning task. We have developed an Encoder-Decoder architecture that uses the semantic details along with the language model for caption generation. We evaluated our model on standard datasets such as Flickr8k, Flickr30k and MSCOCO using standard metrics such as BLEU and METEOR.
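As a rough illustration of encoding semantic tags along with visual features, the following minimal sketch assumes a multi-hot tag vector concatenated with a CNN feature vector. The abstract does not specify the encoding scheme, so the tag vocabulary, the function names (encode_tags, fuse_features) and the concatenation step are hypothetical choices made only for this example, not the authors' implementation.

```python
from typing import List, Sequence

# Hypothetical tag vocabulary; the paper's actual tag set is not given here.
TAG_VOCAB = ["dog", "person", "ball", "grass", "car"]


def encode_tags(tags: Sequence[str], vocab: Sequence[str] = TAG_VOCAB) -> List[float]:
    """Multi-hot encoding: 1.0 where a vocabulary tag was detected, else 0.0."""
    detected = set(tags)
    return [1.0 if tag in detected else 0.0 for tag in vocab]


def fuse_features(visual: Sequence[float], tags: Sequence[str]) -> List[float]:
    """Concatenate visual features with the multi-hot tag vector (an assumed fusion)."""
    return list(visual) + encode_tags(tags)


# Stand-in for a CNN feature vector; real ones are far larger.
visual_features = [0.12, 0.80, 0.05]
fused = fuse_features(visual_features, ["dog", "ball"])
print(len(fused), fused)
```

In an encoder-decoder setup, a fused vector of this kind could be used to initialise or condition the LSTM decoder. Tags outside the vocabulary are simply ignored in this sketch.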


Keywords: LSTM, CNN, Caption, Semantic tags



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computer Science and Engineering, Jeppiaar SRR Engineering College, Chennai, India
  2. Department of Computer Technology, Anna University, MIT Campus, Chennai, India
  3. Department of Information Technology, Anna University, MIT Campus, Chennai, India
