Topic-Based Image Caption Generation

  • Sandeep Kumar Dash
  • Shantanu Acharya
  • Partha Pakray
  • Ranjita Das
  • Alexander Gelbukh
Research Article - Computer Engineering and Computer Science


Image captioning is the task of generating a caption for a given image based on its content. Describing an image well requires extracting as much information from it as possible. Beyond detecting the objects present and their relative orientation, the topic of the image, i.e., the purpose it conveys, is another vital piece of information that can be incorporated into a model to improve a caption generation system. The aim is to put extra thrust on the context of the image, imitating the human approach: objects that are present but unrelated to the context of the image should not appear in the generated caption. This work focuses on detecting the topic of the image and using it to guide a novel deep learning-based encoder–decoder framework that generates the caption. The method is compared with earlier state-of-the-art models on the MSCOCO 2017 training data set. BLEU, CIDEr, ROUGE-L, and METEOR scores are used to measure the efficacy of the model and show an improvement in the performance of the caption generation process.
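The topic-detection step described above can be illustrated with a small sketch: fitting an LDA topic model (Blei et al., reference 1) over caption text with scikit-learn (reference 38) and inferring a topic distribution that could then condition an encoder–decoder captioner. The corpus, topic count, and variable names here are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of LDA-based topic detection over caption text.
# The toy corpus and n_components=2 are illustrative, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = [
    "a man riding a surfboard on a wave",
    "a surfer rides a large ocean wave",
    "a plate of pasta with tomato sauce",
    "a bowl of noodles topped with vegetables",
]

# Bag-of-words counts, with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(captions)

# Fit a two-topic LDA model on the caption corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Topic distribution for a new caption; each row is a probability
# vector that could be fed to the captioning decoder as guidance.
theta = lda.transform(vectorizer.transform(["a surfer on a wave"]))
print(theta.shape)  # (1, 2)
```

In a topic-guided captioner, a vector like `theta` would be combined with the image encoder's features before decoding, so the generated words stay consistent with the dominant topic.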


Image caption generation · Deep learning · Topic modelling
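Of the metrics reported in the abstract, BLEU (reference 42) is the simplest to reproduce. Below is a minimal sketch of sentence-level BLEU using NLTK (reference 37); the candidate and reference captions are made up for demonstration and do not come from the paper's experiments.

```python
# Illustrative sentence-level BLEU computation with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One reference caption (a list of token lists) and one candidate.
reference = [["a", "man", "riding", "a", "surfboard", "on", "a", "wave"]]
candidate = ["a", "man", "riding", "a", "wave", "on", "a", "surfboard"]

# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```

Corpus-level BLEU, as used in captioning benchmarks, aggregates n-gram counts over all test images rather than averaging per-sentence scores; CIDEr, ROUGE-L, and METEOR require their own reference implementations.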


  1. Blei, D.M.; Ng, A.Y.; Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
  2. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
  3. Lee, D.D.; Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
  4. Yang, Y.; Teo, C.L.; Daumé III, H.; Aloimonos, Y.: Corpus-guided sentence generation of natural images. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 444–454. Association for Computational Linguistics (2011)
  5. Kulkarni, G.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L.: Baby talk: understanding and generating image descriptions. In: Proceedings of the 24th CVPR, pp. 1601–1608 (2011)
  6. Mitchell, M.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; Daumé III, H.: Midge: generating image descriptions from computer vision detections. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 747–756 (2012)
  7. Oliva, A.; Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)
  8. Torralba, A.; Fergus, R.; Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1958–1970 (2008)
  9. Ordonez, V.; Kulkarni, G.; Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1143–1151 (2011)
  10. Dash, S.K.; Saha, S.; Pakray, P.; Gelbukh, A.: Generating image captions through multimodal embedding. J. Intell. Fuzzy Syst. 36(5), 4787–4796 (2019)
  11. Kiros, R.; Salakhutdinov, R.; Zemel, R.: Multimodal neural language models. In: Proceedings of the International Conference on Machine Learning, pp. 595–603 (2014)
  12. Zhu, Z.; Xue, Z.; Yuan, Z.: Topic-guided attention for image captioning. In: Proceedings of the 25th IEEE International Conference on Image Processing, pp. 2615–2619 (2018)
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L.: Microsoft COCO: common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755 (2014)
  14. Liu, F.; Ren, X.; Liu, Y.; Wang, H.; Sun, X.: simNet: stepwise image-topic merging network for generating detailed and comprehensive image captions. arXiv preprint arXiv:1808.08732 (2018)
  15. Ding, S.; Qu, S.; Xi, Y.; Wan, S.: Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing (2019)
  16. Gomez, L.; Patel, Y.; Rusiñol, M.; Karatzas, D.; Jawahar, C.V.: Self-supervised learning of visual features through embedding images into text topic spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4230–4239 (2017)
  17. Tsikrika, T.; Popescu, A.; Kludas, J.: Overview of the Wikipedia image retrieval task at ImageCLEF 2011. In: Proceedings of CLEF (2011)
  18. Dash, S.K.; Kumar, R.; Pakray, P.; Gelbukh, A.: Visually aligned text embedding model for identification of spatial roles. In: Proceedings of the 1st International Conference on Recent Trends on Electronics and Computer Science (ICRTECS 2019) (2019, accepted)
  19. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
  20. Li, W.; Liu, X.; Liu, J.; Chen, P.; Wan, S.; Cui, X.: On improving the accuracy with auto-encoder on conjunctivitis. Appl. Soft Comput. 81, 105489 (2019)
  21. Blei, D.M.; Jordan, M.I.: Modeling annotated data. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127–134 (2003)
  22. Wang, Y.; Mori, G.: Max-margin latent Dirichlet allocation for image classification and annotation. In: Proceedings of BMVC 2(6), 7 (2011)
  23. Rasiwasia, N.; Vasconcelos, N.: Latent Dirichlet allocation models for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2665–2679 (2013)
  24. Putthividhy, D.; Attias, H.T.; Nagarajan, S.S.: Topic regression multi-modal latent Dirichlet allocation for image annotation. In: Proceedings of CVPR (2010)
  25. Yu, N.; Hu, X.; Song, B.; Yang, J.; Zhang, J.: Topic-oriented image captioning based on order-embedding. IEEE Trans. Image Process. 28(6), 2743–2754 (2018)
  26. Vendrov, I.; Kiros, R.; Fidler, S.; Urtasun, R.: Order-embeddings of images and language. In: Proceedings of ICLR (2016)
  27. Mao, Y.; Zhou, C.; Wang, X.; Li, R.: Show and tell more: topic-oriented multi-sentence image captioning. In: Proceedings of IJCAI, pp. 4258–4264 (2018)
  28. Horn, R.A.: The Hadamard product. Proc. Symp. Appl. Math. 40, 87–169 (1990)
  29. Zhou, C.; Mao, Y.; Wang, X.: Topic-specific image caption generation. In: Proceedings of Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pp. 321–332 (2017)
  30. Ding, S.; Qu, S.; Xi, Y.; Sangaiah, A.K.; Wan, S.: Image caption generation with high-level image features. Pattern Recognit. Lett. 123, 89–95 (2019)
  31. Gan, Z.; Gan, C.; He, X.; Pu, Y.; Tran, K.; Gao, J.; Carin, L.; Deng, L.: Semantic compositional networks for visual captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1141–1150 (2017)
  32. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)
  33. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
  34. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z.: Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567 (2015)
  35. Simonyan, K.; Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  36. Karpathy, A.; Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
  37. Bird, S.; Klein, E.; Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media Inc., Newton (2009)
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
  39. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
  40. Pennington, J.; Socher, R.; Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
  41. Kingma, D.P.; Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  42. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
  43. Vedantam, R.; Zitnick, C.L.; Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
  44. Denkowski, M.; Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
  45. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of Text Summarization Branches Out: ACL-04 Workshop, p. 8 (2004)
  46. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
  47. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, pp. 2048–2057 (2015)

Copyright information

© King Fahd University of Petroleum & Minerals 2019

Authors and Affiliations

  1. Department of CSE, NIT Mizoram, Aizawl, India
  2. Department of CSE, NIT Silchar, Silchar, India
  3. CIC, IPN, Mexico City, Mexico