Topic-Based Image Caption Generation

  • Sandeep Kumar Dash
  • Shantanu Acharya
  • Partha Pakray
  • Ranjita Das
  • Alexander Gelbukh
Research Article - Computer Engineering and Computer Science


Image captioning is the task of generating a caption for a given image based on its content. Describing an image well requires extracting as much information from it as possible. Beyond detecting the objects present and their relative orientation, the topic of the image, i.e., the purpose it conveys, is another vital piece of information that can be incorporated into a model to improve a caption generation system. The aim is to put extra thrust on the context of the image, imitating the human approach: objects that are present but unrelated to the context of the image should not appear in the generated caption. This work focuses on detecting the topic of the image and using it to guide a novel deep learning-based encoder–decoder framework that generates the caption. The method is compared with earlier state-of-the-art models on the MSCOCO 2017 training data set. BLEU, CIDEr, ROUGE-L, and METEOR scores are used to measure the efficacy of the model and show an improvement in the performance of the caption generation process.
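The topic-detection step described above can be illustrated with a small sketch: fitting an LDA topic model (Blei et al., reference 1) over caption text with scikit-learn (reference 38) and inferring a topic distribution that could then condition an encoder–decoder captioner. The corpus, topic count, and variable names here are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of LDA-based topic detection over caption text.
# The toy corpus and n_components=2 are illustrative, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = [
    "a man riding a surfboard on a wave",
    "a surfer rides a large ocean wave",
    "a plate of pasta with tomato sauce",
    "a bowl of noodles topped with vegetables",
]

# Bag-of-words counts, with English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(captions)

# Fit a two-topic LDA model on the caption corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Topic distribution for a new caption; each row is a probability
# vector that could be fed to the captioning decoder as guidance.
theta = lda.transform(vectorizer.transform(["a surfer on a wave"]))
print(theta.shape)  # (1, 2)
```

In a topic-guided captioner, a vector like `theta` would be combined with the image encoder's features before decoding, so the generated words stay consistent with the dominant topic.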


Image caption generation · Deep learning · Topic modelling
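Of the metrics reported in the abstract, BLEU (reference 42) is the simplest to reproduce. Below is a minimal sketch of sentence-level BLEU using NLTK (reference 37); the candidate and reference captions are made up for demonstration and do not come from the paper's experiments.

```python
# Illustrative sentence-level BLEU computation with NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One reference caption (a list of token lists) and one candidate.
reference = [["a", "man", "riding", "a", "surfboard", "on", "a", "wave"]]
candidate = ["a", "man", "riding", "a", "wave", "on", "a", "surfboard"]

# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```

Corpus-level BLEU, as used in captioning benchmarks, aggregates n-gram counts over all test images rather than averaging per-sentence scores; CIDEr, ROUGE-L, and METEOR require their own reference implementations.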


  1. Blei, D.M.; Ng, A.Y.; Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
  2. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
  3. Lee, D.D.; Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
  4. Yang, Y.; Teo, C.L.; Daumé III, H.; Aloimonos, Y.: Corpus-guided sentence generation of natural images. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 444–454. Association for Computational Linguistics (2011)
  5. Kulkarni, G.; Premraj, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L.: Baby talk: understanding and generating image descriptions. In: Proceedings of the 24th CVPR, pp. 1601–1608 (2011)
  6. Mitchell, M.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; Daumé III, H.: Midge: generating image descriptions from computer vision detections. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 747–756 (2012)
  7. Oliva, A.; Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)
  8. Torralba, A.; Fergus, R.; Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 30(11), 1958–1970 (2008)
  9. Ordonez, V.; Kulkarni, G.; Berg, T.L.: Im2Text: describing images using 1 million captioned photographs. In: Proceedings of Advances in Neural Information Processing Systems, pp. 1143–1151 (2011)
  10. Dash, S.K.; Saha, S.; Pakray, P.; Gelbukh, A.: Generating image captions through multimodal embedding. J. Intell. Fuzzy Syst. 36(5), 4787–4796 (2019)
  11. Kiros, R.; Salakhutdinov, R.; Zemel, R.: Multimodal neural language models. In: Proceedings of the International Conference on Machine Learning, pp. 595–603 (2014)
  12. Zhu, Z.; Xue, Z.; Yuan, Z.: Topic-guided attention for image captioning. In: Proceedings of the 25th IEEE International Conference on Image Processing, pp. 2615–2619 (2018)
  13. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L.: Microsoft COCO: common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755 (2014)
  14. Liu, F.; Ren, X.; Liu, Y.; Wang, H.; Sun, X.: simNet: stepwise image-topic merging network for generating detailed and comprehensive image captions. arXiv preprint arXiv:1808.08732 (2018)
  15. Ding, S.; Qu, S.; Xi, Y.; Wan, S.: Stimulus-driven and concept-driven analysis for image caption generation. Neurocomputing (2019)
  16. Gomez, L.; Patel, Y.; Rusiñol, M.; Karatzas, D.; Jawahar, C.V.: Self-supervised learning of visual features through embedding images into text topic spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4230–4239 (2017)
  17. Tsikrika, T.; Popescu, A.; Kludas, J.: Overview of the Wikipedia image retrieval task at ImageCLEF 2011. In: Proceedings of CLEF (2011)
  18. Dash, S.K.; Kumar, R.; Pakray, P.; Gelbukh, A.: Visually aligned text embedding model for identification of spatial roles. In: Proceedings of the 1st International Conference on Recent Trends on Electronics and Computer Science (ICRTECS 2019) (2019, accepted)
  19. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
  20. Li, W.; Liu, X.; Liu, J.; Chen, P.; Wan, S.; Cui, X.: On improving the accuracy with auto-encoder on conjunctivitis. Appl. Soft Comput. 81, 105489 (2019)
  21. Blei, D.M.; Jordan, M.I.: Modeling annotated data. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 127–134 (2003)
  22. Wang, Y.; Mori, G.: Max-margin latent Dirichlet allocation for image classification and annotation. In: Proceedings of BMVC 2(6), 7 (2011)
  23. Rasiwasia, N.; Vasconcelos, N.: Latent Dirichlet allocation models for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2665–2679 (2013)
  24. Putthividhy, D.; Attias, H.T.; Nagarajan, S.S.: Topic regression multi-modal latent Dirichlet allocation for image annotation. In: Proceedings of CVPR (2010)
  25. Yu, N.; Hu, X.; Song, B.; Yang, J.; Zhang, J.: Topic-oriented image captioning based on order-embedding. IEEE Trans. Image Process. 28(6), 2743–2754 (2018)
  26. Vendrov, I.; Kiros, R.; Fidler, S.; Urtasun, R.: Order-embeddings of images and language. In: Proceedings of ICLR (2016)
  27. Mao, Y.; Zhou, C.; Wang, X.; Li, R.: Show and tell more: topic-oriented multi-sentence image captioning. In: Proceedings of IJCAI, pp. 4258–4264 (2018)
  28. Horn, R.A.: The Hadamard product. Proc. Symp. Appl. Math. 40, 87–169 (1990)
  29. Zhou, C.; Mao, Y.; Wang, X.: Topic-specific image caption generation. In: Proceedings of Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pp. 321–332 (2017)
  30. Ding, S.; Qu, S.; Xi, Y.; Sangaiah, A.K.; Wan, S.: Image caption generation with high-level image features. Pattern Recognit. Lett. 123, 89–95 (2019)
  31. Gan, Z.; Gan, C.; He, X.; Pu, Y.; Tran, K.; Gao, J.; Carin, L.; Deng, L.: Semantic compositional networks for visual captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1141–1150 (2017)
  32. Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1–2), 177–196 (2001)
  33. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
  34. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z.: Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567 (2015)
  35. Simonyan, K.; Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  36. Karpathy, A.; Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
  37. Bird, S.; Klein, E.; Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media Inc., Newton (2009)
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
  39. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
  40. Pennington, J.; Socher, R.; Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
  41. Kingma, D.P.; Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  42. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
  43. Vedantam, R.; Zitnick, C.L.; Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
  44. Denkowski, M.; Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
  45. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of Text Summarization Branches Out: ACL-04 Workshop, p. 8 (2004)
  46. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
  47. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the International Conference on Machine Learning, pp. 2048–2057 (2015)

Copyright information

© King Fahd University of Petroleum & Minerals 2019

Authors and Affiliations

  1. Department of CSE, NIT Mizoram, Aizawl, India
  2. Department of CSE, NIT Silchar, Silchar, India
  3. CIC, IPN, Mexico City, Mexico