Topic sensitive image descriptions

Abstract

The objective of description models is to generate captions that elaborate the contents of images. Despite recent advances in machine learning and computer vision, generating discriminative captions remains a challenging problem. Traditional approaches imitate frequent language patterns without considering the semantic alignment of words. In this work, an image captioning framework is proposed that generates topic-sensitive descriptions. The model captures the semantic relations and the polysemous nature of the words that describe the images and consequently generates superior descriptions for the target images. The efficacy of the proposed model is demonstrated by evaluation on state-of-the-art captioning datasets, where it shows promising performance compared to the description models proposed in the recent literature.
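To illustrate the core idea behind topic-sensitive word representations, the sketch below is a minimal toy example, not the authors' actual model: a word vector is conditioned on a topic posterior inferred from surrounding context words, so a polysemous word such as "bank" receives a different representation in a finance context than in a nature context. The embeddings and topic-word probabilities here are invented values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["bank", "river", "money", "water", "loan"]
word_idx = {w: i for i, w in enumerate(vocab)}

# Hypothetical pretrained word embeddings (e.g. word2vec/GloVe), dimension 4.
E = rng.normal(size=(len(vocab), 4))

# Hypothetical topic-word distributions, as produced by a topic model (e.g. LDA):
# rows = topics ("finance", "nature"), columns = vocabulary words.
phi = np.array([
    [0.30, 0.02, 0.35, 0.03, 0.30],   # topic 0: finance
    [0.25, 0.35, 0.02, 0.35, 0.03],   # topic 1: nature
])

def topic_posterior(word, context):
    """P(topic | word, context), proportional to P(word|topic) * prod P(c|topic)."""
    score = phi[:, word_idx[word]].copy()
    for c in context:
        score *= phi[:, word_idx[c]]
    return score / score.sum()

def topic_sensitive_vector(word, context):
    """Concatenate the static word embedding with the context-dependent topic
    posterior, yielding a distinct vector per sense of a polysemous word."""
    return np.concatenate([E[word_idx[word]], topic_posterior(word, context)])

# "bank" gets different representations in different topical contexts.
v_finance = topic_sensitive_vector("bank", ["money", "loan"])
v_nature = topic_sensitive_vector("bank", ["river", "water"])
```

Here `v_finance` and `v_nature` differ only in their topic-posterior components, which is enough for a downstream caption decoder to distinguish word senses; a full captioning system would instead feed such representations into a learned language model.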




Author information


Corresponding author

Correspondence to Abdul Ghafoor.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Zia, U., Riaz, M.M., Ghafoor, A. et al. Topic sensitive image descriptions. Neural Comput & Applic 32, 10471–10479 (2020). https://doi.org/10.1007/s00521-019-04587-x

Keywords

  • Image caption
  • Topic model
  • Convolutional
  • Deep network
  • Language model