Topic sensitive image descriptions

  • Usman Zia
  • M. Mohsin Riaz
  • Abdul Ghafoor (corresponding author)
  • Syed Sohaib Ali
Original Article

Abstract

Description models aim to generate captions that elaborate the content of an image. Despite recent advances in machine learning and computer vision, generating discriminative captions remains a challenging problem. Traditional approaches imitate frequent language patterns without considering the semantic alignment of words. In this work, an image captioning framework is proposed that generates topic-sensitive descriptions. The model captures the semantic relations and the polysemous nature of the words that describe the images and consequently generates superior descriptions for the target images. The efficacy of the proposed model is demonstrated by evaluation on state-of-the-art captioning datasets, where it shows promising performance compared to existing description models in the recent literature.
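As a rough illustration of the core idea only (not the authors' implementation, whose details are not given in this abstract), a topic-sensitive word representation can be formed by conditioning a word's base embedding on a topic distribution, so that a polysemous word receives different vectors under different topic mixtures. All names, dimensions, and data below are hypothetical:

```python
import numpy as np

# Hypothetical setup: 3 topics, 4-dimensional word embeddings.
rng = np.random.default_rng(0)
vocab = ["bank", "river", "money", "play"]
embeddings = {w: rng.normal(size=4) for w in vocab}          # base word vectors
topic_word = rng.dirichlet(np.ones(len(vocab)), size=3)      # rows ~ p(word | topic)

def topic_sensitive(word, topic_probs):
    """Scale the base embedding by how strongly the given topic mixture
    supports the word -- a toy stand-in for polysemy-aware representations."""
    idx = vocab.index(word)
    support = float(topic_probs @ topic_word[:, idx])  # weight under the mixture
    return support * embeddings[word]

# "bank" under two different topic mixtures yields two different vectors,
# mimicking how a topic-sensitive model separates word senses.
v_a = topic_sensitive("bank", np.array([0.8, 0.1, 0.1]))
v_b = topic_sensitive("bank", np.array([0.1, 0.8, 0.1]))
print(np.allclose(v_a, v_b))  # in general the two vectors differ
```

In a full captioning pipeline, such topic-conditioned vectors would feed the language model in place of static embeddings, which is one plausible way to realize the "semantic relation and polysemous nature" the abstract refers to.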

Keywords

Image caption · Topic model · Convolutional deep network · Language model


Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  • Usman Zia¹
  • M. Mohsin Riaz²
  • Abdul Ghafoor¹ (corresponding author)
  • Syed Sohaib Ali²

  1. National University of Sciences and Technology, Islamabad, Pakistan
  2. COMSATS University, Islamabad, Pakistan