aiTPR: Attribute Interaction-Tensor Product Representation for Image Caption

Abstract

Region-level visual features enhance the generative capability of feature-based captioning models. However, they lack proper interaction-based attentional perception and end up producing biased or uncorrelated sentences or pieces of misinformation. In this work, we propose the Attribute Interaction-Tensor Product Representation (aiTPR), a convenient way of gathering more information through orthogonal combination, learning the interactions as physical entities (tensors), and thereby improving the captions. In previous works, features are added together into undefined feature spaces; by contrast, TPR keeps the combinations well defined, and orthogonality helps define familiar spaces. We introduce a new concept layer that defines the objects and their interactions, which can play a crucial role in determining different descriptions. The interaction components contribute heavily to better caption quality, and the approach outperforms various previous works in this domain on the MSCOCO dataset. For the first time, we introduce the notion of combining regional image features with an abstracted interaction likelihood embedding for image captioning.
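The contrast drawn in the abstract between adding features and combining them through orthogonal tensor products can be illustrated with a minimal sketch of generic tensor product binding. This is not the aiTPR architecture itself, which is not described in this preview. The function names, the dimensions, and the random "filler" vectors standing in for region or attribute features are all assumptions made for illustration.

```python
import numpy as np


def orthonormal_roles(num_roles: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return `num_roles` orthonormal role vectors of length `dim`, one per row."""
    if num_roles > dim:
        raise ValueError("cannot have more orthonormal roles than dimensions")
    rng = np.random.default_rng(seed)
    # Reduced QR of a random Gaussian matrix yields orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((dim, num_roles)))
    return q.T


def bind(fillers: np.ndarray, roles: np.ndarray) -> np.ndarray:
    """Generic TPR binding: the sum over i of outer(filler_i, role_i)."""
    return np.einsum("if,ir->fr", fillers, roles)


def unbind(tpr: np.ndarray, role: np.ndarray) -> np.ndarray:
    """Recover the filler bound to `role`; exact when the roles are orthonormal."""
    return tpr @ role


# Hypothetical stand-ins for three region/attribute feature vectors (8-dim each).
rng = np.random.default_rng(1)
fillers = rng.standard_normal((3, 8))
roles = orthonormal_roles(num_roles=3, dim=5)

tpr = bind(fillers, roles)          # a single (8, 5) tensor holding all bindings
recovered = unbind(tpr, roles[1])   # the second filler, recovered from the tensor

# Plain summation, by comparison, gives one 8-dim vector from which the
# individual fillers cannot be separated again.
summed = fillers.sum(axis=0)
```

Because the role vectors are orthonormal, multiplying the bound tensor by one role cancels every other binding and returns that role's filler, which is one way to read the abstract's remark that orthogonality keeps the combined space well defined.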

Acknowledgements

The author made extensive use of the University of Florida HiperGator, equipped with NVIDIA Tesla K80 GPUs, for the experiments. The author acknowledges University of Florida Research Computing for providing computational resources and support that have contributed to the research results reported in this publication (http://researchcomputing.ufl.edu).

Funding

This research work was not funded by any agency.

Author information

Corresponding author

Correspondence to Chiranjib Sur.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Sur, C. aiTPR: Attribute Interaction-Tensor Product Representation for Image Caption. Neural Process Lett (2021). https://doi.org/10.1007/s11063-021-10438-5

Keywords

  • Language modeling
  • Representation learning
  • Tensor product representation
  • Image description
  • Sequence generation
  • Image understanding
  • Automated textual feature extraction