An angle-based method for measuring the semantic similarity between visual and textual features

Tang, Chenwei; Lv, Jiancheng; Chen, Yao; Guo, Jixiang

doi:10.1007/s00500-018-3051-y

Methodologies and Application
Published: 06 February 2018

Soft Computing, volume 23, pages 4041–4050 (2019)


Abstract

The main challenge in most image–text tasks, such as zero-shot learning, is how to measure the semantic similarity between visual and textual feature vectors. The common solution is to map the image and text feature vectors into a Hilbert space and then rank similarity by the inner product between them. In this paper, we learn feature representations of images and their sentence descriptions with separate deep neural networks in order to capture the inner-modal correspondences between visual and language data. We then use a joint embedding structure based on angle calculation to measure the semantic similarity between visual and textual features. In the proposed method, a constant factor b keeps the similarities of positive and negative samples a certain distance apart. Because the proposed cosine similarity method involves both normalization and vector computation, we also develop a learning algorithm for the neural networks that express the semantic features of texts and images. We apply the angle-based method to the challenging Caltech-UCSD Birds and Oxford-102 Flowers datasets. The experiments demonstrate good performance on both recognition and retrieval tasks.
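The idea of scoring image–text pairs by angle rather than by raw inner product, with a constant b separating positive from negative pairs, can be illustrated with a small sketch. The abstract does not state the exact objective, so the hinge form below, the default value b = 0.2, and the names `cosine_similarity` and `margin_loss` are assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two feature vectors (eps avoids division by zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def margin_loss(image_vec, pos_text_vec, neg_text_vec, b=0.2):
    """Hypothetical hinge penalty: zero once the matching pair's cosine
    similarity exceeds the mismatched pair's by at least the margin b."""
    s_pos = cosine_similarity(image_vec, pos_text_vec)
    s_neg = cosine_similarity(image_vec, neg_text_vec)
    return max(0.0, b - s_pos + s_neg)

# Toy embeddings: the matching text points the same way as the image
# (but is longer), the mismatched text is orthogonal to it.
image = np.array([1.0, 0.0])
matching_text = np.array([2.0, 0.0])
mismatched_text = np.array([0.0, 3.0])

print(cosine_similarity(image, matching_text))
print(margin_loss(image, matching_text, mismatched_text))
```

Because the score depends only on direction, rescaling either embedding leaves the similarity unchanged, which is one practical difference from ranking by the unnormalized inner product.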

Acknowledgements

This work is supported by the National Key R&D Program of China under contract No. 2017YFB1002201, by the National Natural Science Fund for Distinguished Young Scholars (Grant No. 61625204), and in part by the State Key Program of the National Science Foundation of China (Grant Nos. 61432012 and 61432014).

Author information

Authors and Affiliations

Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu, 610065, People’s Republic of China
Chenwei Tang, Jiancheng Lv, Yao Chen & Jixiang Guo

Corresponding author

Correspondence to Jiancheng Lv.

Ethics declarations

Conflict of interest

We declare that we have no conflict of interest.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Tang, C., Lv, J., Chen, Y. et al. An angle-based method for measuring the semantic similarity between visual and textual features. Soft Comput 23, 4041–4050 (2019). https://doi.org/10.1007/s00500-018-3051-y

Published: 06 February 2018
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s00500-018-3051-y
