Multimedia Tools and Applications, Volume 78, Issue 1, pp 389–412

Towards learning a semantic-consistent subspace for cross-modal retrieval

  • Meixiang Xu
  • Zhenfeng Zhu
  • Yao Zhao
Article

Abstract

A great many approaches have been developed for cross-modal retrieval, among which subspace learning based ones dominate the landscape. Depending on whether semantic label information is used, subspace learning based approaches fall into two paradigms, unsupervised and supervised. However, for multi-label cross-modal retrieval, supervised approaches simply exploit multi-label information to learn a discriminative subspace, without considering the correlations between the multiple labels shared across modalities, which often leads to unsatisfactory retrieval performance. To address this issue, in this paper we propose a general framework that jointly incorporates semantic correlations into subspace learning for multi-label cross-modal retrieval. By introducing an HSIC-based regularization term, not only can the correlation information among multiple labels be leveraged, but the similarity consistency across modalities is also well preserved. Besides, based on the semantic-consistency projection, the semantic gap between the low-level feature space of each modality and the shared high-level semantic space can be bridged by a mid-level consistent one, in which multi-label cross-modal retrieval can be performed effectively and efficiently. To solve the optimization problem, an effective iterative algorithm is designed, together with a theoretical and experimental analysis of its convergence. Experimental results on real-world datasets show the superiority of the proposed method over several existing cross-modal subspace learning methods.
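
For readers unfamiliar with the HSIC-based regularization mentioned in the abstract, the following is a minimal sketch of the standard empirical Hilbert-Schmidt Independence Criterion between two data views, assuming linear kernels. It is illustrative only and does not reproduce the paper's actual regularizer or optimization; the names empirical_hsic, img_feat, and labels are hypothetical placeholders.

    import numpy as np

    def empirical_hsic(X, Y):
        # Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2,
        # where K and L are kernel matrices and H is the centering matrix.
        n = X.shape[0]
        K = X @ X.T                           # linear kernel over view 1 (assumption)
        L = Y @ Y.T                           # linear kernel over view 2 (assumption)
        H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
        return np.trace(K @ H @ L @ H) / (n - 1) ** 2

    # Toy usage: dependence between image-side features and a shared
    # multi-label matrix (sizes and data here are synthetic).
    rng = np.random.default_rng(0)
    img_feat = rng.standard_normal((50, 128))
    labels = rng.integers(0, 2, size=(50, 10)).astype(float)
    print(empirical_hsic(img_feat, labels))

A larger value indicates stronger statistical dependence between the two views, which is why maximizing such a term can encourage a learned subspace to stay consistent with the shared multi-label semantics.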

Keywords

Cross-modal · Semantic-correlation · Subspace learning · Multi-label

Notes

Acknowledgements

This work was jointly supported by the National Natural Science Foundation of China (No. 61572068, No. 61532005), the National Key Research and Development Program of China (No. 2016YFB0800404), and the Fundamental Research Funds for the Central Universities (No. 2018JBZ001).


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Institute of Information Science, Beijing Jiaotong University, Beijing, China
  2. Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
