Multimedia Tools and Applications, Volume 77, Issue 17, pp 22455–22473

Cross-media retrieval based on semi-supervised regularization and correlation learning

  • Hong Zhang
  • Gang Dai
  • Du Tang
  • Xin Xu


As large-scale multimedia data in heterogeneous spaces floods the Internet, cross-media retrieval is becoming increasingly important. In cross-media retrieval, users submit a query of any media type and retrieve results containing various types of media. However, most existing cross-media retrieval methods are restricted to retrieval between two types of media, which ignores the semantic consistency of different media data. In addition, although some methods consider the similarity between data of the same semantic category in different media, they neglect the dissimilarity between data of different semantic categories in different media. To solve these problems, we propose a novel feature learning algorithm for cross-media retrieval, called semi-supervised regularization and correlation learning (SSRCL), which is capable of modeling multiple types of media simultaneously. More importantly, SSRCL considers both semantic category similarity and dissimilarity, and utilizes both labeled and unlabeled data to learn the projection matrices for different media types. Experimental results on two widely used datasets show that the proposed approach outperforms four state-of-the-art methods.
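The abstract does not give the SSRCL objective, so the learning step is not reproduced here. As a minimal sketch of the retrieval step only, assuming one projection matrix per media type has already been learned, the following Python example maps features of different media into a shared space and ranks database items against a query. The function names, the toy feature vectors, the hand-picked matrices, and the use of cosine similarity are all illustrative assumptions, not taken from the paper.

```python
import math


def project(P, x):
    """Map feature vector x (length d) into the common space.

    P is a d-by-c matrix stored as a list of d rows, so the result has length c.
    """
    return [sum(x[i] * P[i][j] for i in range(len(x))) for j in range(len(P[0]))]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def retrieve(query, query_media, database, projections, top_k=3):
    """Rank database items of any media type against a query of any media type.

    database is a list of (item_id, media_type, feature_vector) tuples.
    """
    q = project(projections[query_media], query)
    scored = [
        (cosine(q, project(projections[media], x)), item_id, media)
        for item_id, media, x in database
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(item_id, media) for _, item_id, media in scored[:top_k]]


# Hypothetical, hand-picked projection matrices (NOT learned by SSRCL).
# Each maps a media-specific feature space into a 2-d common space.
projections = {
    "image": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    "text": [[0.0, 1.0], [1.0, 0.0]],
    "audio": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
}

# Toy database mixing three media types with different feature dimensions.
database = [
    ("img_a", "image", [0.8, 0.2, 0.5]),
    ("txt_a", "text", [0.0, 1.0]),
    ("aud_b", "audio", [0.0, 1.0, 0.0, 1.0]),
]

# A text query returns both text and image items from the shared space.
results = retrieve([0.1, 0.9], "text", database, projections, top_k=2)
print(results)
```

Because every media type is compared in the same space, adding a further media type in this sketch only requires one more projection matrix rather than a new pairwise model.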


Cross-media retrieval; Common space; Semi-supervised regularization; Correlation learning; Semantic level



This research is supported by the National Natural Science Foundation of China (No. 61373109, No. 61602349) and the Educational Research Project from the Educational Commission of Hubei Province (2016234).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan, China
  2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
  3. FIRST, BNP Paribas, London, UK
