Multimedia Tools and Applications, Volume 77, Issue 17, pp 22247–22266

Cross-media retrieval with collective deep semantic learning

  • Bin Zhang
  • Lei Zhu
  • Jiande Sun
  • Huaxiang Zhang


Abstract

Cross-media retrieval is becoming a new trend in information retrieval and has received great attention from both academia and industry. In this paper, we propose an effective retrieval method, dubbed Cross-media Retrieval with Collective Deep Semantic Learning (CR-CDSL), to solve this problem. Two complementary deep neural networks are first learned to collectively project image and text samples into a joint semantic representation. Based on this representation, weak semantic labels are then generated for unlabeled images and texts. These weakly labeled samples are further exploited, together with the pre-labeled training samples, to retrain the retrieval model, which discovers a discriminative shared semantic space for cross-media retrieval. Specifically, Deep Restricted Boltzmann Machines (DRBM) are employed to initialize the weights of the two deep neural networks. With the weak labels generated by collective deep semantic learning, the discriminative capability of the retrieval model is enhanced, and its retrieval performance is thus improved. Experiments are conducted on several publicly available cross-media datasets, and the results demonstrate the superior performance of the proposed approach compared with several state-of-the-art techniques.
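To make the retrieval setting concrete, the sketch below shows the core idea the abstract describes: two modality-specific networks project image and text features into a joint semantic space, where cross-media retrieval reduces to nearest-neighbor ranking. This is a minimal conceptual illustration, not the authors' CR-CDSL implementation: the layer sizes, random weights (standing in for DRBM-based pre-training), and toy data are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights):
    """Project a feature vector through a small MLP with tanh activations."""
    h = x
    for W in weights:
        h = np.tanh(h @ W)
    return h

def init_weights(dims):
    """Random initialization; in CR-CDSL this role is played by DRBM pre-training."""
    return [rng.normal(0, 0.1, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]

# Illustrative dimensionalities: 128-d image features and 64-d text features,
# both projected into a 10-d shared semantic space.
img_net = init_weights([128, 32, 10])
txt_net = init_weights([64, 32, 10])

# Toy "database" of 5 text samples and one image query (random stand-ins).
texts = rng.normal(size=(5, 64))
query_img = rng.normal(size=128)

# Project both modalities into the joint semantic space.
txt_emb = np.array([mlp_forward(t, txt_net) for t in texts])
img_emb = mlp_forward(query_img, img_net)

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Image-to-text retrieval: rank texts by similarity to the image query.
scores = np.array([cosine(img_emb, t) for t in txt_emb])
ranking = np.argsort(-scores)  # indices of texts, best match first
print(ranking)
```

In the full method, the two projection networks would additionally be retrained with weak semantic labels assigned to unlabeled samples, sharpening the shared space; the ranking step itself stays the same.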


Keywords: Cross-media retrieval · Collective deep semantic learning · Deep neural network · Deep Restricted Boltzmann Machines



Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. 61572298, 61772322, 61601268), the Key Research and Development Foundation of Shandong Province (No. 2016GGX101009) and the Natural Science Foundation of Shandong, China (No. 2017GGX10117). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Information Science and Engineering, Shandong Normal University, Jinan, China
  2. Institute of Data Science and Technology, Shandong Normal University, Jinan, China
