Multimedia Tools and Applications, Volume 77, Issue 3, pp 3353–3368

Latent semantic factorization for multimedia representation learning

  • Hong Zhang
  • Yu Huang
  • Xin Xu
  • Ziqi Zhu
  • Chunhua Deng


Due to the rapid development of multimedia applications, cross-media semantics learning is becoming increasingly important. One of the most challenging issues in cross-media semantics understanding is how to mine the semantic correlation between different modalities. Most traditional multimedia semantics analysis approaches are based on unimodal data and neglect the semantic consistency between different modalities. In this paper, we propose a novel multimedia representation learning framework via latent semantic factorization (LSF). First, the posterior probability under the learned classifiers serves as the latent semantic representation for each modality. Second, we derive the semantic representation of a multimedia document, which consists of an image and text, by latent semantic factorization. Third, two projection matrices are learned to project images and text into a common semantic space that is more similar to that of the multimedia document. Experiments on three real-world datasets for cross-media retrieval demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.
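The three steps named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the objective function or the optimization procedure, so everything concrete here is an assumption made for illustration: softmax regression stands in for the learned classifiers, a truncated SVD of the concatenated posteriors stands in for the latent semantic factorization, ridge regression stands in for learning the two projection matrices, and the data are synthetic. All function and variable names are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_classifier(X, y, n_classes, lr=0.5, n_iter=300):
    # Assumed classifier: softmax regression trained by gradient descent.
    Y = np.eye(n_classes)[y]                      # one-hot labels
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(n_iter):
        W -= lr * X.T @ (softmax(X @ W) - Y) / len(X)
    return W

def joint_factorize(P_img, P_txt, rank):
    # Assumed factorization: shared factor V with [P_img, P_txt] ~= V @ [A, B],
    # taken from the truncated SVD of the concatenated posteriors.
    U, s, _ = np.linalg.svd(np.hstack([P_img, P_txt]), full_matrices=False)
    return U[:, :rank] * s[:rank]

def fit_projection(P, V, lam=1e-3):
    # Ridge regression: W minimising ||P W - V||^2 + lam * ||W||^2.
    return np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ V)

def rank_by_cosine(query, database, eps=1e-12):
    # Indices of database rows sorted by decreasing cosine similarity.
    q = query / (np.linalg.norm(query) + eps)
    D = database / (np.linalg.norm(database, axis=1, keepdims=True) + eps)
    return np.argsort(-(D @ q))

# Synthetic paired image/text features (hypothetical toy data).
rng = np.random.default_rng(0)
n, c = 60, 3
y = rng.integers(0, c, size=n)
X_img = rng.normal(size=(c, 8))[y] + 0.3 * rng.normal(size=(n, 8))
X_txt = rng.normal(size=(c, 5))[y] + 0.3 * rng.normal(size=(n, 5))

# Step 1: class posteriors as the per-modality semantic representation.
P_img = softmax(X_img @ fit_softmax_classifier(X_img, y, c))
P_txt = softmax(X_txt @ fit_softmax_classifier(X_txt, y, c))

# Step 2: shared representation of each image-text document.
V = joint_factorize(P_img, P_txt, rank=c)

# Step 3: one projection matrix per modality into the shared space.
W_img = fit_projection(P_img, V)
W_txt = fit_projection(P_txt, V)

# Cross-modal retrieval: rank all texts for the first image query.
ranking = rank_by_cosine(P_img[0] @ W_img, P_txt @ W_txt)
print(ranking[:5])
```

The sketch only shows how the three stages connect: posteriors feed the factorization, and the factorization supplies the regression targets for the projections. It makes no claim about the retrieval accuracy reported in the paper.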


Keywords: Posterior probability; Latent semantic factorization; Cross-modal retrieval



This research is supported by the National Natural Science Foundation of China (No. 61373109, No. 61602349), the Hubei Chengguang Talented Youth Development Foundation (No. 2015B22), the Natural Science Foundation of Hubei Province (No. ZRMS2016000155), and the Science and Technology Research Project of the Hubei Provincial Department of Education (No. Q20161113).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Hong Zhang (1, 2)
  • Yu Huang (1, 2)
  • Xin Xu (1, 2)
  • Ziqi Zhu (1, 2)
  • Chunhua Deng (1, 2)

  1. College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan, China
  2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
