Latent semantic factorization for multimedia representation learning

Zhang, Hong; Huang, Yu; Xu, Xin; Zhu, Ziqi; Deng, Chunhua

doi:10.1007/s11042-017-5135-6

Published: 30 August 2017

Volume 77, pages 3353–3368, (2018)

Multimedia Tools and Applications

Hong Zhang^1,2,
Yu Huang^1,2,
Xin Xu^1,2,
Ziqi Zhu^1,2 &
Chunhua Deng^1,2

Abstract

Cross-media semantics learning has become increasingly important with the rapid development of multimedia applications. One of the most challenging problems in cross-media semantic understanding is how to mine the semantic correlation between different modalities. Most traditional multimedia semantic analysis approaches are designed for unimodal data and neglect the semantic consistency between modalities. In this paper, we propose a novel multimedia representation learning framework based on latent semantic factorization (LSF). First, the posterior probability under the learned classifiers serves as the latent semantic representation for each modality. Second, we derive the semantic representation of a multimedia document, which consists of an image and text, by latent semantic factorization. Finally, two projection matrices are learned to project images and text into a common semantic space that is more similar to the multimedia document representation. Experiments on three real-world datasets for cross-media retrieval demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.
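The three steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' formulation: the abstract does not give the objective function, so the sketch assumes a regularized joint factorization solved by alternating least squares, followed by ridge regression for the two projection matrices. All names (`softmax`, `objective`, `factorize_shared`, `ridge_projection`), the random toy data, and the use of random linear scores as a stand-in for learned classifiers are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax: turns classifier scores into posterior probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def objective(S_img, S_txt, V, U_img, U_txt, lam):
    """Assumed loss: two reconstruction errors plus Frobenius regularization."""
    fit = np.sum((S_img - V @ U_img) ** 2) + np.sum((S_txt - V @ U_txt) ** 2)
    reg = np.sum(V ** 2) + np.sum(U_img ** 2) + np.sum(U_txt ** 2)
    return fit + lam * reg


def factorize_shared(S_img, S_txt, k, lam=1e-2, iters=30, seed=0):
    """Alternating least squares for S_img ~ V @ U_img and S_txt ~ V @ U_txt.

    V (n x k) is the shared document representation (an assumption of this
    sketch); U_img and U_txt (k x classes) are modality-specific bases.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((S_img.shape[0], k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix V, solve the regularized normal equations for each basis.
        gram = V.T @ V + reg
        U_img = np.linalg.solve(gram, V.T @ S_img)
        U_txt = np.linalg.solve(gram, V.T @ S_txt)
        # Fix the bases, solve V @ B = rhs for V (B is symmetric).
        B = U_img @ U_img.T + U_txt @ U_txt.T + reg
        rhs = S_img @ U_img.T + S_txt @ U_txt.T
        V = np.linalg.solve(B, rhs.T).T
    return V, U_img, U_txt


def ridge_projection(X, V, lam=1e-2):
    """Ridge regression: projection P (d x k) such that X @ P approximates V."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ V)


# Toy data: random features; random linear scores stand in for classifiers.
rng = np.random.default_rng(0)
n, d_img, d_txt, n_classes, k = 40, 8, 6, 5, 3
X_img = rng.standard_normal((n, d_img))
X_txt = rng.standard_normal((n, d_txt))
S_img = softmax(X_img @ rng.standard_normal((d_img, n_classes)))
S_txt = softmax(X_txt @ rng.standard_normal((d_txt, n_classes)))

# Shared document representation, then one projection matrix per modality.
V, U_img, U_txt = factorize_shared(S_img, S_txt, k)
P_img = ridge_projection(X_img, V)
P_txt = ridge_projection(X_txt, V)
```

In this sketch an image query would be mapped with `X_img @ P_img` and compared against text items mapped with `X_txt @ P_txt`; the choice of similarity measure is left open because the abstract does not specify one.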

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 61373109, No. 61602349), the Hubei Chengguang Talented Youth Development Foundation (No. 2015B22), the Natural Science Foundation of Hubei Province (No. ZRMS2016000155) and the Science and Technology Research Project of the Hubei Provincial Department of Education (No. Q20161113).

Author information

Authors and Affiliations

College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan, 430065, China
Hong Zhang, Yu Huang, Xin Xu, Ziqi Zhu & Chunhua Deng
Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
Hong Zhang, Yu Huang, Xin Xu, Ziqi Zhu & Chunhua Deng

Corresponding author

Correspondence to Hong Zhang.

About this article

Cite this article

Zhang, H., Huang, Y., Xu, X. et al. Latent semantic factorization for multimedia representation learning. Multimed Tools Appl 77, 3353–3368 (2018). https://doi.org/10.1007/s11042-017-5135-6

Received: 14 March 2017
Revised: 21 July 2017
Accepted: 20 August 2017
Published: 30 August 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s11042-017-5135-6
