Abstract
Cross-media retrieval is an important approach to handling the explosive growth of multimodal data on the web. However, effectively uncovering the correlations between multimodal data remains a barrier to successful cross-media retrieval. Traditional approaches learn the connection between modalities by directly using hand-crafted low-level heterogeneous features, and the learned correlation is merely constructed in terms of high-level feature representation. To fully exploit the intrinsic structure of multimodal data, it is essential to build an interpretable correlation between modalities. In this paper, we propose a deep model that learns a high-level feature representation shared by multiple modalities for cross-media retrieval. The discriminative high-level feature representation is learned in a data-driven manner before the multimodal correlations are faithfully encoded. We train our deep model on large-scale multimodal data crawled from the Internet and evaluate its effectiveness on cross-media retrieval using the NUS-WIDE dataset. The experimental results show that the proposed model outperforms other state-of-the-art approaches.
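The retrieval pipeline the abstract describes — map each modality into a shared high-level representation, then rank items of the other modality by similarity — can be sketched as follows. This is a minimal illustration, not the paper's model: the projection weights are random placeholders standing in for the trained deep network, the dimensions are toy values, and cosine similarity is one common ranking choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(features, weights, bias):
    """Map modality-specific features into the shared space (one tanh layer)."""
    return np.tanh(features @ weights + bias)

# Toy dimensions: 64-d image features, 32-d text features, 16-d shared space.
img_dim, txt_dim, shared_dim = 64, 32, 16
W_img, b_img = rng.standard_normal((img_dim, shared_dim)) * 0.1, np.zeros(shared_dim)
W_txt, b_txt = rng.standard_normal((txt_dim, shared_dim)) * 0.1, np.zeros(shared_dim)

# Toy gallery of 5 text items and one image query.
texts = rng.standard_normal((5, txt_dim))
query_img = rng.standard_normal(img_dim)

def cosine_rank(query_vec, gallery_vecs):
    """Rank gallery items by cosine similarity to the query, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims), sims

shared_texts = project(texts, W_txt, b_txt)
shared_query = project(query_img[None, :], W_img, b_img)[0]
ranking, sims = cosine_rank(shared_query, shared_texts)
print("retrieval order:", ranking)
```

In the actual system, `project` would be replaced by each modality's learned deep encoder, so that an image query and its semantically related texts land close together in the shared space.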
Acknowledgments
This research is supported by the Singapore National Research Foundation under its IRC@Singapore Funding Initiative and administered by IDMPO.
© 2016 Springer International Publishing Switzerland
Cite this paper
Shang, X., Zhang, H., Chua, T.-S. (2016). Deep Learning Generic Features for Cross-Media Retrieval. In: Tian, Q., Sebe, N., Qi, G.-J., Huet, B., Hong, R., Liu, X. (eds) MultiMedia Modeling. MMM 2016. Lecture Notes in Computer Science, vol. 9516. Springer, Cham. https://doi.org/10.1007/978-3-319-27671-7_22
Print ISBN: 978-3-319-27670-0
Online ISBN: 978-3-319-27671-7