Abstract
Cross-media retrieval has attracted wide attention in recent years for its flexibility: a query of any media type can retrieve results across different media types. Beyond exploiting the global information of samples, some recent works focus on regions of the samples, mining local information for better correlation learning across media types. However, these works model only the correlations between regions and the whole sample, ignoring the correlations among regions themselves, including the significance of each region relative to the others, and the supplementary information between a region and its sub-regions, analogous to that between the sample and its regions. To address this problem, this paper proposes a recursive pyramid network with joint attention (RPJA) for cross-media retrieval, which makes two main contributions: (1) We repeatedly partition the sample into increasingly fine regions in a pyramid structure; the representation of the sample is generated by modeling the supplementary information provided by regions and their sub-regions, recursively from the bottom to the top of the pyramid. (2) We propose a joint attention model connecting different media types at each pyramid level, which mines intra-media information and inter-media correlations to guide the learning of each region's significance, further improving correlation learning. Experiments on two widely used datasets against state-of-the-art methods verify the effectiveness of the proposed approach.
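To make the pyramid idea concrete, the following is a minimal sketch (not the authors' implementation) of bottom-up recursive aggregation over a spatial feature map: each region is split into four sub-regions, sub-region representations are weighted by attention scores against a guidance vector (standing in for the paper's joint intra-media/inter-media guidance, which is not specified here), and the attended summary supplements the region's own pooled feature. All function names, the quad-split scheme, and the additive combination are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(sub_reprs, query):
    # weight sub-region representations by their significance w.r.t. a
    # guidance vector (a stand-in for the paper's joint attention signal)
    scores = sub_reprs @ query
    weights = softmax(scores)
    return weights @ sub_reprs

def pyramid_represent(feat, level, depth, query):
    """Recursively partition an (H, W, D) feature map into finer regions
    and aggregate representations bottom-up with attention."""
    if level == depth:
        return feat.mean(axis=(0, 1))  # leaf region: plain average pooling
    h, w, _ = feat.shape
    # quad-split the region into 2x2 sub-regions (illustrative choice)
    subs = [pyramid_represent(feat[i * h // 2:(i + 1) * h // 2,
                                   j * w // 2:(j + 1) * w // 2],
                              level + 1, depth, query)
            for i in range(2) for j in range(2)]
    sub_reprs = np.stack(subs)               # (4, D) sub-region vectors
    supplement = attend(sub_reprs, query)    # attended sub-region summary
    # combine the region's own pooled feature with the supplement
    return feat.mean(axis=(0, 1)) + supplement
```

With `depth` levels, the top call returns one D-dimensional vector that recursively accumulates attended information from every pyramid level, mirroring the bottom-to-top aggregation the abstract describes.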
Acknowledgment
This work was supported by National Natural Science Foundation of China under Grant 61771025 and Grant 61532005.
Copyright information
© 2018 Springer International Publishing AG
Cite this paper
Yuan, Y., Peng, Y.: Recursive pyramid network with joint attention for cross-media retrieval. In: Schoeffmann, K., et al. (eds.) MultiMedia Modeling (MMM 2018). Lecture Notes in Computer Science, vol. 10704. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73603-7_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73602-0
Online ISBN: 978-3-319-73603-7
eBook Packages: Computer Science