Progressive Cross-Media Correlation Learning

  • Xin Huang
  • Yuxin Peng
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 875)


Cross-media retrieval aims to retrieve data across different media types, such as image and text, and its key problem is to learn cross-media correlation from known training data. Existing methods indiscriminately use all data for model training, ignoring the existence of hard samples that carry misleading or even noisy information and have a negative effect, especially in the early period of model training. Because cross-media training data is difficult to collect, the common challenge of small-scale training data makes this problem even more severe and limits the robustness and accuracy of cross-media retrieval. To address this problem, this paper proposes the Progressive Cross-media Correlation Learning (PCCL) approach, which uses a large-scale cross-media dataset with general knowledge (reference data) to guide correlation learning on another small-scale dataset (target data) via a progressive sample selection mechanism. Specifically, we first pre-train a hierarchical correlation learning network on the reference data as a reference model, which is used to assign different learning difficulties to samples in the target data via an intra-media and inter-media relevance significance metric. Then, training samples in the target data are selected with gradually ascending learning difficulty, so that the correlation learning process can progressively reduce the “heterogeneity gap”, enhancing model robustness and improving retrieval accuracy. We take our self-constructed large-scale XMediaNet dataset as the reference data, and cross-media retrieval experiments on 2 widely-used datasets show that PCCL outperforms 9 state-of-the-art methods.
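The progressive sample selection idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the relevance significance metric, so the function names `difficulty_scores` and `progressive_schedule`, the averaging of the two relevance scores, and the linear growth of the selected set are all assumptions made only for illustration. The sketch assumes a reference model has already produced intra-media and inter-media relevance scores in [0, 1] for each target sample.

```python
from typing import List, Sequence


def difficulty_scores(intra: Sequence[float], inter: Sequence[float]) -> List[float]:
    """Combine two relevance scores in [0, 1] into one difficulty value.

    Assumption: higher relevance under the reference model means an easier
    sample, and the two scores are simply averaged. The metric used in the
    paper is not given in the abstract.
    """
    if len(intra) != len(inter):
        raise ValueError("intra and inter must have the same length")
    return [1.0 - 0.5 * (a + b) for a, b in zip(intra, inter)]


def progressive_schedule(difficulty: Sequence[float], num_stages: int) -> List[List[int]]:
    """Return, for each training stage, the indices of the selected samples.

    Stage k (1-based) keeps the easiest ceil(k * n / num_stages) samples, so
    the selected set grows from easy to hard and the last stage uses all data.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Sort sample indices from easiest (lowest difficulty) to hardest.
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    n = len(order)
    stages = []
    for k in range(1, num_stages + 1):
        keep = -(-k * n // num_stages)  # ceiling division
        stages.append(order[:keep])
    return stages


if __name__ == "__main__":
    # Hypothetical relevance scores for four target samples.
    intra = [0.9, 0.2, 0.6, 0.8]
    inter = [0.7, 0.1, 0.5, 0.9]
    scores = difficulty_scores(intra, inter)
    for stage, indices in enumerate(progressive_schedule(scores, num_stages=2)):
        print(stage, indices)
```

In this sketch each stage would train the target model only on the returned indices, so hard samples enter training late, after the model has been shaped by the easier ones.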



This work was supported by the National Natural Science Foundation of China under Grants 61771025 and 61532005.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Institute of Computer Science and Technology, Peking University, Beijing, China
