Multimedia Tools and Applications, Volume 78, Issue 10, pp 13169–13188

Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval

  • Yuhua Jia
  • Liang Bai
  • Shuang Liu
  • Peng Wang (email author)
  • Jinlin Guo
  • Yuxiang Xie


Cross-modal retrieval aims to measure semantic similarity across media types by aligning heterogeneous features in an intermediate common subspace where they can be compared directly. This relies on the assumption that different modalities express the same underlying semantics. In practice, however, the semantics of an item are usually described by multiple concepts, because concepts co-occur in the real world rather than in isolation. This leads to the more challenging task of multi-label cross-modal retrieval, in which, for example, each image is annotated with multiple concept labels. More importantly, the co-occurrence patterns of concepts produce correlated pairs of labels, and these relationships must be taken into account for accurate cross-modal retrieval. In this paper, we propose multi-label kernel canonical correlation analysis (ml-KCCA), a novel approach to cross-modal retrieval that enhances kernel CCA with the high-level semantic information reflected in multi-label annotations. By kernelizing the extraction of correlations from multi-label information, the approach can capture more complex non-linear correlations between modalities and learn a discriminative subspace better suited to cross-modal retrieval. Extensive evaluations on public datasets validate the improvements of our approach over state-of-the-art cross-modal retrieval approaches, including other CCA extensions.
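As a rough illustration of the kernel CCA building block that the abstract refers to, the following is a minimal NumPy sketch of standard regularised kernel CCA. It is not the authors' ml-KCCA: the way multi-label correlations enter the objective is not described in this abstract and is not reproduced here. The RBF kernel, the `gamma` and `reg` values, the toy data, and all function names are assumptions chosen only for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def center_kernel(K):
    """Center a square kernel matrix in feature space: H K H, H = I - 11^T/n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def kcca(Kx, Ky, reg=1e-2, n_components=2):
    """Regularised kernel CCA on two centered kernel matrices.

    Solves (Kx + reg I)^-1 Ky (Ky + reg I)^-1 Kx alpha = rho^2 alpha
    and recovers beta = (Ky + reg I)^-1 Kx alpha / rho.
    """
    n = Kx.shape[0]
    I = np.eye(n)
    Rx = np.linalg.solve(Kx + reg * I, Ky)  # (Kx + reg I)^-1 Ky
    Ry = np.linalg.solve(Ky + reg * I, Kx)  # (Ky + reg I)^-1 Kx
    vals, vecs = np.linalg.eig(Rx @ Ry)
    # Eigenvalues are real and non-negative in theory; drop numerical noise.
    order = np.argsort(-vals.real)[:n_components]
    rho = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs[:, order].real
    beta = (Ry @ alpha) / np.maximum(rho, 1e-12)
    return alpha, beta, rho


# Toy data (hypothetical): two "modalities" driven by one shared latent variable.
rng = np.random.default_rng(0)
z = rng.normal(size=(60, 1))
X = np.hstack([z, z ** 2]) + 0.1 * rng.normal(size=(60, 2))
Y = np.hstack([np.sin(z), z]) + 0.1 * rng.normal(size=(60, 2))

Kx = center_kernel(rbf_kernel(X, X, gamma=0.5))
Ky = center_kernel(rbf_kernel(Y, Y, gamma=0.5))
alpha, beta, rho = kcca(Kx, Ky)

# Training samples of both modalities projected into the common subspace,
# where cross-modal neighbours can be ranked, e.g. by cosine similarity.
x_proj = Kx @ alpha
y_proj = Ky @ beta
```

In a retrieval setting, a query from one modality would be projected with its kernel row against the training set and compared with projected items of the other modality. With a small `reg`, kernel CCA tends to overfit the training pairs, so the regularisation strength would normally be tuned on held-out data.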


Cross-modal retrieval · Kernel CCA · Multi-label information · Concept correlations



This work is supported by the Natural Science Foundation of China under Grants No. 61571453, No. 61502264, and No. 61405252; the Natural Science Foundation of Hunan Province, China, under Grant No. 14JJ3010; and the Research Funding of National University of Defense Technology under Grant No. ZK16-03-37.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018
corrected publication March/2018

Authors and Affiliations

  • Yuhua Jia (1)
  • Liang Bai (1)
  • Shuang Liu (1)
  • Peng Wang (2), email author
  • Jinlin Guo (1)
  • Yuxiang Xie (1)
  1. Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, China
  2. National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, China
