
Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval

Published in: Multimedia Tools and Applications

Abstract

Aiming to measure inter-media semantic similarity, cross-modal retrieval aligns heterogeneous features into an intermediate common subspace in which they can be compared directly, on the basis of the shared semantics that the different modalities represent. In the real world, however, concepts co-occur rather than appear in isolation, so semantics are usually reflected by multiple concepts. This leads to the more challenging task of multi-label cross-modal retrieval, in which, for example, an image is annotated with multiple concept labels. More importantly, the co-occurrence patterns of concepts produce correlated pairs of labels whose relationships must be taken into account for accurate cross-modal retrieval. In this paper, we propose multi-label kernel canonical correlation analysis (ml-KCCA), a novel approach for cross-modal retrieval that enhances kernel CCA with the high-level semantic information carried by multi-label annotations. By kernelizing the extraction of correlations from multi-label information, more complex non-linear correlations between modalities can be measured, yielding a discriminative subspace better suited to cross-modal retrieval tasks. Extensive evaluations on public datasets validate the improvements of our approach over state-of-the-art cross-modal retrieval methods, including other CCA extensions.
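The building block the abstract extends, regularized kernel CCA, can be sketched as follows. This is a minimal illustration of the standard dual formulation (Hardoon et al., 2004), not the authors' ml-KCCA implementation; the RBF kernel, the regularization value, and the toy two-modality data are assumptions for demonstration only.

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def center_kernel(K):
    # Double-center the kernel matrix (zero-mean in feature space)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca_first_pair(Kx, Ky, reg=1e-3):
    # Regularized kernel CCA in dual form: solve
    #   (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha
    # for the leading pair of dual coefficients (alpha, beta).
    n = Kx.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    vals, vecs = np.linalg.eig(M)
    alpha = vecs[:, np.argmax(vals.real)].real
    beta = np.linalg.solve(Ky + reg * I, Kx @ alpha)
    return alpha, beta

# Toy "two modalities": non-linear views of a shared latent variable z
rng = np.random.default_rng(0)
z = rng.standard_normal(80)
X = np.c_[z, rng.standard_normal(80)]           # stand-in "image" features
Y = np.c_[np.tanh(z), rng.standard_normal(80)]  # stand-in "text" features

Kx = center_kernel(rbf_kernel(X))
Ky = center_kernel(rbf_kernel(Y))
alpha, beta = kcca_first_pair(Kx, Ky)
u, v = Kx @ alpha, Ky @ beta        # projections into the common subspace
corr = abs(np.corrcoef(u, v)[0, 1])
```

In cross-modal retrieval the projections `u` and `v` play the role of the common subspace in which heterogeneous features become comparable; ml-KCCA additionally injects multi-label semantic correlations into this kernelized formulation.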

Figures 1–4 appear in the full article.




Acknowledgments

This work is supported by the Natural Science Foundation of China under Grants No. 61571453, No. 61502264, and No. 61405252; the Natural Science Foundation of Hunan Province, China, under Grant No. 14JJ3010; and the Research Funding of the National University of Defense Technology under Grant No. ZK16-03-37.

Author information


Corresponding author

Correspondence to Peng Wang.

Additional information

Yuhua Jia and Liang Bai contributed equally as first authors.


About this article


Cite this article

Jia, Y., Bai, L., Liu, S. et al. Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval. Multimed Tools Appl 78, 13169–13188 (2019). https://doi.org/10.1007/s11042-018-5767-1

