Boosting Multimodal Semantic Understanding by Local Similarity Adaptation and Global Correlation Propagation

  • Hong Zhang
  • Xiaoli Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6297)


An important trend in multimedia semantic understanding is the utilization and support of multimodal data that are heterogeneous in low-level features, such as image and audio. The main challenge is how to measure the different kinds of correlations among multimodal data. In this paper, we propose a novel approach to boost multimodal semantic understanding from local and global perspectives. First, the cross-media correlation between images and audio clips is estimated with Kernel Canonical Correlation Analysis; second, a multimodal graph is constructed to enable global correlation propagation with adapted intra-media similarity; finally, a cross-media retrieval algorithm is discussed as an application of our approach. A prototype system is developed to demonstrate the feasibility and capability of the approach. Experimental results are encouraging and indicate that our approach is effective.
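The abstract does not specify the propagation rule, so the following is only a minimal sketch of the general idea of spreading correlation scores over a graph that mixes intra-media and cross-media edges. It assumes a standard iterative scheme, f ← αSf + (1 − α)y with a symmetrically normalized affinity matrix S, which is not necessarily the authors' formulation. The edge weights, the function names `normalize_affinity` and `propagate`, and the parameter `alpha` are hypothetical and chosen for illustration.

```python
import math


def normalize_affinity(w):
    """Symmetrically normalize an affinity matrix: S = D^{-1/2} W D^{-1/2}."""
    n = len(w)
    degree = [sum(row) for row in w]
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if degree[i] > 0 and degree[j] > 0:
                s[i][j] = w[i][j] / math.sqrt(degree[i] * degree[j])
    return s


def propagate(w, seed, alpha=0.5, iterations=100):
    """Iterate f <- alpha * S f + (1 - alpha) * seed and return the final scores."""
    s = normalize_affinity(w)
    n = len(w)
    f = list(seed)
    for _ in range(iterations):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * seed[i]
            for i in range(n)
        ]
    return f


# Hypothetical toy graph (assumed weights, not data from the paper):
# nodes 0 and 1 are images, nodes 2 and 3 are audio clips.
# Intra-media edges: (0, 1) and (2, 3). Cross-media edge: (1, 2).
W = [
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.6, 0.0],
    [0.0, 0.6, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
]

# Query with image 0 and rank the audio clips by their propagated score.
query = [1.0, 0.0, 0.0, 0.0]
scores = propagate(W, query)
audio_ranking = sorted([2, 3], key=lambda i: scores[i], reverse=True)
print(audio_ranking)
```

In this toy setting the image query reaches the audio clips only through the single cross-media edge, so the clip directly correlated with an image ranks above the clip connected only through intra-media similarity. In the paper's pipeline the cross-media weights would come from Kernel Canonical Correlation Analysis rather than being set by hand.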


multimodal semantics correlation propagation 





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Hong Zhang (1)
  • Xiaoli Liu (1)
  1. College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan
