Multi-modality video shot clustering with tensor representation

Multimedia Tools and Applications

Abstract

Video analysis and understanding remain challenging problems. Video data comprise multiple media modalities that exhibit temporal-sequenced associated co-occurrence (TSAC). Traditionally, videos are represented as vectors in Euclidean space, and learning algorithms are applied to these high-dimensional vectors for dimensionality reduction, classification, clustering, and recognition. However, the modalities in video have their own properties and are also correlated with one another; a simple vector representation weakens the power of these relatively independent modalities and, to some extent, ignores the relations between them. Clustering is an important technique for multimedia data management, and a powerful clustering algorithm named Affinity Propagation was recently devised. In this paper, we introduce a higher-order tensor framework for video analysis. In this framework, the three modalities in a video shot (image frames, audio stream, and transcript text) are represented together as a data point in the form of a third-order tensor. We also present a dimension reduction method for the high-dimensional features of video shots that explicitly considers the manifold structure of the tensor space formed by temporal-sequenced associated co-occurring multimodal media data. We call this the TensorShot approach. We then use Affinity Propagation to cluster the video shots in tensor form. Our algorithm preserves the intrinsic structure of the submanifold from which the tensorshots are sampled. Experiments on the TRECVID2005 news video data set show that our algorithm achieves improved performance.
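
The pipeline described in the abstract (one third-order tensor per shot, then Affinity Propagation over the shots) can be illustrated with a minimal sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: the helper names `outer3`, `sq_frobenius_distance` and `affinity_propagation`, the toy feature vectors, the use of an outer product to combine the three modalities into a tensor, and the negative squared Frobenius distance as the similarity. The manifold-preserving dimension reduction step is omitted entirely. The message-passing updates follow the standard responsibility and availability rules of Affinity Propagation, with the median similarity used as the shared preference.

```python
from statistics import median


def outer3(image, audio, text):
    """Third-order tensor with entries image[i] * audio[j] * text[k] (illustrative assumption)."""
    return [[[x * y * z for z in text] for y in audio] for x in image]


def sq_frobenius_distance(t1, t2):
    """Squared Frobenius distance between two equally shaped third-order tensors."""
    return sum(
        (a - b) ** 2
        for m1, m2 in zip(t1, t2)
        for r1, r2 in zip(m1, m2)
        for a, b in zip(r1, r2)
    )


def affinity_propagation(S, damping=0.5, iterations=200):
    """Plain message passing on a similarity matrix S; S[k][k] holds the preference."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k) for i' not in {i, k})
        # a(k, k) = sum of positive r(i', k) for i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [-1] * n  # no exemplar emerged within the iteration budget
    labels = [
        i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
        for i in range(n)
    ]
    return exemplars, labels


# Toy shots: (image, audio, text) feature vectors; the values are made up.
shots = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.1], [1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.2], [1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
    ([0.1, 1.0], [0.0, 1.0], [0.0, 1.0]),
    ([0.2, 1.0], [0.0, 1.0], [0.0, 1.0]),
]
tensors = [outer3(*shot) for shot in shots]
n = len(tensors)

# Similarity: negative squared Frobenius distance between shot tensors.
S = [[-sq_frobenius_distance(tensors[i], tensors[j]) for j in range(n)] for i in range(n)]
preference = median(S[i][j] for i in range(n) for j in range(n) if i != j)
for i in range(n):
    S[i][i] = preference

exemplars, labels = affinity_propagation(S)
print("exemplar shots:", exemplars)
print("exemplar assigned to each shot:", labels)
```

Each label is the index of the exemplar shot that a shot is assigned to, so the number of clusters is not fixed in advance; it follows from the preference value. Lowering the preference tends to yield fewer clusters, and raising it tends to yield more.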

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 60603096, No. 60533090), the Key Technology R&D Program (2006BAH02A13-4), the National High Technology Research and Development Program of China (2006AA010107), the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), and the Cultivation Fund of the Key Scientific and Technical Innovation Project, Ministry of Education of China (No. 706033).

Author information

Corresponding author

Correspondence to Fei Wu.

About this article

Cite this article

Liu, Y., Wu, F. Multi-modality video shot clustering with tensor representation. Multimed Tools Appl 41, 93–109 (2009). https://doi.org/10.1007/s11042-008-0220-5

  • DOI: https://doi.org/10.1007/s11042-008-0220-5