Abstract
Video analysis and understanding remain challenging problems. Video data comprises multiple media modalities that exhibit temporal-sequenced associated co-occurrence (TSAC). Traditionally, videos are represented as vectors in Euclidean space, and learning algorithms are then applied to these high-dimensional vectors for dimensionality reduction, classification, clustering, and recognition. However, the modalities in video not only have their own properties but are also correlated with one another; a simple vector representation weakens the power of these relatively independent modalities and, to some extent, ignores the relations between them. Clustering is an important technique for multimedia data management, and a powerful clustering algorithm named Affinity Propagation has recently been proposed. In this paper, we introduce a higher-order tensor framework for video analysis. In this framework, the three modalities of a video shot, namely image frames, audio stream, and transcript text, are represented together as a data point in the form of a third-order tensor. We also present a dimensionality reduction method for the high-dimensional features of video shots that explicitly considers the manifold structure of the tensor space formed by temporal-sequenced associated co-occurring multimodal media data. We call this the TensorShot approach. We then use Affinity Propagation to cluster the video shots in tensor form. Our algorithm preserves the intrinsic structure of the submanifold from which the tensorshots are sampled. Experiments on the TRECVID 2005 news video data set show that our algorithm achieves improved performance.
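To make the pipeline described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that a shot's third-order tensor is the outer product of its image, audio, and text feature vectors, and that shot similarity is the negative squared Frobenius distance; the abstract specifies neither choice. The dimensionality reduction step is omitted, and the names `shot_tensor`, `similarity_matrix`, and `affinity_propagation`, together with the synthetic feature vectors, are hypothetical. The message-passing updates follow the standard responsibility/availability scheme of Frey and Dueck (2007).

```python
import numpy as np

def shot_tensor(image_feat, audio_feat, text_feat):
    """Third-order tensor for one shot (ASSUMPTION: outer product of modalities)."""
    return np.einsum("i,j,k->ijk", image_feat, audio_feat, text_feat)

def similarity_matrix(tensors):
    """Negative squared Frobenius distance between every pair of shot tensors."""
    flat = np.stack([t.ravel() for t in tensors])
    S = -np.sum((flat[:, None, :] - flat[None, :, :]) ** 2, axis=-1)
    # Shared preference on the diagonal: median of the off-diagonal similarities.
    off_diag = S[~np.eye(len(tensors), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    return S

def affinity_propagation(S, damping=0.7, max_iter=500):
    """Plain Affinity Propagation with a fixed number of damped iterations."""
    n = S.shape[0]
    idx = np.arange(n)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[idx, idx].copy()
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    exemplars = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)  # no exemplar emerged
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Synthetic data (hypothetical): two groups of five shots, each a noisy copy
# of a group prototype with 4 image, 3 audio, and 5 text features.
rng = np.random.default_rng(0)
prototypes = [
    (np.eye(4)[0], np.eye(3)[0], np.eye(5)[0]),
    (np.eye(4)[3], np.eye(3)[2], np.eye(5)[4]),
]
shots = [
    shot_tensor(img + 0.05 * rng.standard_normal(4),
                aud + 0.05 * rng.standard_normal(3),
                txt + 0.05 * rng.standard_normal(5))
    for img, aud, txt in prototypes
    for _ in range(5)
]
S = similarity_matrix(shots)
exemplars, labels = affinity_propagation(S)
print("exemplar shots:", exemplars)
print("cluster labels:", labels)
```

Affinity Propagation does not take the number of clusters as input; the shared preference on the diagonal of `S` controls how many exemplars emerge, and the median used here is only one common default.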
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 60603096, No. 60533090), the Key Technology R&D Program (2006BAH02A13-4), the National High Technology Research and Development Program of China (2006AA010107), the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), and the Cultivation Fund of the Key Scientific and Technical Innovation Project, Ministry of Education of China (No. 706033).
Cite this article
Liu, Y., Wu, F. Multi-modality video shot clustering with tensor representation. Multimed Tools Appl 41, 93–109 (2009). https://doi.org/10.1007/s11042-008-0220-5
DOI: https://doi.org/10.1007/s11042-008-0220-5