Abstract
In recent years, bag-of-words (BoW) video representations have achieved promising results in human action recognition. By vector-quantizing local spatio-temporal (ST) features, the BoW representation offers simplicity and efficiency, but it also has limitations. First, discretizing the feature space inevitably introduces ambiguity and information loss into the video representation. Second, there is no universal codebook for the BoW representation: the codebook must be rebuilt whenever the video corpus changes. To tackle these issues, this paper explores a localized, continuous, and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of a video's ensemble of local ST features as a distribution estimated by a generative probabilistic model. Furthermore, the probabilistic representation naturally gives rise to an information-theoretic distance metric between videos, which makes it readily applicable to most discriminative classifiers, such as nearest-neighbor schemes and kernel-based classifiers. Experiments on two datasets, KTH and UCF Sports, show that the proposed approach can deliver promising results.
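The idea of representing each video by a distribution and comparing videos with an information-theoretic distance can be illustrated with a minimal sketch. The abstract does not specify the generative model or the distance here, so everything below is an assumption for illustration: each video is modeled by a single diagonal-covariance Gaussian (a stand-in for whatever generative model is actually used), the distance is the symmetric Kullback-Leibler divergence, and all function names and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Assumed stand-in model: (means, variances), one entry per feature dimension.
Gaussian = Tuple[List[float], List[float]]


def fit_diag_gaussian(features: Sequence[Sequence[float]], eps: float = 1e-6) -> Gaussian:
    """Estimate a diagonal-covariance Gaussian from one video's local descriptors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    # eps keeps variances strictly positive so the logarithm below is defined.
    variances = [
        sum((f[d] - means[d]) ** 2 for f in features) / n + eps
        for d in range(dim)
    ]
    return means, variances


def kl_diag(p: Gaussian, q: Gaussian) -> float:
    """Closed-form KL(p || q) between two diagonal Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(p[0], p[1], q[0], q[1]):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def symmetric_kl(p: Gaussian, q: Gaussian) -> float:
    """Symmetrized KL divergence, used here as the distance between two videos."""
    return kl_diag(p, q) + kl_diag(q, p)


def kl_kernel(p: Gaussian, q: Gaussian, a: float = 1.0) -> float:
    """Exponentiated negative distance, usable as a kernel value (a is a hypothetical scale)."""
    return math.exp(-a * symmetric_kl(p, q))


def nearest_neighbor_label(query: Gaussian, train: Sequence[Tuple[Gaussian, str]]) -> str:
    """Return the label of the training video closest to the query."""
    return min(train, key=lambda item: symmetric_kl(query, item[0]))[1]


# Toy 2-D descriptors (made up for illustration, not real ST features).
walk = fit_diag_gaussian([[0.0, 1.0], [0.2, 1.1], [0.1, 0.9]])
run = fit_diag_gaussian([[2.0, 3.0], [2.2, 3.1], [2.1, 2.9]])
query = fit_diag_gaussian([[0.05, 1.0], [0.15, 1.05], [0.1, 0.95]])

predicted = nearest_neighbor_label(query, [(walk, "walk"), (run, "run")])
print(predicted)
```

No codebook appears anywhere in this sketch: each video is summarized by its own distribution, so adding videos to the corpus does not require rebuilding a shared vocabulary. A richer model, such as a mixture, would generally need an approximate divergence in place of the closed form used here.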
Acknowledgments
This work was supported by the National Basic Research Program of China (973 Program, 2007CB311105), the National Natural Science Foundation of China (60873165), and the Co-building Program of the Beijing Municipal Education Commission.
Cite this article
Song, Y., Tang, S., Zheng, YT. et al. Exploring probabilistic localized video representation for human action recognition. Multimed Tools Appl 58, 663–685 (2012). https://doi.org/10.1007/s11042-011-0748-7
DOI: https://doi.org/10.1007/s11042-011-0748-7