Exploring probabilistic localized video representation for human action recognition

Song, Yan; Tang, Sheng; Zheng, Yan-Tao; Chua, Tat-Seng; Zhang, Yongdong; Lin, Shouxun

doi:10.1007/s11042-011-0748-7

Published: 11 February 2011

Volume 58, pages 663–685, (2012)

Multimedia Tools and Applications

Abstract

In recent years, bag-of-words (BoW) video representations have achieved promising results in human action recognition in videos. By vector-quantizing local spatio-temporal (ST) features, the BoW video representation brings simplicity and efficiency, but also limitations. First, the discretization of the feature space in BoW inevitably results in ambiguity and information loss in the video representation. Second, there exists no universal codebook for the BoW representation: the codebook needs to be rebuilt whenever the video corpus changes. To tackle these issues, this paper explores a localized, continuous and probabilistic video representation. Specifically, the proposed representation encodes the visual and motion information of an ensemble of local ST features of a video into a distribution estimated by a generative probabilistic model. Furthermore, the probabilistic video representation naturally gives rise to an information-theoretic distance metric between videos. This makes the representation readily applicable to most discriminative classifiers, such as nearest-neighbor schemes and kernel-based classifiers. Experiments on two datasets, KTH and UCF Sports, show that the proposed approach can deliver promising results.
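
The pipeline outlined in the abstract (one distribution per video, an information-theoretic distance between distributions, then a nearest-neighbor or kernel classifier) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: each video is modelled by a single diagonal Gaussian (a deliberately simple stand-in for the unspecified generative model), the distance is the symmetrised Kullback-Leibler divergence in closed form, and the features, the labels "walking" and "running", and all function names are hypothetical.

```python
import math
import random


def fit_diag_gaussian(features, eps=1e-6):
    """Estimate a per-dimension mean and variance from a video's local features.

    Assumption: a single diagonal Gaussian stands in for the generative model.
    """
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # eps keeps variances strictly positive so the divergence stays finite.
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + eps
           for d in range(dim)]
    return mean, var


def kl_diag_gaussian(p, q):
    """Closed-form KL(p || q) for two diagonal Gaussians given as (mean, var)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def video_distance(p, q):
    """Symmetrised KL divergence, used here as the distance between two videos."""
    return kl_diag_gaussian(p, q) + kl_diag_gaussian(q, p)


def kernel(p, q, gamma=0.1):
    """Exponentiated distance, one common way to feed a divergence to a kernel classifier."""
    return math.exp(-gamma * video_distance(p, q))


def nearest_neighbor(query, labelled_models):
    """Return the label of the training video whose model is closest to the query."""
    return min(labelled_models, key=lambda item: video_distance(query, item[0]))[1]


def toy_video(center, n=200, dim=4, seed=0):
    """Hypothetical stand-in for a video's local ST descriptors (synthetic data)."""
    rng = random.Random(seed)
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


# Two synthetic training videos with made-up labels, and one query video.
train = [
    (fit_diag_gaussian(toy_video(0.0, seed=1)), "walking"),
    (fit_diag_gaussian(toy_video(3.0, seed=2)), "running"),
]
query = fit_diag_gaussian(toy_video(2.8, seed=3))
predicted = nearest_neighbor(query, train)
```

Because each video is summarised by its own distribution, no shared codebook has to be rebuilt when the corpus changes; only the new video's model is fitted. Note that the symmetrised KL divergence does not satisfy the triangle inequality, so it is a distance only in a loose sense.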

Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program, 2007CB311105), the National Natural Science Foundation of China (60873165), and the Co-building Program of Beijing Municipal Education Commission.

Author information

Authors and Affiliations

Laboratory of Advanced Computing Research, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 10090, China
Yan Song, Sheng Tang, Yongdong Zhang & Shouxun Lin
Graduate University of the Chinese Academy of Sciences, Beijing, 10039, China
Yan Song
Institute for Infocomm Research, A*STAR, Singapore, Singapore
Yan-Tao Zheng
School of Computing, National University of Singapore, Singapore, Singapore
Tat-Seng Chua

Corresponding author

Correspondence to Yan Song.

About this article

Cite this article

Song, Y., Tang, S., Zheng, YT. et al. Exploring probabilistic localized video representation for human action recognition. Multimed Tools Appl 58, 663–685 (2012). https://doi.org/10.1007/s11042-011-0748-7

Issue Date: June 2012
