Recognizing Realistic Action Using Contextual Feature Group

Ye, Yituo; Qin, Lei; Cheng, Zhongwei; Huang, Qingming

doi:10.1007/978-1-4614-3501-3_38

Abstract

Although spatial–temporal local features and the bag of visual words (BoW) model have been widely and successfully adopted in action classification, several problems remain. First, the extracted local features are not stable enough, which may be caused by background action or camera shake. Second, using local features alone ignores the spatial–temporal relationships among these features, which may decrease classification accuracy. Finally, the distance typically used in the clustering algorithm of the BoW model does not take semantic context into consideration. To address these problems, we propose a systematic framework for recognizing realistic actions that considers the spatial–temporal relationships between the pruned local features and uses a new discriminate group distance to incorporate semantic context information. A Support Vector Machine (SVM) with multiple kernels is employed to make use of both the local feature and the feature group information. The proposed method is evaluated on the KTH dataset and on YouTube, a relatively realistic dataset. Experimental results validate our approach, and the recognition performance is promising.
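
As a rough illustration of the pipeline the abstract outlines (BoW histograms over two feature channels, combined through multiple kernels), the following minimal Python sketch may help. It is not the authors' implementation: the abstract does not specify the group distance, the kernel type, or the kernel weights, so plain Euclidean distance, a chi-square kernel, and a fixed weight are assumptions used here as stand-ins, and all function names are hypothetical.

```python
import math

def bow_histogram(descriptors, codebook):
    """Quantize descriptors against a codebook; return an L1-normalized histogram."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        # ASSUMPTION: plain Euclidean distance. The paper's discriminate
        # group distance is not described in the abstract and is not reproduced.
        nearest = min(range(len(codebook)), key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel between two histograms (an assumed kernel choice)."""
    dist = sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * dist)

def combined_kernel(local_a, local_b, group_a, group_b, weight=0.5):
    """Weighted sum of the local-feature kernel and the feature-group kernel.

    ASSUMPTION: a fixed convex combination; the weighting used in the paper
    is not given in the abstract.
    """
    k_local = chi2_kernel(local_a, local_b)
    k_group = chi2_kernel(group_a, group_b)
    return weight * k_local + (1.0 - weight) * k_group

# Toy usage with a hypothetical two-word codebook in 2-D descriptor space.
codebook = [(0.0, 0.0), (1.0, 1.0)]
hist = bow_histogram([(0.1, 0.0), (0.9, 1.2), (1.1, 0.8)], codebook)
similarity = combined_kernel(hist, hist, hist, hist)
```

A kernel matrix built from `combined_kernel` over all pairs of training videos could then be passed to any SVM solver that accepts precomputed kernels.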

Acknowledgements

This work was supported in part by the National Basic Research Program of China (973 Program): 2009CB320906, in part by the National Natural Science Foundation of China: 61025011, 61035001 and 61003165, and in part by the Beijing Natural Science Foundation: 4111003.

Author information

Authors and Affiliations

Graduate University of Chinese Academy of Sciences, Beijing, 100049, China
Yituo Ye, Zhongwei Cheng & Qingming Huang
Key lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, 100190, China
Lei Qin & Qingming Huang

Corresponding author

Correspondence to Qingming Huang.

About this paper

Cite this paper

Ye, Y., Qin, L., Cheng, Z., Huang, Q. (2013). Recognizing Realistic Action Using Contextual Feature Group. In: The Era of Interactive Media. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3501-3_38

DOI: https://doi.org/10.1007/978-1-4614-3501-3_38
Published: 03 August 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-3500-6
Online ISBN: 978-1-4614-3501-3
eBook Packages: Computer Science, Computer Science (R0)
