A survey on aggregating methods for action recognition with dense trajectories

Xu, Haiyan; Tian, Qian; Wang, Zhen; Wu, Jianhui

doi:10.1007/s11042-015-2536-2

Published: 15 March 2015

Multimedia Tools and Applications, volume 75, pages 5701–5717 (2016)

Abstract

Action recognition in unconstrained video sequences has become an important topic in computer vision. A wide variety of approaches to feature extraction and video description play important roles in action recognition. In this paper, we survey the main representations built along dense trajectories and the aggregating methods applied to videos over the last decade. We mainly discuss three aggregating methods: bag of words (BOW), Fisher vector (FV) and vector of locally aggregated descriptors (VLAD). Furthermore, the most recent mean average precision (mAP) figures reported in the references are used to compare the aggregating methods on realistic datasets. For a more intuitive comparison, we also evaluate these methods on KTH under the same conditions. Finally, we analyze and compare the experimental data reported in those papers and, based on this review, discuss the technical trends in this field.
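As a rough illustration of two of the aggregating methods named above, the sketch below encodes a small set of local descriptors with BOW (a normalized histogram of nearest-codeword assignments) and with VLAD (per-codeword sums of residuals, concatenated and L2-normalized). It is a minimal sketch and not the survey's implementation: the function names, the two-word codebook and the toy 2-D descriptors are assumptions made for illustration, and in practice the codebook would be learned (for example with k-means) from trajectory descriptors. FV is omitted because it additionally requires a fitted Gaussian mixture model.

```python
import math

def nearest(codebook, x):
    """Index of the codeword closest to descriptor x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], x)))

def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def bow_encode(descriptors, codebook):
    """BOW: L1-normalized histogram of hard codeword assignments."""
    hist = [0.0] * len(codebook)
    for x in descriptors:
        hist[nearest(codebook, x)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

def vlad_encode(descriptors, codebook):
    """VLAD: sum of residuals (x - codeword) per codeword, concatenated, L2-normalized."""
    dim = len(codebook[0])
    acc = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        k = nearest(codebook, x)
        for d in range(dim):
            acc[k][d] += x[d] - codebook[k][d]
    flat = [value for row in acc for value in row]
    return l2_normalize(flat)

# Hypothetical toy data: 2 codewords and 3 descriptors in 2-D.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]

bow = bow_encode(descriptors, codebook)    # length K = 2
vlad = vlad_encode(descriptors, codebook)  # length K * D = 4
```

The BOW vector has one entry per codeword, while the VLAD vector has one entry per codeword and descriptor dimension, which is why VLAD can retain more information than BOW for the same codebook size.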

Acknowledgments

This work was partly supported by the National Science Foundation of China (Grant No. 61001104), the Key Foundation of Jiangsu (Grant No. BK2011018), the Fundamental Research Funds for the Central Universities, and the Graduate Research and Innovation Projects of Universities in Jiangsu Province (KYLX_0129).

Author information

Authors and Affiliations

School of Electronic Science and Engineering, National ASIC Research and Engineering Center, Southeast University, Nanjing, 210096, China
Haiyan Xu, Qian Tian, Zhen Wang & Jianhui Wu

Corresponding author

Correspondence to Haiyan Xu.

About this article

Xu, H., Tian, Q., Wang, Z. et al. A survey on aggregating methods for action recognition with dense trajectories. Multimed Tools Appl 75, 5701–5717 (2016). https://doi.org/10.1007/s11042-015-2536-2

Received: 25 February 2014
Revised: 07 February 2015
Accepted: 26 February 2015
Published: 15 March 2015
Issue Date: May 2016
DOI: https://doi.org/10.1007/s11042-015-2536-2
