Multimedia Tools and Applications, Volume 75, Issue 10, pp 5701–5717

A survey on aggregating methods for action recognition with dense trajectories

  • Haiyan Xu
  • Qian Tian
  • Zhen Wang
  • Jianhui Wu


Action recognition in unconstrained video sequences has become a very important topic in computer vision. A variety of approaches to feature extraction and video sequence description play important roles in action recognition. In this paper, we survey the main representations along dense trajectories and the aggregating methods for videos proposed in the last decade. We mainly discuss three aggregating methods: bag of words (BOW), Fisher vector (FV), and vector of locally aggregated descriptors (VLAD). Furthermore, the most recent mean average precision (mAP) results reported in the references are used to compare the different aggregating methods on realistic datasets. For a more intuitive comparison, we also evaluate these aggregating methods on KTH under the same conditions. Finally, we analyze and compare the experimental data reported in those papers and, based on our review of several approaches to action recognition, discuss the technical trends in this field.
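To make the difference between two of the aggregating methods named above concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes a codebook has already been learned (for example by k-means) from local descriptors extracted along dense trajectories. The function names `nearest_codeword`, `bow_encode`, and `vlad_encode` are hypothetical, and the normalization choices (L1 for BOW, L2 for VLAD) are common conventions assumed here. FV is omitted because it additionally requires a fitted Gaussian mixture model.

```python
import numpy as np

def nearest_codeword(descriptors, codebook):
    """Index of the closest codeword (squared Euclidean) for each descriptor."""
    # diffs has shape (N, K, D); d2 has shape (N, K)
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_encode(descriptors, codebook):
    """BOW: L1-normalized histogram of codeword assignments (length K)."""
    k = codebook.shape[0]
    assign = nearest_codeword(descriptors, codebook)
    hist = np.bincount(assign, minlength=k).astype(float)
    return hist / max(hist.sum(), 1.0)

def vlad_encode(descriptors, codebook):
    """VLAD: per-codeword sum of residuals, flattened and L2-normalized (length K*D)."""
    k, d = codebook.shape
    assign = nearest_codeword(descriptors, codebook)
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            vlad[i] = (members - codebook[i]).sum(axis=0)
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy data (assumed for illustration): 2 codewords, 3 descriptors in 2-D.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
bow = bow_encode(descriptors, codebook)
vlad = vlad_encode(descriptors, codebook)
```

In this sketch BOW keeps only the counts per codeword, whereas VLAD keeps the accumulated offsets from each codeword, so for the same codebook size the VLAD vector is D times longer.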


Keywords: Action recognition, Aggregating methods, BOW, FV, VLAD, Low-level representation



This work was partly supported by the National Science Foundation of China (Grant No. 61001104), the Key Foundation of Jiangsu (Grant No. BK2011018), the Fundamental Research Funds for the Central Universities, and the Graduate Research and Innovation Projects of Universities in Jiangsu Province (KYLX_0129).



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. School of Electronic Science and Engineering, National ASIC Research and Engineering Center, Southeast University, Nanjing, China
