Abstract
Automatic video action recognition has been a long-standing problem in computer vision. A scalable solution for action recognition requires an efficient visual representation of motion. In this paper, we propose a new visual representation for actions based on body motion boundaries. First, a set of optical-flow frames highlighting the principal motions of the poses is extracted. Then, motion boundaries are computed from these optical-flow frames. Maximally Stable Extremal Regions are applied to the motion-boundary maps to obtain Motion Stable Shape (MSS) features. Local descriptors are computed for each detected MSS to capture motion patterns. To predict the classes of the different human actions, the descriptors are represented with a bag-of-words (BOW) model, and a non-linear support vector machine is used for classification. We performed experiments on several datasets (Weizmann, KTH, UCF Sport, UCF50, and Hollywood) to demonstrate the efficiency of the proposed model. The achieved results improve on the state of the art for the KTH and Weizmann datasets and are comparable to the state of the art for the UCF Sport and UCF50 datasets.
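The motion-boundary step described above (spatial derivatives of the optical-flow field) can be sketched as follows. This is a minimal NumPy illustration under the common definition of motion boundaries as the combined gradient magnitude of the two flow components; the function name `motion_boundaries` and the exact formulation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def motion_boundaries(flow):
    """Motion-boundary magnitude map from a dense optical-flow field.

    flow : H x W x 2 array holding the (u, v) displacement of each pixel.
    Returns the per-pixel gradient magnitude of the flow, which is large
    where neighbouring pixels move differently (i.e. at motion boundaries)
    and zero for uniform motion such as camera translation.
    """
    u, v = flow[..., 0], flow[..., 1]
    du_y, du_x = np.gradient(u)   # spatial derivatives of horizontal flow
    dv_y, dv_x = np.gradient(v)   # spatial derivatives of vertical flow
    return np.sqrt(du_x**2 + du_y**2 + dv_x**2 + dv_y**2)

# A uniform translation produces no motion boundaries ...
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0                      # constant horizontal motion
print(motion_boundaries(flow).max())    # 0.0

# ... while a moving region against a static background does.
flow[:, 4:, 0] = 0.0                    # right half static
print(motion_boundaries(flow).max() > 0)  # True
```

Because uniform (e.g. camera-induced) flow cancels out, region detectors such as MSER applied to these maps respond to the moving body parts rather than to global motion.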
Cite this article
Lassoued, I., Zagrouba, E. Human actions recognition: an approach based on stable motion boundary fields. Multimed Tools Appl 77, 20715–20729 (2018). https://doi.org/10.1007/s11042-017-5477-0