Abstract
In this paper, we present a novel approach to human action and gesture recognition using dual complementary tensors. The proposed method constructs a compact yet discriminative representation by normalizing the input video volume into two tensors: one obtained from the raw video volume data and the other from histogram of oriented gradients (HOG) features. Each tensor is converted to factored matrices, and the similarity between factored matrices is evaluated using canonical correlation analysis (CCA). We further propose an information fusion method to combine the resulting similarity scores. The proposed fusion strategy enhances the discriminability between different action categories and leads to better recognition accuracy. We have conducted several experiments on two publicly available databases (UCF Sports and Cambridge-Gesture). The results show that the proposed method achieves recognition accuracy comparable to that of state-of-the-art methods.
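The pipeline outlined in the abstract (tensor, factored matrices, CCA similarity, score fusion) can be illustrated with a minimal sketch. The abstract does not specify the factorization, the HOG configuration, or the fusion rule, so everything below is an assumption made for illustration: the factored matrices are taken to be orthonormal bases from the SVD of each mode-n unfolding, a plain gradient-magnitude volume stands in for the HOG tensor, and the fusion is a weighted sum with a hypothetical weight `alpha`. All function names are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def factor_bases(tensor, rank):
    """One orthonormal basis per mode (assumed factorization, not the paper's)."""
    bases = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        bases.append(u[:, :rank])  # leading left singular vectors
    return bases

def cca_similarity(bases_a, bases_b):
    """Mean canonical correlation between matching mode subspaces."""
    scores = []
    for qa, qb in zip(bases_a, bases_b):
        # For orthonormal bases, the singular values of qa^T qb are the
        # canonical correlations (cosines of the principal angles).
        corr = np.linalg.svd(qa.T @ qb, compute_uv=False)
        scores.append(np.clip(corr, 0.0, 1.0).mean())
    return float(np.mean(scores))

def gradient_stand_in(video):
    """Spatial gradient magnitude; a crude placeholder, NOT real HOG features."""
    gy, gx = np.gradient(video, axis=(0, 1))
    return np.hypot(gy, gx)

def fused_similarity(video_a, video_b, rank=3, alpha=0.5):
    """Weighted-sum fusion of the two similarity scores (assumed fusion rule)."""
    s_raw = cca_similarity(factor_bases(video_a, rank),
                           factor_bases(video_b, rank))
    s_feat = cca_similarity(factor_bases(gradient_stand_in(video_a), rank),
                            factor_bases(gradient_stand_in(video_b), rank))
    return alpha * s_raw + (1.0 - alpha) * s_feat

# Toy volumes of shape (height, width, frames), already size-normalized.
rng = np.random.default_rng(0)
a = rng.random((16, 16, 8))
b = rng.random((16, 16, 8))
print(fused_similarity(a, a))  # close to 1.0: identical subspaces
print(fused_similarity(a, b))  # lower: unrelated random volumes
```

Under these assumptions, a query video would be assigned the label of the training video with the highest fused score. Both volumes must share the same normalized size so that the subspaces of each mode lie in the same ambient space.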
Cite this article
Hsieh, CY., Lin, WY. Video-based human action and hand gesture recognition by fusing factored matrices of dual tensors. Multimed Tools Appl 76, 7575–7594 (2017). https://doi.org/10.1007/s11042-016-3407-1