Action recognition with multi-scale trajectory-pooled 3D convolutional descriptors
Abstract
Hand-crafted and learning-based features are the two main types of video representation in video understanding. How to combine their merits to design good descriptors has recently become an active research topic. Motivated by TDD (Wang et al. 2015), we combine trajectory pooling with 3D ConvNets (Tran et al. 2015) and propose a novel multi-scale trajectory-pooled 3D convolutional descriptor (MTC3D) for action recognition. Specifically, we compute multi-scale dense trajectories from the input video and perform trajectory pooling on the feature maps of a 3D CNN. The proposed descriptor has two advantages: the 3D CNN extracts high-level semantic information from videos, and multi-scale trajectory pooling makes effective use of their temporal information. Experiments on the HMDB51 and UCF101 datasets demonstrate that the proposed descriptor achieves state-of-the-art results.
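The pooling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no details, so the sum-pooling rule, the `[channel][t][y][x]` feature-map layout, and the names `trajectory_pool`, `scale`, `spatial_ratio` and `temporal_ratio` are all assumptions made for illustration. The idea shown is only that each trajectory point is mapped from video coordinates into the coarser 3D feature map and the activations found there are accumulated per channel.

```python
from typing import List, Sequence, Tuple

# Hypothetical types for illustration only.
Point = Tuple[int, float, float]            # (frame index, x, y) in video coordinates
FeatureMap = List[List[List[List[float]]]]  # indexed as [channel][t][y][x]


def trajectory_pool(
    fmap: FeatureMap,
    trajectory: Sequence[Point],
    spatial_ratio: float,
    temporal_ratio: float,
    scale: float = 1.0,
) -> List[float]:
    """Sum-pool feature-map activations along one trajectory.

    spatial_ratio and temporal_ratio map video coordinates to feature-map
    coordinates (assumed: feature-map size divided by input size).
    scale is an assumed factor applied to the trajectory coordinates when
    the trajectory was extracted at a different spatial scale.
    """
    n_t, n_y, n_x = len(fmap[0]), len(fmap[0][0]), len(fmap[0][0][0])
    pooled = [0.0] * len(fmap)
    for frame, x, y in trajectory:
        # Map the point into the feature map and clamp it to the valid range.
        t = min(int(frame * temporal_ratio), n_t - 1)
        row = min(int(y * scale * spatial_ratio), n_y - 1)
        col = min(int(x * scale * spatial_ratio), n_x - 1)
        # Accumulate the activation of every channel at that location.
        for ch in range(len(fmap)):
            pooled[ch] += fmap[ch][t][row][col]
    return pooled
```

Under these assumptions, calling `trajectory_pool` once per trajectory yields one vector per trajectory with as many entries as the feature map has channels. How the paper combines the vectors obtained at different scales is not stated in the abstract and is left open here.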
Keywords
Trajectory pooling, 3D ConvNets, Action recognition
Notes
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61472103).
References
- 1. Aggarwal JK, Ryoo MS (2011) Human activity analysis: a review. ACM Comput Surv (CSUR) 43(3):16
- 2. Bay H, Tuytelaars T, Van Gool L (2006) Surf: speeded up robust features. In: Computer vision–ECCV 2006, pp 404–417
- 3. Boiman O, Irani M (2007) Detecting irregularities in images and in video. Int J Comput Vis 74(1):17–31
- 4. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE Computer society conference on computer vision and pattern recognition, 2005. CVPR 2005, vol 1. IEEE, pp 886–893
- 5. Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: Computer vision–ECCV 2006, pp 428–441
- 6. Demiris Y, Khadhouri B (2006) Hierarchical attentive multiple models for execution and recognition of actions. Robot Autonom Syst 54(5):361–369
- 7. Diba A, Sharma V, Van Gool L (2016) Deep temporal linear encoding networks. arXiv:1611.06678
- 8. Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
- 9. Fanello SR, Gori I, Metta G, Odone F (2013) Keep it simple and sparse: real-time action recognition. J Mach Learn Res 14(1):2617–2640
- 10. Fei-Fei L, Perona P (2005) A bayesian hierarchical model for learning natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol 2. IEEE, pp 524–531
- 11. Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 24(6):381–395
- 12. Graves A, Jaitly N (2014) Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1764–1772
- 13. Harris C, Stephens M (1988) A combined corner and edge detector. In: Alvey vision conference, vol 15, no 50. Manchester, pp 5210–5244
- 14. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
- 15. Jhuang H, Serre T, Wolf L, Poggio T (2007) A biologically inspired system for action recognition. In: IEEE 11th international conference on computer vision, 2007. ICCV 2007. IEEE, pp 1–8
- 16. Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
- 17. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732
- 18. Klaser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3d-gradients. In: BMVC 2008-19th British machine vision conference. British Machine Vision Association, pp 275–1
- 19. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
- 20. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) Hmdb: a large video database for human motion recognition. In: 2011 IEEE international conference on computer vision (ICCV). IEEE, pp 2556–2563
- 21. Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE conference on computer vision and pattern recognition, 2008. CVPR 2008. IEEE, pp 1–8
- 22. Le QV, Zou WY, Yeung SY, Ng AY (2011) Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: 2011 IEEE Conference on computer vision and pattern recognition (CVPR). IEEE, pp 3361–3368
- 23. Liu AA, Su YT, Nie WZ, Kankanhalli M (2017) Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans Pattern Anal Mach Intell 39(1):102–114
- 24. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
- 25. Lu X, Yao H, Sun X, Zhang S, Zhang Y (2017) Trajectory-pooled 3d convolutional descriptors for action recognition. In: Pacific rim conference on multimedia
- 26. Nie W, Liu A, Li W, Su Y (2016) Cross-view action recognition by cross-domain learning. Image Vis Comput 55:109–118
- 27. Poppe R (2010) A survey on vision-based human action recognition. Image Vis Comput 28(6):976–990
- 28. Rautaray SS, Agrawal A (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artif Intell Rev 43(1):1–54
- 29. Sánchez J, Perronnin F, Mensink T, Verbeek J (2013) Image classification with the fisher vector: theory and practice. Int J Comput Vis 105(3):222–245
- 30. Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th international conference on multimedia. ACM, pp 357–360
- 31. Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv:1511.04119
- 32. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
- 33. Snoek CG, Worring M (2008) Concept-based video retrieval. Found Trends Inf Retriev 2(4):215–322
- 34. Soomro K, Zamir AR, Shah M (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
- 35. Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning, pp 843–852
- 36. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems, pp 3104–3112
- 37. Szeliski R (2006) Image alignment and stitching: a tutorial. Found Trends Comput Graph Vis 2(1):1–104
- 38. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497
- 39. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
- 40. Wang H, Kläser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: 2011 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 3169–3176
- 41. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
- 42. Wang F, Qi S, Gao G, Zhao S, Wang X (2016) Logo information recognition in large-scale social media data. Multimed Syst 22(1):63–73
- 43. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision. pp 20–36
- 44. Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4694–4702
- 45. Zhao S, Chen L, Yao H, Zhang Y, Sun X (2015) Strategy for dynamic 3d depth data matching towards robust action retrieval. Neurocomputing 151:533–543
- 46. Zhao S, Yao H, Gao Y, Ji R, Xie W, Jiang X, Chua TS (2016) Predicting personalized emotion perceptions of social images. In: Proceedings of the 2016 ACM on multimedia conference. ACM, pp 1385–1394
- 47. Zhao S, Yao H, Gao Y, Ji R, Ding G (2017) Continuous probability distribution prediction of image emotions via multitask shared sparse regression. IEEE Trans Multimed 19(3):632–645
- 48. Zhu Y, Zhao X, Fu Y, Liu Y (2011) Sparse coding on local spatial-temporal volumes for human action recognition. Comput Vis–ACCV 2010:660–671