Abstract
Video representation is a central problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation with multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from the input videos, extract local features, and encode them using Fisher vectors (FVs). The second layer compresses the FVs of the subvolumes obtained in the previous layer and then encodes them again with Fisher vectors. Compared with the standard FV, SFV refines the representation and abstracts semantic information in a hierarchical way. Compared with recent mid-level action representations, SFV does not need to mine discriminative action parts, yet it preserves mid-level information through the Fisher vector encoding in the higher layer. We evaluate the proposed method on three challenging datasets, namely YouTube, J-HMDB, and HMDB51. Experimental results demonstrate the effectiveness of SFV, and the combination of the traditional FV and SFV outperforms state-of-the-art methods on these datasets by a large margin.
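The two-layer encoding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It makes several assumptions that the abstract does not state: the local descriptors are random synthetic data, the GMM is fitted with a crude k-means routine (`fit_diag_gmm`, a hypothetical helper), the Fisher vector uses the common power- and L2-normalized form, and plain PCA stands in for the compression step between the layers. All function names and sizes are illustrative.

```python
import numpy as np

def fit_diag_gmm(X, k, rng, iters=10):
    """Crude diagonal GMM (assumption): k-means hard assignments, then per-cluster stats."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(-1)   # (N, K)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    weights = np.array([(labels == j).mean() for j in range(k)]) + 1e-6
    weights /= weights.sum()
    var = np.stack([
        X[labels == j].var(0) if np.any(labels == j) else np.ones(X.shape[1])
        for j in range(k)
    ]) + 1e-3                                                    # avoid zero variance
    return weights, centers, var

def fisher_vector(X, gmm):
    """Fisher vector of descriptors X (N, D) under a diagonal GMM; output has 2*K*D dims."""
    w, mu, var = gmm
    sigma = np.sqrt(var)
    diff = (X[:, None, :] - mu[None]) / sigma[None]              # (N, K, D)
    log_p = -0.5 * (diff ** 2).sum(-1) - np.log(sigma).sum(-1)[None] + np.log(w)[None]
    log_p -= log_p.max(1, keepdims=True)                         # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(1, keepdims=True)                         # soft assignments (N, K)
    n = len(X)
    g_mu = (gamma[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                       # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                     # L2 normalization

rng = np.random.default_rng(0)
D, K1, K2, d2 = 8, 4, 2, 6                                       # illustrative sizes

# Synthetic stand-in for local features from 12 densely sampled subvolumes.
subvolumes = [rng.normal(size=(50, D)) + rng.normal(size=D) for _ in range(12)]

# Layer 1: encode each subvolume's local features with a Fisher vector.
gmm1 = fit_diag_gmm(np.vstack(subvolumes), K1, rng)
local_fvs = np.stack([fisher_vector(s, gmm1) for s in subvolumes])   # (12, 2*K1*D)

# Compression between layers: PCA via SVD (an assumption, not the paper's method).
centered = local_fvs - local_fvs.mean(0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
compressed = centered @ vt[:d2].T                                # (12, d2)

# Layer 2: encode the compressed subvolume FVs again with a Fisher vector.
gmm2 = fit_diag_gmm(compressed, K2, rng)
sfv = fisher_vector(compressed, gmm2)                            # 2*K2*d2 dims
print(sfv.shape)
```

In this sketch the final video-level vector has 2 * K2 * d2 dimensions, so the compression step directly controls the size of the second-layer representation.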
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Peng, X., Zou, C., Qiao, Y., Peng, Q. (2014). Action Recognition with Stacked Fisher Vectors. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer, Cham. https://doi.org/10.1007/978-3-319-10602-1_38
DOI: https://doi.org/10.1007/978-3-319-10602-1_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10601-4
Online ISBN: 978-3-319-10602-1
eBook Packages: Computer Science, Computer Science (R0)