With the emergence of a large number of video resources, video action recognition is attracting much attention. Recently, recognizing the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, little attention has been paid to reducing the model size and computation cost of 3D CNNs. In this paper, we first propose a novel 3D convolution called the Xwise Separable Convolution, and then construct an original 3D CNN called the XwiseNet. Our work aims to make 3D CNNs lightweight without reducing their recognition accuracy. Our key idea is to decouple the 3D convolution as far as possible along the channel, spatial and temporal dimensions. Experiments verify that the XwiseNet outperforms 3D-ResNet-50 on the Mini-Kinetics benchmark while using only 6% of its parameters and 12% of its computation cost.
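To give a feel for why decoupling the channel, spatial and temporal dimensions shrinks a 3D convolution, the sketch below counts weights for a standard 3D convolution and for one possible fully decoupled variant. The factorization used here (a pointwise 1×1×1 convolution, a depthwise 1×k×k spatial convolution, and a depthwise k×1×1 temporal convolution) is an assumption made for illustration; the abstract does not specify the exact layout of the Xwise Separable Convolution, and the function names are hypothetical. The numbers it produces are for a single layer and are not meant to reproduce the 6% and 12% figures reported for the whole network.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution with the given kernel (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kt * kh * kw


def standard_params(c_in, c_out, k=3):
    """Standard k x k x k 3D convolution mixing all dimensions at once."""
    return conv3d_params(c_in, c_out, k, k, k)


def decoupled_params(c_in, c_out, k=3):
    """Assumed decoupling (illustrative only, not the paper's definition)."""
    # Channel dimension: pointwise 1x1x1 convolution mixes channels only.
    channel = conv3d_params(c_in, c_out, 1, 1, 1)
    # Spatial dimension: depthwise 1 x k x k convolution, one filter per channel.
    spatial = conv3d_params(c_out, c_out, 1, k, k, groups=c_out)
    # Temporal dimension: depthwise k x 1 x 1 convolution, one filter per channel.
    temporal = conv3d_params(c_out, c_out, k, 1, 1, groups=c_out)
    return channel + spatial + temporal


if __name__ == "__main__":
    full = standard_params(64, 64)
    light = decoupled_params(64, 64)
    print(f"standard:  {full} weights")
    print(f"decoupled: {light} weights ({100 * light / full:.1f}% of standard)")
```

Under these assumptions, most of the remaining weights sit in the pointwise channel-mixing step, which is typical of depthwise-separable designs.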
This work was supported in part by the Natural Science Foundation of China under Grants U1536203 and 61972169, in part by the National Key Research and Development Program of China (2016QY01W0200), and in part by the Major Scientific and Technological Project of Hubei Province (2018AAA068 and 2019AAA051).
About this article
Cite this article
Ling, H., Chen, Y., Chen, J. et al. XwiseNet: action recognition with Xwise separable convolutions. Multimed Tools Appl (2020). https://doi.org/10.1007/s11042-020-09137-5
- Action recognition
- Deep learning
- Three-dimensional convolutional neural networks