XwiseNet: action recognition with Xwise separable convolutions


With the emergence of a large number of video resources, video action recognition is attracting much attention. Following the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, little attention has been paid to reducing the model size and computation cost of 3D CNNs. In this paper, we first propose a novel 3D convolution called the Xwise Separable Convolution, and then construct an original 3D CNN called XwiseNet. Our work aims to make 3D CNNs lightweight without reducing their recognition accuracy. Our key idea is to decouple the 3D convolution as far as possible along the channel, spatial and temporal dimensions. Experiments verify that XwiseNet outperforms 3D-ResNet-50 on the Mini-Kinetics benchmark with only 6% of its trainable parameters and 12% of its computation cost.
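To make the decoupling idea concrete, the sketch below counts the weights of a single standard 3D convolution and of one possible fully decoupled replacement. The abstract does not specify the exact layout of the Xwise Separable Convolution, so the factorization used here (a 1×1×1 pointwise convolution across channels, followed by a depthwise 1×k×k spatial convolution and a depthwise k×1×1 temporal convolution) is an assumption for illustration only, and the function names are hypothetical. The resulting numbers describe one layer and are not the figures reported in the paper.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, groups=1):
    """Weight count of a 3D convolution with a (kt, kh, kw) kernel, bias omitted."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kt * kh * kw


def standard_3d(c_in, c_out, k=3):
    """A full k x k x k convolution that mixes channels, space and time at once."""
    return conv3d_params(c_in, c_out, k, k, k)


def decoupled_3d(c_in, c_out, k=3):
    """Assumed decoupling (not confirmed by the source):
    pointwise channel mixing, then depthwise spatial, then depthwise temporal."""
    pointwise = conv3d_params(c_in, c_out, 1, 1, 1)
    spatial = conv3d_params(c_out, c_out, 1, k, k, groups=c_out)
    temporal = conv3d_params(c_out, c_out, k, 1, 1, groups=c_out)
    return pointwise + spatial + temporal


if __name__ == "__main__":
    full = standard_3d(64, 64)
    light = decoupled_3d(64, 64)
    print(f"standard: {full}, decoupled: {light}, ratio: {light / full:.3f}")
```

For a layer with 64 input and 64 output channels and k = 3, the standard convolution has 110592 weights and this assumed decoupled form has 4864, which illustrates why separating the three dimensions can shrink a 3D CNN substantially.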






This work was supported in part by the Natural Science Foundation of China under Grants U1536203 and 61972169, in part by the National Key Research and Development Program of China (2016QY01W0200), and in part by the Major Scientific and Technological Project of Hubei Province (2018AAA068 and 2019AAA051).

Author information



Corresponding author

Correspondence to Yao Chen.


Cite this article

Ling, H., Chen, Y., Chen, J. et al. XwiseNet: action recognition with Xwise separable convolutions. Multimed Tools Appl (2020). https://doi.org/10.1007/s11042-020-09137-5



  • Action recognition
  • Deep learning
  • Three-dimensional convolutional neural networks
  • Lightweight