Learning Spatiotemporal 3D Convolution with Video Order Self-supervision

  • Tomoyuki SuzukiEmail author
  • Takahiro Itazuri
  • Kensho Hara
  • Hirokatsu Kataoka
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11130)


The purpose of this work is to explore self-supervised learning (SSL) strategy to capture a better feature with spatiotemporal 3D convolution. Although one of the next frontier in video recognition must be spatiotemporal 3D CNN, the convergence of the 3D convolutions is really difficult because of their enormous parameters or missing temporal(motion) feature. One of the effective solutions is to collect a \(10^5\)-order video database such as Kinetics/Moments in Time. However, this is not an efficient with burden of manual annotations. In the paper, we train 3D CNN on wrong video-sequence detection tasks in a self-supervised manner (without any manual annotation). The shuffling and verification of consecutive video-frame-order is effective for 3D CNN to capture temporal feature and get a good start point of parameters to be fine-tuned. In the experimental section, we verify that our pretrained 3D CNN on wrong clip detection improves the level of performance on UCF101 (\(+3.99\%\) better than baseline, namely training 3D convolution from scratch).


3D Convolutional Neural Network Self-supervised learning Motion feature Human action recognition 


  1. 1.
    Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733. IEEE (2017)Google Scholar
  2. 2.
    Fernando, B., Bilen, H., Gavves, E., Gould, S.: Self-supervised video representation learning with odd-one-out networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5729–5738. IEEE (2017)Google Scholar
  3. 3.
    Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
  4. 4.
    Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 18–22 (2018)Google Scholar
  5. 5.
    Huang, D.A., et al.: What makes a video a video: analyzing temporal information in video understanding models and datasets. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7366–7375 (2018)Google Scholar
  6. 6.
    Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  7. 7.
    Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)Google Scholar
  8. 8.
    Lee, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Unsupervised representation learning by sorting sequences. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 667–676. IEEE (2017)Google Scholar
  9. 9.
    Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 527–544. Springer, Cham (2016). Scholar
  10. 10.
    Monfort, M., et al.: Moments in time dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150 (2018)
  11. 11.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)Google Scholar
  12. 12.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Tomoyuki Suzuki
    • 1
    Email author
  • Takahiro Itazuri
    • 2
  • Kensho Hara
    • 1
  • Hirokatsu Kataoka
    • 1
  1. 1.National Institute of Advanced Industrial Science and Technology (AIST)TokyoJapan
  2. 2.Waseda UniversityTokyoJapan

Personalised recommendations