Using Phase Instead of Optical Flow for Action Recognition

  • Omar Hommos
  • Silvia L. Pintea
  • Pascal S. M. Mettes
  • Jan C. van Gemert
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11134)


Currently, the most common motion representation for action recognition is optical flow. Optical flow is based on particle tracking, which adheres to a Lagrangian perspective on dynamics. In contrast, the Eulerian model of dynamics does not track, but describes local changes. For video, an Eulerian phase-based motion representation using complex steerable filters has recently been employed successfully for motion magnification and video frame interpolation. Inspired by these works, we propose learning Eulerian motion representations in a deep architecture for action recognition. We learn filters in the complex domain in an end-to-end manner. We design these complex filters to resemble complex Gabor filters, which are typically employed for phase-information extraction. We propose a phase-information extraction module, based on these complex filters, that can be used in any network architecture for extracting Eulerian representations. We experimentally analyze the added value of Eulerian motion representations, as extracted by our proposed phase extraction module, and compare them with existing motion representations based on optical flow on the UCF101 dataset.
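
To make the Eulerian idea concrete, the following is a minimal one-dimensional sketch of phase-based motion measurement with a fixed, hand-crafted complex Gabor filter. It is not the learned phase extraction module proposed in the paper; the function names (`gabor_kernel`, `filter_response`, `phase_difference`), the filter parameters, and the synthetic cosine "frames" are all assumptions chosen for illustration. The sketch only shows that the temporal phase change at a fixed location encodes local displacement without tracking anything.

```python
import cmath
import math


def gabor_kernel(omega, sigma, radius):
    """Complex 1D Gabor filter: Gaussian envelope times a complex exponential."""
    return [
        math.exp(-(x * x) / (2.0 * sigma * sigma)) * cmath.exp(1j * omega * x)
        for x in range(-radius, radius + 1)
    ]


def filter_response(signal, kernel, center):
    """Complex response (correlation) of the kernel centered at one position."""
    radius = len(kernel) // 2
    return sum(
        kernel[k + radius].conjugate() * signal[center + k]
        for k in range(-radius, radius + 1)
    )


def phase_difference(frame_a, frame_b, kernel, center):
    """Wrapped temporal phase difference at a fixed location (no tracking)."""
    ra = filter_response(frame_a, kernel, center)
    rb = filter_response(frame_b, kernel, center)
    # The phase of rb * conj(ra) equals phase(rb) - phase(ra), wrapped to (-pi, pi].
    return cmath.phase(rb * ra.conjugate())


# Assumed filter tuning: spatial period of 16 samples.
omega = 2.0 * math.pi / 16.0
kernel = gabor_kernel(omega, sigma=6.0, radius=18)

# Two synthetic "frames": the second is the first shifted right by `shift` samples.
shift = 2  # hypothetical displacement, for illustration only
frame_a = [math.cos(omega * x) for x in range(128)]
frame_b = [math.cos(omega * (x - shift)) for x in range(128)]

# A rightward shift lowers the local phase by omega * shift.
dphi = phase_difference(frame_a, frame_b, kernel, center=64)
estimated_shift = -dphi / omega
print(f"phase difference: {dphi:.3f} rad, estimated shift: {estimated_shift:.2f} samples")
```

In this toy setting the phase difference is measured at a single fixed position, which is the Eulerian view; an optical-flow method would instead try to follow the pattern from one frame to the next. The estimate is only unambiguous while the displacement stays below half the filter's spatial period, since the phase wraps.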


Motion representation · Phase derivatives · Eulerian motion representation · Action recognition



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Omar Hommos (1)
  • Silvia L. Pintea (1, corresponding author)
  • Pascal S. M. Mettes (2)
  • Jan C. van Gemert (1)
  1. Computer Vision Lab, Delft University of Technology, Delft, Netherlands
  2. Intelligent Sensory Interactive Systems, University of Amsterdam, Amsterdam, Netherlands
