
On the Integration of Optical Flow and Action Recognition

  • Laura Sevilla-Lara
  • Yiyi Liao
  • Fatma Güney
  • Varun Jampani
  • Andreas Geiger
  • Michael J. Black
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11269)

Abstract

Most of the top-performing action recognition methods use optical flow as a “black box” input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: (1) optical flow is useful for action recognition because it is invariant to appearance, (2) optical flow methods are optimized to minimize end-point error (EPE), but the EPE of current methods is not well correlated with action recognition performance, (3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, (4) training optical flow to minimize classification error instead of EPE improves recognition performance, and (5) optical flow learned for the task of action recognition differs from traditional optical flow, especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.
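
The two objectives at the heart of the abstract are the end-point error (EPE), the standard benchmark metric for optical flow, and the classification loss used in observation (4) to fine-tune a flow network end-to-end. The following minimal PyTorch sketch contrasts the two; it is not the authors' code, and `flow_net` and `action_net` are hypothetical stand-ins for a learned flow estimator (e.g., FlowNet) and an action classifier (e.g., the temporal stream of a two-stream network):

    # Minimal sketch (assumed PyTorch), contrasting the EPE objective with
    # the end-to-end classification objective of observation (4).
    import torch
    import torch.nn.functional as F

    def epe(flow_pred, flow_gt):
        # Mean end-point error: the Euclidean distance between predicted
        # and ground-truth flow vectors, averaged over all pixels and the
        # batch. Both tensors have shape (B, 2, H, W); the channel
        # dimension holds the horizontal and vertical flow components.
        return torch.norm(flow_pred - flow_gt, p=2, dim=1).mean()

    def classification_loss(flow_net, action_net, frames, labels):
        # Instead of supervising the flow network with EPE against ground
        # truth, backpropagate the action classifier's cross-entropy
        # through the differentiable flow estimate. `flow_net` and
        # `action_net` are hypothetical modules, not names from the paper.
        flow = flow_net(frames)    # (B, 2, H, W) estimated optical flow
        logits = action_net(flow)  # (B, num_classes) action scores
        return F.cross_entropy(logits, labels)

Because the flow estimator is differentiable, the second objective lets recognition accuracy, rather than EPE, drive the learned motion representation, which is why the resulting flow can diverge from ground-truth flow inside the human body and at its boundary.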

Keywords

Optical flow · Action recognition · Video understanding


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Facebook, Menlo Park, USA
  2. Max Planck Institute for Intelligent Systems, Tübingen, Germany
  3. Zhejiang University, Hangzhou, China
  4. Oxford University, Oxford, UK
  5. NVIDIA, Santa Clara, USA
  6. University of Tuebingen, Tübingen, Germany
