
Learning to Extract Motion from Videos in Convolutional Neural Networks

  • Conference paper
  • First Online:
Computer Vision – ACCV 2016 (ACCV 2016)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10115)


Abstract

This paper shows how to extract dense optical flow from videos with a convolutional neural network (CNN). The proposed model constitutes a potential building block for deeper architectures that use motion without resorting to an external algorithm, e.g., for recognition in videos. We derive our network architecture from signal processing principles to provide the desired invariances to image contrast, phase, and texture. We constrain weights within the network to enforce strict rotation invariance and to substantially reduce the number of parameters to learn. We demonstrate end-to-end training on only 8 sequences of the Middlebury dataset, orders of magnitude fewer than competing CNN-based motion estimation methods require, and obtain performance comparable to classical methods on the Middlebury benchmark. Importantly, our method outputs a distributed representation of motion that can represent multiple, transparent motions and dynamic textures. Our contributions on network design and rotation invariance offer insights that are not specific to motion estimation.
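
To make the abstract's weight-constraint idea concrete, the sketch below shows one standard way to tie convolutional weights across orientations: every filter in a bank is a rotated copy of a single learned base kernel, so the parameter count shrinks by the number of orientations and rotating the input approximately permutes the output channels. This is an illustrative NumPy/SciPy sketch, not the paper's architecture; base_kernel, n_orientations, and the toy edge kernel are assumed names and values.

    import numpy as np
    from scipy.ndimage import rotate
    from scipy.signal import convolve2d

    def rotated_filter_bank(base_kernel, n_orientations=8):
        # Tie weights across orientations: each filter is a rotated copy
        # of one base kernel, so only that kernel's parameters are learned.
        angles = np.arange(n_orientations) * 360.0 / n_orientations
        return [rotate(base_kernel, a, reshape=False, order=1) for a in angles]

    def oriented_responses(image, base_kernel, n_orientations=8):
        # Convolving with every rotated copy makes the channel stack
        # (approximately) permute when the input image is rotated.
        bank = rotated_filter_bank(base_kernel, n_orientations)
        return np.stack([convolve2d(image, k, mode='same') for k in bank])

    # Toy usage: a 7x7 oriented-edge base kernel applied at 8 orientations.
    rng = np.random.default_rng(0)
    base = np.outer(np.hanning(7), np.gradient(np.hanning(7)))
    responses = oriented_responses(rng.standard_normal((64, 64)), base)
    print(responses.shape)  # (8, 64, 64)

The paper derives its constraint so that strict rotation invariance holds by construction; a bank like the one above only achieves equivariance up to the interpolation error introduced at non-axis-aligned angles.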


Notes

  1. The general formulation applies to convolutional layers as well as to our pixelwise weights (Eqs. 9, 11), in which case the 2D rotation of the kernel has no effect.
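
The footnote's observation, that the rotation constraint is vacuous for pixelwise weights, is easy to check numerically: a 1x1 kernel is unchanged by any 2D rotation, while a generic 3x3 kernel is not. A minimal check, using scipy.ndimage.rotate on assumed example values:

    import numpy as np
    from scipy.ndimage import rotate

    # An asymmetric 3x3 kernel changes under a 90-degree rotation ...
    k3x3 = np.arange(9, dtype=float).reshape(3, 3)
    print(np.allclose(rotate(k3x3, 90, reshape=False), k3x3))   # False

    # ... but a 1x1 "pixelwise" kernel maps to itself, so the
    # rotation-tying constraint imposes nothing on such weights.
    k1x1 = np.array([[2.5]])
    print(np.allclose(rotate(k1x1, 90, reshape=False), k1x1))   # True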


Author information

Corresponding author

Correspondence to Damien Teney.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Teney, D., Hebert, M. (2017). Learning to Extract Motion from Videos in Convolutional Neural Networks. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science, vol 10115. Springer, Cham. https://doi.org/10.1007/978-3-319-54193-8_26


  • DOI: https://doi.org/10.1007/978-3-319-54193-8_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54192-1

  • Online ISBN: 978-3-319-54193-8

  • eBook Packages: Computer Science, Computer Science (R0)
