
Learning to Extract Motion from Videos in Convolutional Neural Networks

  • Conference paper
  • First Online:
Computer Vision – ACCV 2016 (ACCV 2016)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10115)


Abstract

This paper shows how to extract dense optical flow from videos with a convolutional neural network (CNN). The proposed model constitutes a potential building block for deeper architectures that use motion without resorting to an external algorithm, e.g., for recognition in videos. We derive our network architecture from signal processing principles to provide the desired invariances to image contrast, phase, and texture. We constrain weights within the network to enforce strict rotation invariance and to substantially reduce the number of parameters to learn. We demonstrate end-to-end training on only 8 sequences of the Middlebury dataset, orders of magnitude fewer than competing CNN-based motion estimation methods require, and obtain performance comparable to classical methods on the Middlebury benchmark. Importantly, our method outputs a distributed representation of motion that can represent multiple, transparent motions and dynamic textures. Our contributions on network design and rotation invariance offer insights that are not specific to motion estimation.
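
To make the abstract's weight-constraint idea concrete, the sketch below shows one standard way to tie convolutional weights across orientations: every filter in a bank is a rotated copy of a single learned base kernel, so the parameter count shrinks by the number of orientations and rotating the input approximately permutes the output channels. This is an illustrative NumPy/SciPy sketch, not the paper's architecture; base_kernel, n_orientations, and the toy edge kernel are assumed names and values.

    import numpy as np
    from scipy.ndimage import rotate
    from scipy.signal import convolve2d

    def rotated_filter_bank(base_kernel, n_orientations=8):
        # Tie weights across orientations: each filter is a rotated copy
        # of one base kernel, so only that kernel's parameters are learned.
        angles = np.arange(n_orientations) * 360.0 / n_orientations
        return [rotate(base_kernel, a, reshape=False, order=1) for a in angles]

    def oriented_responses(image, base_kernel, n_orientations=8):
        # Convolving with every rotated copy makes the channel stack
        # (approximately) permute when the input image is rotated.
        bank = rotated_filter_bank(base_kernel, n_orientations)
        return np.stack([convolve2d(image, k, mode='same') for k in bank])

    # Toy usage: a 7x7 oriented-edge base kernel applied at 8 orientations.
    rng = np.random.default_rng(0)
    base = np.outer(np.hanning(7), np.gradient(np.hanning(7)))
    responses = oriented_responses(rng.standard_normal((64, 64)), base)
    print(responses.shape)  # (8, 64, 64)

The paper derives its constraint so that strict rotation invariance holds by construction; a bank like the one above only achieves equivariance up to the interpolation error introduced at non-axis-aligned angles.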


Notes

  1. The general formulation applies to convolutional layers as well as to our pixelwise weights (Eqs. 9, 11), in which case the 2D rotation of the kernel has no effect.
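
The footnote's observation, that the rotation constraint is vacuous for pixelwise weights, is easy to check numerically: a 1x1 kernel is unchanged by any 2D rotation, while a generic 3x3 kernel is not. A minimal check, using scipy.ndimage.rotate on assumed example values:

    import numpy as np
    from scipy.ndimage import rotate

    # An asymmetric 3x3 kernel changes under a 90-degree rotation ...
    k3x3 = np.arange(9, dtype=float).reshape(3, 3)
    print(np.allclose(rotate(k3x3, 90, reshape=False), k3x3))   # False

    # ... but a 1x1 "pixelwise" kernel maps to itself, so the
    # rotation-tying constraint imposes nothing on such weights.
    k1x1 = np.array([[2.5]])
    print(np.allclose(rotate(k1x1, 90, reshape=False), k1x1))   # True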


Author information

Corresponding author

Correspondence to Damien Teney.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Teney, D., Hebert, M. (2017). Learning to Extract Motion from Videos in Convolutional Neural Networks. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science, vol 10115. Springer, Cham. https://doi.org/10.1007/978-3-319-54193-8_26


  • DOI: https://doi.org/10.1007/978-3-319-54193-8_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54192-1

  • Online ISBN: 978-3-319-54193-8

  • eBook Packages: Computer Science, Computer Science (R0)
