
Predicting Video Frames Using Feature Based Locally Guided Objectives

  • Prateep Bhattacharjee
  • Sukhendu Das
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11364)

Abstract

This paper presents a feature-reconstruction-based approach using Generative Adversarial Networks (GANs) for predicting future frames from natural video scenes. Recent GAN-based methods often generate blurry outputs and fail in the case of long-range prediction. Our proposed method incorporates an intermediate feature-generating GAN to minimize the disparity between the ground-truth and predicted outputs. For this, we propose two novel objective functions: (a) Locally Guided Gram Loss (LGGL) and (b) Multi-Scale Correlation Loss (MSCL), which further enhance the quality of the predicted frames. LGGL aids the feature-generating GAN in maximizing the similarity between the intermediate features of the ground truth and the network output by constructing Gram matrices from locally extracted patches over several levels of the generator. MSCL incorporates a correlation-based objective to effectively model the temporal relationships between the predicted and ground-truth frames at the frame-generating stage. Our proposed model is end-to-end trainable and exhibits superior performance compared to the state of the art on four real-world benchmark video datasets.
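
The idea behind LGGL can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch rendering (the names local_gram and lggl, the patch size, stride and choice of generator levels are illustrative assumptions, not taken from the paper): it unfolds each intermediate feature map into local patches, builds one Gram matrix per patch, and penalizes the mean squared difference between the Gram matrices of the predicted and ground-truth features, summed over levels.

import torch
import torch.nn.functional as F

def local_gram(feat, patch_size=8, stride=8):
    # feat: (B, C, H, W) feature map from one level of the generator.
    B, C, H, W = feat.shape
    # Unfold into local patches: (B, C * patch_size**2, P), with P patches.
    patches = F.unfold(feat, kernel_size=patch_size, stride=stride)
    P = patches.shape[-1]
    # Rearrange into one (C, patch_size**2) matrix per patch: (B * P, C, k*k).
    patches = patches.view(B, C, patch_size * patch_size, P)
    patches = patches.permute(0, 3, 1, 2).reshape(B * P, C, -1)
    # Gram matrix per patch, normalised by the number of elements in the patch.
    gram = torch.bmm(patches, patches.transpose(1, 2)) / patches.shape[-1]
    return gram.view(B, P, C, C)

def lggl(pred_feats, gt_feats, patch_size=8, stride=8):
    # pred_feats, gt_feats: lists of (B, C, H, W) intermediate feature maps
    # taken from several levels of the feature-generating GAN.
    loss = 0.0
    for fp, fg in zip(pred_feats, gt_feats):
        loss = loss + F.mse_loss(local_gram(fp, patch_size, stride),
                                 local_gram(fg, patch_size, stride))
    return loss

MSCL would analogously rely on a correlation-based measure (e.g. normalized cross-correlation) between predicted and ground-truth frames computed at multiple scales at the frame-generating stage.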

Keywords

Video frame prediction · GANs · Correlation loss · Guided Gram loss

Supplementary material

Supplementary material 1: 484519_1_En_42_MOESM1_ESM.pdf (PDF, 3,052 KB)


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Visualization and Perception Laboratory, Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India
