Forecasting Hands and Objects in Future Frames

  • Chenyou Fan
  • Jangwon Lee
  • Michael S. Ryoo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)

Abstract

This paper presents an approach to forecasting the future presence and location of human hands and objects. Given an image frame, the goal is to predict which objects will appear in a future frame (e.g., 5 seconds later) and where they will be located, even when they are not visible in the current frame. The key ideas are that (1) an intermediate representation of a convolutional object recognition model abstracts the scene information in its frame, and (2) we can predict (i.e., regress) the representations corresponding to future frames from that of the current frame. We present a new two-stream fully convolutional neural network (CNN) architecture designed to forecast future objects given a video. The experiments confirm that our approach allows reliable estimation of future objects in videos, obtaining much higher accuracy than the state-of-the-art future object presence forecast method on public datasets.
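The core idea of regressing a future frame's intermediate representation from the current frame's representation can be illustrated with a toy sketch. The snippet below is not the authors' code: it stands in a linear least-squares regressor for the paper's fully convolutional network, and uses synthetic flattened "feature maps" in place of real CNN activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "intermediate representations": flattened feature vectors for pairs of
# (current frame, frame ~5 seconds later). In the paper these would be
# activations from a convolutional object recognition model.
n_pairs, feat_dim = 200, 64
current = rng.normal(size=(n_pairs, feat_dim))

# Synthetic ground-truth mapping from current to future representations,
# plus a small amount of noise.
true_transform = rng.normal(size=(feat_dim, feat_dim)) / np.sqrt(feat_dim)
future = current @ true_transform + 0.01 * rng.normal(size=(n_pairs, feat_dim))

# Fit the regressor by least squares (the paper trains a two-stream fully
# convolutional CNN instead of a linear map).
W, *_ = np.linalg.lstsq(current, future, rcond=None)

# Predict the future representation for a held-out frame. A downstream
# detection head would then decode object presence and locations from it.
test_current = rng.normal(size=(1, feat_dim))
pred_future = test_current @ W

# Sanity check: the regression recovers the underlying mapping.
rel_err = np.linalg.norm(W - true_transform) / np.linalg.norm(true_transform)
```

With enough training pairs relative to the feature dimensionality, the recovered mapping `W` closely matches the true transform, which is the same intuition behind training a network to regress future representations from current ones.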

Keywords

Future location forecast · Activity prediction · Object forecast

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Indiana University, Bloomington, USA