Action Anticipation with RBF Kernelized Feature Mapping RNN

  • Yuge Shi
  • Basura Fernando
  • Richard Hartley
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)

Abstract

We introduce a novel Recurrent Neural Network-based algorithm for future video feature generation and action anticipation, called feature mapping RNN. Our RNN architecture builds upon three effective principles of machine learning: parameter sharing, Radial Basis Function (RBF) kernels and adversarial training. Using only a few of the earliest frames of a video, the feature mapping RNN generates future features with a fraction of the parameters needed by a traditional RNN. By feeding these future features into a simple multilayer perceptron equipped with an RBF kernel layer, we are able to accurately predict the action in the video.

In our experiments, we obtain an 18% improvement over the prior state of the art for action anticipation on the JHMDB-21 dataset, 6% on UCF101-24 and 13% on UT-Interaction.
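
The description above maps naturally onto a small amount of code. The following is a minimal sketch, in PyTorch, of the two components the abstract describes: an RBF kernel layer and an autoregressive feature mapping RNN that warms up on the observed frame features and then rolls predicted features forward in time. All class names, dimensions and the Gaussian kernel form are our assumptions for illustration, and the adversarial training objective is omitted; this is not the authors' implementation.

    import torch
    import torch.nn as nn

    class RBFLayer(nn.Module):
        """Gaussian RBF layer: phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2)),
        with learned centres c_j and a learned bandwidth sigma."""
        def __init__(self, in_dim: int, num_centres: int, sigma: float = 1.0):
            super().__init__()
            self.centres = nn.Parameter(torch.randn(num_centres, in_dim))
            self.log_sigma = nn.Parameter(torch.tensor(float(sigma)).log())

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, in_dim) -> kernel responses: (batch, num_centres)
            d2 = torch.cdist(x, self.centres).pow(2)
            return torch.exp(-d2 / (2 * self.log_sigma.exp() ** 2))

    class FeatureMappingRNN(nn.Module):
        """Maps the feature of frame t to a predicted feature for frame t+1.
        One small recurrent cell is shared across all time steps (parameter
        sharing), which keeps the parameter count low."""
        def __init__(self, feat_dim: int, hidden_dim: int, num_centres: int):
            super().__init__()
            self.rbf = RBFLayer(feat_dim, num_centres)
            self.cell = nn.GRUCell(num_centres, hidden_dim)
            self.out = nn.Linear(hidden_dim, feat_dim)

        def forward(self, observed: torch.Tensor, future_steps: int) -> torch.Tensor:
            # observed: (batch, T_obs, feat_dim) CNN features of the earliest frames
            h = observed.new_zeros(observed.size(0), self.cell.hidden_size)
            x = observed[:, 0]
            for t in range(1, observed.size(1)):    # warm up on observed frames
                h = self.cell(self.rbf(x), h)
                x = observed[:, t]
            preds = []
            for _ in range(future_steps):           # roll out future features
                h = self.cell(self.rbf(x), h)
                x = self.out(h)                     # prediction becomes next input
                preds.append(x)
            return torch.stack(preds, dim=1)        # (batch, future_steps, feat_dim)

As a hypothetical usage, the generated features would then be pooled over time and classified by a simple MLP whose first layer is again an RBF kernel layer (all dimensions below are placeholders):

    rnn = FeatureMappingRNN(feat_dim=2048, hidden_dim=128, num_centres=256)
    classifier = nn.Sequential(RBFLayer(2048, 256), nn.Linear(256, 21))  # e.g. 21 JHMDB-21 classes
    early = torch.randn(4, 5, 2048)          # (batch, observed frames, feature dim)
    future = rnn(early, future_steps=10)     # generated future features
    logits = classifier(future.mean(dim=1))  # pool over time, then classify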

Keywords

Human action prediction · Novel Recurrent Neural Network · Radial Basis Function kernel · Adversarial training


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. The Australian National University, Canberra, Australia
