
Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks

  • Yanghao Li
  • Cuiling Lan
  • Junliang Xing
  • Wenjun Zeng
  • Chunfeng Yuan
  • Jiaying Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9911)

Abstract

Human action recognition from well-segmented 3D skeleton data has been intensively studied and has attracted increasing attention. Online action detection goes one step further and is more challenging: it identifies the action type and localizes the action positions on the fly from untrimmed streaming data. In this paper, we study the problem of online action detection from streaming skeleton data. We propose a multi-task, end-to-end Joint Classification-Regression Recurrent Neural Network to better exploit action type and temporal localization information. By employing a joint classification and regression optimization objective, the network localizes the start and end points of actions more accurately. Specifically, by leveraging the merits of a deep Long Short-Term Memory (LSTM) subnetwork, the proposed model automatically captures complex long-range temporal dynamics, which naturally avoids the typical sliding-window design and thus ensures high computational efficiency. Furthermore, the regression subtask enables the model to forecast an action prior to its occurrence. To evaluate the proposed model, we build a large annotated streaming video dataset. Experimental results on our dataset and on the public G3D dataset both demonstrate very promising performance.
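The joint objective sketched in the abstract, a frame-wise classification loss combined with a regression loss on soft start/end confidence targets, can be illustrated as follows. This is a minimal NumPy sketch under our own assumptions, not the authors' implementation: the function names `soft_confidence` and `joint_loss`, the Gaussian-shaped boundary target, and the weighting factor `lam` are all hypothetical choices for illustration.

```python
import numpy as np

def soft_confidence(t, boundary, sigma=2.0):
    """Gaussian-shaped regression target that peaks at an action boundary frame.

    Assumed form: frames near the start (or end) of an action receive
    confidence values close to 1, decaying smoothly with distance.
    """
    t = np.asarray(t, dtype=float)
    return np.exp(-((t - boundary) ** 2) / (2.0 * sigma ** 2))

def joint_loss(class_logits, class_labels, reg_pred, reg_target, lam=1.0):
    """Joint objective: frame-wise cross-entropy plus weighted boundary regression.

    class_logits: (T, C) per-frame class scores from the LSTM subnetwork
    class_labels: (T,) ground-truth action class per frame
    reg_pred, reg_target: (T,) predicted / target boundary confidences
    """
    # Numerically stable log-softmax over the class dimension.
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(class_labels)), class_labels].mean()
    # Squared error between predicted and target boundary confidences.
    mse = np.mean((np.asarray(reg_pred) - np.asarray(reg_target)) ** 2)
    return ce + lam * mse
```

Because the regression target rises before the boundary frame itself, a model trained against it can in principle signal an upcoming action before it occurs, which is the forecasting behavior the abstract refers to.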

Keywords

Action detection · Recurrent neural network · Joint classification-regression

Notes

Acknowledgement

This work was supported by the National High-tech R&D Program (863 Program) of China under Grant 2014AA015205, the National Natural Science Foundation of China under contracts No. 61472011 and No. 61303178, and the Beijing Natural Science Foundation under contract No. 4142021.

Supplementary material

Supplementary material 1 (mp4 17191 KB)

Supplementary material 2 (pdf 126 KB)


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Institute of Computer Science and Technology, Peking University, Beijing, China
  2. Microsoft Research Asia, Beijing, China
  3. Institute of Automation, Chinese Academy of Sciences, Beijing, China
