Global Regularizer and Temporal-Aware Cross-Entropy for Skeleton-Based Early Action Recognition

  • Qiuhong Ke (email author)
  • Jun Liu
  • Mohammed Bennamoun
  • Hossein Rahmani
  • Senjian An
  • Ferdous Sohel
  • Farid Boussaid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11364)


In this paper, we propose a new approach to recognize the class label of an action before it is fully performed, based on skeleton sequences. Compared to action recognition, which uses fully observed action sequences, early action recognition from partial sequences is much more challenging, mainly because: (1) the global information of a long-term action is not available in a partial sequence, and (2) the partial sequences of an action at different observation ratios contain varying numbers of sub-actions with diverse motion information. To address the first challenge, we introduce a global regularizer to learn a hidden feature space in which the statistical properties of partial sequences are similar to those of full sequences. To address the second challenge, we introduce a temporal-aware cross-entropy loss that achieves better prediction performance. We evaluate the proposed method on three challenging skeleton datasets; experimental results show the superiority of the proposed method for skeleton-based early action recognition.
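The two ideas above can be illustrated with a minimal sketch. The abstract does not give the exact formulations, so the following is a plausible, hypothetical instantiation: the global regularizer is sketched as a penalty on the gap between the mean and variance of partial-sequence and full-sequence features in the shared hidden space, and the temporal-aware cross-entropy is sketched as a standard cross-entropy scaled by the observation ratio, so that losses on highly partial (and hence more ambiguous) sequences are down-weighted. Both `global_regularizer` and `temporal_aware_cross_entropy` are illustrative names, not the paper's API.

```python
import numpy as np


def global_regularizer(h_partial, h_full):
    """Illustrative global regularizer (assumed form, not the paper's exact one):
    penalize the gap between the feature statistics (mean and variance) of
    partial and full sequences in a shared hidden space.

    h_partial, h_full: arrays of shape (num_samples, feature_dim).
    """
    mean_gap = np.sum((h_partial.mean(axis=0) - h_full.mean(axis=0)) ** 2)
    var_gap = np.sum((h_partial.var(axis=0) - h_full.var(axis=0)) ** 2)
    return mean_gap + var_gap


def temporal_aware_cross_entropy(probs, label, obs_ratio):
    """Illustrative temporal-aware cross-entropy (hypothetical weighting):
    ordinary cross-entropy scaled by the observation ratio in (0, 1], so
    partial observations contribute a smaller loss than full ones.

    probs: predicted class probabilities; label: ground-truth class index.
    """
    return -obs_ratio * np.log(probs[label] + 1e-12)
```

Under this sketch, identical feature distributions incur zero regularization cost, and the same prediction is penalized less at a 30% observation ratio than at 100%, which matches the intuition of treating early, ambiguous observations more leniently.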


Early action recognition · Global regularizer · Temporal-aware cross-entropy · 3D skeleton sequences



We gratefully acknowledge NVIDIA for providing a Titan XP GPU for the experiments involved in this research. This work was partially supported by Australian Research Council grants DP150100294, DP150104251, and DE120102960.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Qiuhong Ke (1, email author)
  • Jun Liu (2)
  • Mohammed Bennamoun (1)
  • Hossein Rahmani (3)
  • Senjian An (4)
  • Ferdous Sohel (5)
  • Farid Boussaid (1)
  1. The University of Western Australia, Crawley, Australia
  2. Nanyang Technological University, Singapore, Singapore
  3. Lancaster University, Lancashire, England
  4. Curtin University, Bentley, Australia
  5. Murdoch University, Murdoch, Australia
