Multimedia Tools and Applications, Volume 77, Issue 24, pp 31627–31645

A discriminative structural model for joint segmentation and recognition of human actions

  • Cuiwei Liu
  • Jingyi Hou
  • Xinxiao Wu
  • Yunde Jia

Abstract

Achieving joint segmentation and recognition of continuous actions in a long-term video is a challenging task due to the varying durations of actions and the complex transitions of multiple actions. In this paper, a novel discriminative structural model is proposed for splitting a long-term video into segments and annotating the action label of each segment. A set of state variables is introduced into the model to explore discriminative semantic concepts shared among different actions. To exploit the statistical dependences among segments, temporal context is captured at both the action level and the semantic concept level. The state variables are treated as latent information in the discriminative structural model and inferred during both training and testing. Experiments on multi-view IXMAS and realistic Hollywood datasets demonstrate the effectiveness of the proposed method.
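The idea of splitting a video into segments and labeling each segment at the same time can be illustrated with a generic segment-level dynamic program (a semi-Markov, Viterbi-style search). The sketch below is not the authors' model: it omits the latent state variables and the learned parameters, and the names `segment_and_label`, `segment_score`, `transition_score`, and `max_len` are hypothetical, introduced only for illustration. It assumes each candidate segment can be scored independently for each action and that temporal context is a pairwise score between consecutive action labels.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int, int]  # (start, end_exclusive, action)


def segment_and_label(
    num_frames: int,
    num_actions: int,
    max_len: int,
    segment_score: Callable[[int, int, int], float],
    transition_score: Callable[[int, int], float],
) -> Tuple[float, List[Segment]]:
    """Jointly pick segment boundaries and action labels that maximize
    the sum of segment scores plus transition scores (hypothetical scoring)."""
    neg_inf = float("-inf")
    # best[t][a]: best score for frames [0, t) when the last segment has action a.
    best = [[neg_inf] * num_actions for _ in range(num_frames + 1)]
    # back[t][a]: (start of last segment, action of previous segment or None).
    back: List[List[Optional[Tuple[int, Optional[int]]]]] = [
        [None] * num_actions for _ in range(num_frames + 1)
    ]

    for t in range(1, num_frames + 1):
        for s in range(max(0, t - max_len), t):  # candidate last segment [s, t)
            for a in range(num_actions):
                seg = segment_score(s, t, a)
                if s == 0:
                    cand, prev = seg, None  # first segment, no predecessor
                else:
                    cand, prev = neg_inf, None
                    for b in range(num_actions):
                        if best[s][b] == neg_inf:
                            continue
                        v = best[s][b] + transition_score(b, a) + seg
                        if v > cand:
                            cand, prev = v, b
                if cand > best[t][a]:
                    best[t][a] = cand
                    back[t][a] = (s, prev)

    # Backtrack from the best final action to recover the segments.
    a = max(range(num_actions), key=lambda k: best[num_frames][k])
    total = best[num_frames][a]
    segments: List[Segment] = []
    t = num_frames
    while t > 0:
        s, prev = back[t][a]
        segments.append((s, t, a))
        t, a = s, prev
    segments.reverse()
    return total, segments


# Toy data (made up for illustration): a per-frame "evidence" label per frame.
frame_evidence = [0, 0, 0, 1, 1, 2, 2, 2]


def toy_segment_score(s: int, t: int, a: int) -> float:
    # +1 for each frame whose evidence agrees with action a, -1 otherwise.
    return float(sum(1 if frame_evidence[i] == a else -1 for i in range(s, t)))


def toy_transition_score(b: int, a: int) -> float:
    # Mildly discourage two consecutive segments with the same action.
    return -0.5 if a == b else 0.0


total, segments = segment_and_label(
    num_frames=len(frame_evidence),
    num_actions=3,
    max_len=4,
    segment_score=toy_segment_score,
    transition_score=toy_transition_score,
)
print(total, segments)
```

Bounding the segment length with `max_len` keeps the search at roughly `num_frames * max_len * num_actions**2` score evaluations; a model with latent variables would additionally maximize over those variables inside each segment score.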

Keywords

  • Action recognition
  • Action segmentation
  • Discriminative structural model

Notes

Acknowledgements

This work was supported in part by the Natural Science Foundation of China (NSFC) under Grants No. 61602320 and No. 61673062, by the Liaoning Doctoral Startup Project under Grant No. 201601172, and by a project of the Liaoning Provincial Education Department under Grant No. L201607.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computer Science, Shenyang Aerospace University, Shenyang, People’s Republic of China
  2. Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, People’s Republic of China
