Human interaction recognition using spatial-temporal salient feature

  • Tao Hu
  • Xinyan Zhu
  • Shaohua Wang
  • Lian Duan


Depth sensors are widely used today and have had a great impact on object pose estimation, camera tracking, human action recognition, and scene reconstruction. This paper presents a novel method for human interaction recognition based on 3D skeleton data captured by a Kinect sensor, using a hierarchical spatial-temporal saliency-based representation. Hierarchical saliency is conceptualized at three levels: Salient Actions at the highest level, determined by the initial movement in an interaction; Salient Points at the middle level, determined by a single time point uniquely identified for all instances of a Salient Action; and Salient Joints at the lowest level, determined by the greatest positional changes of human joints in a Salient Action sequence. Given the interaction saliency at these different levels, several types of features, such as spatial displacement and direction relations, are introduced based on action characteristics. Since there are few publicly accessible test datasets, we created a new dataset with eight types of interactions, named K3HI, using the Microsoft Kinect. The method was evaluated with a multi-class Support Vector Machine (SVM) classifier. Experimental results demonstrate that the hierarchical saliency-based representation achieves an average recognition accuracy of 90.29%, outperforming methods using other features.
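The lowest saliency level, Salient Joints, is described as the joints with the greatest positional changes over a Salient Action sequence. A minimal sketch of that idea, assuming skeleton data stored as a NumPy array of shape (frames, joints, 3); the function name `salient_joints` and the accumulated-displacement criterion are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def salient_joints(skeleton_seq, top_k=3):
    """Return indices of the top_k joints with the greatest positional
    change across a sequence.

    skeleton_seq: array of shape (frames, joints, 3) of 3D joint positions.
    """
    # Frame-to-frame displacement vectors of each joint.
    diffs = np.diff(skeleton_seq, axis=0)                     # (frames-1, joints, 3)
    # Accumulate each joint's Euclidean displacement over the sequence.
    total_motion = np.linalg.norm(diffs, axis=2).sum(axis=0)  # (joints,)
    # The joints with the largest accumulated motion are taken as salient.
    return np.argsort(total_motion)[::-1][:top_k]
```

For example, in a toy sequence where only one joint translates while the others stay still, that joint is ranked first. In a real pipeline, features such as spatial displacement and direction relations would then be computed over these joints and fed to the SVM classifier.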


Depth sensor · Interaction recognition · Hierarchical saliency · SVM



We are grateful to the volunteers for capturing data. This research is supported by the National Key R&D Program of China (No. 2016YFB0502204), the National Key Technology R&D Program (No. 2015BAK03B04), the Funds for the Central Universities (No. 413000010), the Open Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University (No. 16(03)), the Guangxi Higher Education Undergraduate Teaching Reform Project Category A (2016JGA258), and the Opening Foundation of the Key Laboratory of Environment Change and Resources Use in Beibu Gulf, Ministry of Education (Guangxi Teachers Education University) and the Guangxi Key Laboratory of Earth Surface Processes and Intelligent Simulation (Guangxi Teachers Education University) (No. GTEU-KLOP-K1704).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Tao Hu (1, 2, 3, 4)
  • Xinyan Zhu (1, 2)
  • Shaohua Wang (5, corresponding author)
  • Lian Duan (4, 6)

  1. State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China
  2. Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan, China
  3. School of Information, Kent State University, Kent, USA
  4. Key Laboratory of Environment Change and Resources Use in Beibu Gulf, Guangxi Teachers Education University, Nanning, China
  5. International Software School, Wuhan University, Wuhan, China
  6. Geography Science and Planning School, Guangxi Teachers Education University, Nanning, China
