
Multi-view depth-based pairwise feature learning for person-person interaction recognition

  • Meng Li
  • Howard Leung

Abstract

This paper addresses the problem of recognizing person-person interactions using multi-view data captured by depth cameras. Because an interaction between two people has a complex spatio-temporal structure, it is difficult to characterize different classes of person-person interactions for recognition. To handle this difficulty, we divide each person-person interaction into body part interactions and analyze the person-person interaction using the pairwise features of these body part interactions. We first employ two features to represent the relative movement and the local physical contact between the body parts of the two people, and extract pairwise features to characterize the corresponding body part interaction. For each camera view, we propose a regression-based learning approach with a sparsity-inducing regularizer that models each person-person interaction as a combination of pairwise features over a sparse set of body part interactions. To take full advantage of the information in all depth camera views, we further extend the proposed interaction learning model to combine features from multiple views in order to improve recognition performance. Our approach is evaluated on three public activity recognition datasets captured with depth cameras, and the experimental results demonstrate its efficacy.
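To make the pairwise-feature idea concrete, the following is a minimal sketch of how relative-movement and local-contact descriptors could be computed for every pair of body parts between two tracked skeletons. The input layout (T frames, J joints, 3D positions), the summary statistics, and the contact threshold are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def pairwise_features(joints_a, joints_b, contact_thresh=0.15):
        # joints_a, joints_b: (T, J, 3) arrays of 3D joint positions over
        # T frames (hypothetical layout; a depth camera such as Kinect
        # typically provides 20-25 joints per person).
        # Returns one feature vector per (joint_i, joint_j) pair, combining
        # relative movement and a local-contact statistic.
        T, J, _ = joints_a.shape
        feats = []
        for i in range(J):
            for j in range(J):
                rel = joints_a[:, i, :] - joints_b[:, j, :]   # (T, 3) relative vector
                dist = np.linalg.norm(rel, axis=1)            # per-frame distance
                movement = np.diff(rel, axis=0)               # frame-to-frame relative motion
                contact = (dist < contact_thresh).mean()      # fraction of frames in contact
                feats.append(np.concatenate([
                    movement.mean(axis=0),                    # average relative velocity
                    [dist.mean(), dist.min(), contact],       # proximity statistics
                ]))
        return np.asarray(feats).ravel()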
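The regression-based learning step and the multi-view fusion might then be approximated as below. This sketch substitutes scikit-learn's l1-penalized Lasso for the paper's sparsity-inducing regularizer, encodes classes one-vs-rest, and fuses views by summing per-view class scores; these choices, and all function names, are assumptions for illustration rather than the authors' method.

    import numpy as np
    from sklearn.linear_model import Lasso

    def train_sparse_models(X_views, y, n_classes, alpha=0.01):
        # X_views: list of (N, D_v) feature matrices, one per camera view.
        # The l1 penalty (alpha) stands in for the paper's sparsity-inducing
        # regularizer, so each class is explained by a sparse subset of
        # body-part-interaction features.
        models = []
        for X in X_views:
            view_models = []
            for c in range(n_classes):
                target = (y == c).astype(float)   # one-vs-rest regression target
                m = Lasso(alpha=alpha, max_iter=10000).fit(X, target)
                view_models.append(m)
            models.append(view_models)
        return models

    def predict(models, X_views):
        # Fuse views by summing per-view class scores, then take the argmax.
        n = X_views[0].shape[0]
        scores = np.zeros((n, len(models[0])))
        for view_models, X in zip(models, X_views):
            for c, m in enumerate(view_models):
                scores[:, c] += m.predict(X)
        return scores.argmax(axis=1)

The nonzero coefficients of each class's regressor indicate which body part interactions that class depends on, which mirrors the paper's goal of selecting a sparse set of discriminative body part interactions.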

Keywords

Person-person interaction recognition · Pairwise feature · Regression-based learning · Multi-view · Depth camera


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Mathematics and Statistics, Hebei University of Economics and Business, Shijiazhuang, China
  2. Department of Computer Science, City University of Hong Kong, Hong Kong, China
