Recognizing Conversational Interaction Based on 3D Human Pose

  • Jingjing Deng
  • Xianghua Xie
  • Ben Daubney
  • Hui Fang
  • Phil W. Grant
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8192)


In this paper, we take a bag-of-visual-words approach to investigate whether it is possible to distinguish conversational scenarios by observing human motion alone, in particular gestures in 3D. The conversational interactions considered in this work differ only subtly from one another. Unlike typical action or event recognition, each interaction in our case contains many instances of primitive motions and actions, many of which are shared among different conversational scenarios. Hence, extracting and learning temporal dynamics is essential. We use Kinect sensors to extract low-level temporal features. These features are generalized to form a visual vocabulary, which is further generalized to a set of topics from the temporal distributions of the vocabulary. A subject-specific supervised learning approach based on both generative and discriminative classifiers is employed to classify test sequences into seven different conversational scenarios. We believe this is among the first works devoted to conversational interaction classification using 3D pose features, and it shows that this task is indeed possible.
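The pipeline described above (cluster low-level pose features into a visual vocabulary, quantize each sequence into a bag-of-words histogram, generalize histograms into topics, then classify) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the feature dimensionality, vocabulary size, topic count, and the choice of k-means, LDA, and a linear SVM as concrete stand-ins for the paper's components are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for low-level temporal pose features: each sequence
# is a variable-length set of short motion descriptors (here 6-D).
n_sequences, n_classes = 70, 7          # 7 conversational scenarios
vocab_size, n_topics = 20, 5            # illustrative sizes, not from the paper
labels = np.repeat(np.arange(n_classes), n_sequences // n_classes)
sequences = [
    rng.normal(loc=labels[i], scale=1.0, size=(rng.integers(30, 60), 6))
    for i in range(n_sequences)
]

# 1) Build the visual vocabulary by clustering all descriptors.
kmeans = KMeans(n_clusters=vocab_size, n_init=5, random_state=0)
kmeans.fit(np.vstack(sequences))

# 2) Quantize each sequence into a bag-of-visual-words histogram.
def bow(seq):
    words = kmeans.predict(seq)
    return np.bincount(words, minlength=vocab_size)

hists = np.array([bow(s) for s in sequences])

# 3) Generalize histograms to a topic representation (LDA on word counts).
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
topics = lda.fit_transform(hists)       # one topic mixture per sequence

# 4) Supervised classification of the topic mixtures into the 7 scenarios
#    (a discriminative classifier; the paper also uses a generative one).
clf = SVC(kernel="linear").fit(topics, labels)
acc = clf.score(topics, labels)
```

A subject-specific setup, as in the paper, would fit this pipeline per subject and evaluate on held-out sequences of the same subject rather than on the training set.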


Keywords: 3D human pose · conversational interaction classification · interaction analysis · Kinect sensor





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Jingjing Deng (1)
  • Xianghua Xie (1)
  • Ben Daubney (1)
  • Hui Fang (1)
  • Phil W. Grant (1)

  1. Department of Computer Science, Swansea University, Swansea, United Kingdom
