Conversational Interaction Recognition Based on Bodily and Facial Movement

  • Jingjing Deng
  • Xianghua Xie
  • Shangming Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8814)


We examine whether 3D pose and face features can be used to both learn and recognize different conversational interactions. We believe this to be among the first works devoted to this subject, and we show that the task is indeed possible with a promising degree of accuracy using features derived from both pose and face. To extract 3D pose we use the Kinect sensor, and we use a combined local and global model to extract face features from normal RGB cameras. We show that although both of these features are contaminated with noise, they can still be used to effectively train classifiers. The differences in interaction among the scenarios in our data set are extremely subtle. Both generative and discriminative methods are investigated, and a subject-specific supervised learning approach is employed to classify the testing sequences into seven different conversational scenarios.
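The classification setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the feature sequences are synthetic, the feature dimension is hypothetical, and a simple nearest-centroid classifier stands in for the randomized decision trees, HMM, and SVM named in the keywords. All function names are illustrative, and training and test sequences are assumed to come from the same subject to mirror the subject-specific setting.

```python
import random
import statistics

NUM_SCENARIOS = 7   # the abstract states seven conversational scenarios
FEATURE_DIM = 4     # hypothetical; real pose and face descriptors are larger


def synth_sequence(label, length, rng):
    """Generate a noisy synthetic feature sequence for one scenario (illustrative only)."""
    return [[label * 0.5 + d * 0.1 + rng.gauss(0.0, 0.3) for d in range(FEATURE_DIM)]
            for _ in range(length)]


def describe(sequence):
    """Summarize a variable-length sequence by its per-dimension mean."""
    return [statistics.fmean(frame[d] for frame in sequence)
            for d in range(FEATURE_DIM)]


def train_centroids(sequences, labels):
    """Average the descriptors of each class (nearest-centroid stand-in classifier)."""
    grouped = {}
    for seq, lab in zip(sequences, labels):
        grouped.setdefault(lab, []).append(describe(seq))
    return {lab: [statistics.fmean(col) for col in zip(*descs)]
            for lab, descs in grouped.items()}


def classify(centroids, sequence):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    desc = describe(sequence)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(desc, centroids[lab])))


def run_demo(seed=0):
    """Train on five synthetic sequences per scenario, then classify one new sequence each."""
    rng = random.Random(seed)
    labels = [lab for lab in range(NUM_SCENARIOS) for _ in range(5)]
    train = [synth_sequence(lab, 200, rng) for lab in labels]
    centroids = train_centroids(train, labels)
    test_labels = list(range(NUM_SCENARIOS))
    test = [synth_sequence(lab, 200, rng) for lab in test_labels]
    predictions = [classify(centroids, seq) for seq in test]
    accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
    return predictions, accuracy


if __name__ == "__main__":
    predictions, accuracy = run_demo()
    print(predictions, accuracy)
```

Because the synthetic classes are well separated, the sketch says nothing about the accuracy achievable on real Kinect or RGB data, where the abstract notes that differences between scenarios are extremely subtle.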


Human interaction modeling; Conversational interaction analysis; 3D human pose; Face analysis; Randomized decision trees; HMM; SVM





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Department of Computer Science, Swansea University, Swansea, UK
