Domain-Adaptive Discriminative One-Shot Learning of Gestures

  • Tomas Pfister
  • James Charles
  • Andrew Zisserman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8694)


The objective of this paper is to recognize gestures in videos – both localizing the gesture and classifying it into one of multiple classes.

We show that the performance of a gesture classifier learnt from a single (strongly supervised) training example can be boosted significantly using a ‘reservoir’ of weakly supervised gesture examples (and that the performance exceeds learning from the one-shot example or the reservoir alone). The one-shot example and weakly supervised reservoir are from different ‘domains’ (different people, different videos, continuous or non-continuous gesturing, etc.), and we propose a domain adaptation method for human pose and hand shape that enables gesture learning methods to generalise between them. We also show the benefits of using the recently introduced Global Alignment Kernel [12] in place of the standard Dynamic Time Warping that is generally used for time alignment.
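To make the contrast concrete, the sketch below compares the two time-alignment methods mentioned above on toy 1-D sequences: Dynamic Time Warping scores only the single best alignment path, whereas the Global Alignment Kernel of Cuturi [12] sums soft similarities over all alignment paths. This is a minimal illustration, not the paper's implementation: the paper aligns pose and hand-shape feature sequences, not scalars, and Cuturi additionally recommends a modified local kernel for positive definiteness; the Gaussian local similarity and `sigma` value here are illustrative assumptions.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping: cost of the single cheapest alignment path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Best of: insertion, deletion, or diagonal match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def global_alignment_kernel(x, y, sigma=1.0):
    """Global Alignment Kernel (Cuturi, ICML 2011), basic recursion:
    a soft sum over *all* alignment paths rather than the best one.
    Gaussian local similarity is an illustrative choice."""
    n, m = len(x), len(y)
    K = np.zeros((n + 1, m + 1))
    K[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k_local = np.exp(-((x[i - 1] - y[j - 1]) ** 2) / (2 * sigma ** 2))
            # Accumulate similarity over every path reaching (i, j).
            K[i, j] = k_local * (K[i - 1, j] + K[i, j - 1] + K[i - 1, j - 1])
    return K[n, m]
```

Because GAK aggregates all plausible alignments, it degrades more gracefully than DTW when gestures are performed at different speeds with local timing noise, which is the motivation for preferring it here.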

The domain adaptation and learning methods are evaluated on two large-scale, challenging gesture datasets: one for sign language, and the other for Italian hand gestures. In both cases performance exceeds previously published results, including the best skeleton-classification-only entry in the 2013 ChaLearn challenge.


Keywords: Sign Language · Gesture Recognition · Domain Adaptation · Dynamic Time Warping · Hand Shape


References

  1. Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple instance learning. IEEE PAMI 32(2), 288–303 (2010)
  2. Baisero, A., Pokorny, F.T., Kragic, D., Ek, C.: The path kernel. In: ICPRAM (2013)
  3. Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Finding actors and actions in movies. In: Proc. ICCV (2013)
  4. Books, M.: The standard dictionary of the British sign language. DVD (2005)
  5. Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. ICCV (2001)
  6. Bristol Centre for Deaf Studies: Signstation (accessed March 1, 2014)
  7. Buehler, P., Everingham, M., Zisserman, A.: Learning sign language by watching TV (using weakly aligned subtitles). In: Proc. CVPR (2009)
  8. Chai, X., Li, G., Lin, Y., Xu, Z., Tang, Y., Chen, X., Zhou, M.: Sign language recognition and translation with Kinect. In: Proc. Int. Conf. Autom. Face and Gesture Recog. (2013)
  9. Charles, J., Pfister, T., Everingham, M., Zisserman, A.: Automatic and efficient human pose estimation for sign language videos. IJCV (2013)
  10. Charles, J., Pfister, T., Magee, D., Hogg, D., Zisserman, A.: Domain adaptation for upper body pose tracking in signed TV broadcasts. In: Proc. BMVC (2013)
  11. Cooper, H., Bowden, R.: Learning signs from subtitles: A weakly supervised approach to sign language recognition. In: Proc. CVPR (2009)
  12. Cuturi, M.: Fast global alignment kernels. In: ICML (2011)
  13. Cuturi, M., Vert, J., Birkenes, Ø., Matsui, T.: A kernel for time series based on global alignments. In: ICASSP (2007)
  14. Duchenne, O., Laptev, I., Sivic, J., Bach, F., Ponce, J.: Automatic annotation of human actions in video. In: Proc. CVPR (2009)
  15. Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Guyon, I., Athitsos, V., Escalante, H., Sigal, L., Argyros, A., Sminchisescu, C.: ChaLearn multi-modal gesture recognition 2013: grand challenge and workshop summary. In: ACM MM (2013)
  16. Fanello, S., Gori, I., Metta, G., Odone, F.: Keep it simple and sparse: real-time action recognition. J. Machine Learning Research 14(1), 2617–2640 (2013)
  17. Farhadi, A., Forsyth, D., White, R.: Transfer learning in sign language. In: Proc. CVPR (2007)
  18. Gaidon, A., Harchaoui, Z., Schmid, C.: A time series kernel for action recognition. In: Proc. BMVC (2011)
  19. Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H., Hamner, B.: Results and analysis of the ChaLearn gesture challenge 2012. In: Proc. ICPR (2013)
  20. Guyon, I., Athitsos, V., Jangyodsuk, P., Hamner, B., Escalante, H.: ChaLearn gesture challenge: Design and first results. In: CVPR Workshops (2012)
  21. Hariharan, B., Malik, J., Ramanan, D.: Discriminative decorrelation for clustering and classification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575, pp. 459–472. Springer, Heidelberg (2012)
  22. Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: Proc. ICCV (2007)
  23. Kelly, D., McDonald, J., Markham, C.: Weakly supervised training of a sign language recognition system using multiple instance learning density matrices. Trans. Systems, Man, and Cybernetics 41(2), 526–541 (2011)
  24. Krishnan, R., Sarkar, S.: Similarity measure between two gestures using triplets. In: CVPR Workshops (2013)
  25. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: Proc. ICCV (2011)
  26. Nayak, S., Duncan, K., Sarkar, S., Loeding, B.: Finding recurrent patterns from continuous sign language sentences for automated extraction of signs. J. Machine Learning Research 13(1), 2589–2615 (2012)
  27. Pfister, T., Charles, J., Everingham, M., Zisserman, A.: Automatic and efficient long term arm and hand tracking for continuous sign language TV broadcasts. In: Proc. BMVC (2012)
  28. Pfister, T., Charles, J., Zisserman, A.: Large-scale learning of sign language by watching TV (using co-occurrences). In: Proc. BMVC (2013)
  29. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. In: Proc. ACM SIGGRAPH (2004)
  30. Sakoe, H.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing (1978)
  31. Sakoe, H., Chiba, S.: A similarity evaluation of speech patterns by dynamic programming. In: Nat. Meeting of Institute of Electronic Communications Engineers of Japan (1970)
  32. Shimodaira, H., Noma, K., Nakai, M., Sagayama, S.: Dynamic time-alignment kernel in support vector machine. In: NIPS (2001)
  33. Wan, J., Ruan, Q., Li, W., Deng, S.: One-shot learning gesture recognition from RGB-D data using bag of features. J. Machine Learning Research 14(1), 2549–2582 (2013)
  34. Wu, J., Cheng, J., Zhao, C., Lu, H.: Fusing multi-modal features for gesture recognition. In: ICMI (2013)
  35. Zhou, F., De la Torre, F.: Generalized time warping for multi-modal alignment of human motion. In: Proc. CVPR (2012)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Tomas Pfister (1)
  • James Charles (2)
  • Andrew Zisserman (1)

  1. Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
  2. Computer Vision Group, School of Computing, University of Leeds, UK
