
International Journal of Computer Vision, Volume 110, Issue 1, pp 70–90

Automatic and Efficient Human Pose Estimation for Sign Language Videos

  • James Charles
  • Tomas Pfister
  • Mark Everingham
  • Andrew Zisserman

Abstract

We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. To achieve this, we make contributions in four areas: (i) we show that the overlaid signer can be separated from the background TV broadcast using co-segmentation over all frames with a layered model; (ii) we show that joint positions (shoulders, elbows, wrists) can be predicted per-frame using a random forest regressor given only this segmentation and a colour model; (iii) we show that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker; and (iv) we introduce an evaluator to assess whether the predicted joint positions are correct for each frame. The method is applied to 20 videos of signing footage with changing backgrounds, challenging imaging conditions, and different signers. Our framework outperforms the state-of-the-art long-term tracker of Buehler et al. (International Journal of Computer Vision 95:180–197, 2011), does not require the manual annotation of that work, and, after automatic initialisation, performs tracking in real time. We also achieve superior joint localisation results to those obtained using the pose estimation method of Yang and Ramanan (Proceedings of the IEEE conference on computer vision and pattern recognition, 2011).
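As a concrete illustration of step (ii), the sketch below trains a generic random forest regressor (scikit-learn's RandomForestRegressor standing in for the paper's own forest) to map per-frame features built from the signer segmentation and colour model to 2D joint positions, with training targets supplied by the slower semi-automatic tracker of step (iii). The feature encoding, frame size, joint count, forest settings, and synthetic data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of per-frame joint regression (illustrative, not the authors' code).
# Assumed setup: features are the flattened signer mask plus a per-pixel colour-model
# posterior; targets are (x, y) joint positions produced by a slower tracker.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

H, W = 64, 48          # assumed downsampled frame resolution
N_JOINTS = 6           # left/right shoulders, elbows, wrists (as listed in the abstract)

def frame_features(segmentation, colour_posterior):
    """Concatenate the binary signer mask and the colour-model posterior
    into a single feature vector (one simple encoding; others are possible)."""
    return np.concatenate([segmentation.ravel(), colour_posterior.ravel()])

# --- Training data: placeholders for frames labelled by the semi-automatic tracker ---
rng = np.random.default_rng(0)
n_train = 500
X = np.stack([frame_features(rng.integers(0, 2, (H, W)).astype(float),
                             rng.random((H, W)))
              for _ in range(n_train)])
Y = rng.random((n_train, 2 * N_JOINTS)) * np.tile([W, H], N_JOINTS)   # (x, y) per joint

forest = RandomForestRegressor(n_estimators=8, max_depth=20, n_jobs=-1, random_state=0)
forest.fit(X, Y)

# --- Inference: each frame is predicted independently (no temporal tracking) ---
test_features = frame_features(rng.integers(0, 2, (H, W)).astype(float),
                               rng.random((H, W)))
joints = forest.predict(test_features[None]).reshape(N_JOINTS, 2)
print(joints)   # predicted (x, y) coordinates for each joint
```

Because every frame is predicted independently from its segmentation and colour model, inference amounts to a single forest evaluation per frame, which is consistent with the real-time tracking reported after automatic initialisation.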

Keywords

Sign language · Human pose estimation · Co-segmentation · Random forest

Acknowledgments

We are grateful to Lubor Ladicky for discussions, and to Patrick Buehler for his very generous help. Funding is provided by the Engineering and Physical Sciences Research Council (EPSRC) grant Learning to Recognise Dynamic Visual Content from Broadcast Footage.

Supplementary material

Supplementary material 1 (MPG 5066 KB)

References

  1. Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9(7), 1545–1588.
  2. Andriluka, M., Roth, S., & Schiele, B. (2012). Discriminative appearance models for pictorial structures. International Journal of Computer Vision, 99(3), 259–280.
  3. Apostoloff, N. E., & Zisserman, A. (2007). Who are you? Real-time person identification. In Proceedings of the British machine vision conference.
  4. Benfold, B., & Reid, I. (2008). Colour invariant head pose classification in low resolution video. In Proceedings of the British machine vision conference.
  5. Bosch, A., Zisserman, A., & Munoz, X. (2007). Image classification using random forests and ferns. In Proceedings of the international conference on computer vision.
  6. Bowden, R., Windridge, D., Kadir, T., Zisserman, A., & Brady, J. M. (2004). A linguistic feature vector for the visual interpretation of sign language. In Proceedings of the European conference on computer vision. Berlin: Springer.
  7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
  8. Buehler, P., Everingham, M., Huttenlocher, D. P., & Zisserman, A. (2011). Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision, 95(2), 180–197.
  9. Buehler, P., Everingham, M., & Zisserman, A. (2009). Learning sign language by watching TV (using weakly aligned subtitles). In Proceedings of the IEEE conference on computer vision and pattern recognition.
  10. Buehler, P., Everingham, M., & Zisserman, A. (2010). Employing signed TV broadcasts for automated learning of British sign language. In Workshop on representation and processing of sign languages.
  11. Chai, Y., Lempitsky, V., & Zisserman, A. (2011). BiCoS: A bi-level co-segmentation method for image classification. In Proceedings of the international conference on computer vision.
  12. Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L., & Zisserman, A. (2012). TriCoS: A tri-level class-discriminative co-segmentation method for image classification. In Proceedings of the European conference on computer vision.
  13. Charles, J., Pfister, T., Magee, D., Hogg, D., & Zisserman, A. (2013). Domain adaptation for upper body pose tracking in signed TV broadcasts. In Proceedings of the British machine vision conference.
  14. Chunli, W., Wen, G., & Jiyong, M. (2002). A real-time large vocabulary recognition system for Chinese Sign Language. In Gesture and sign language in HCI.
  15. Cooper, H., & Bowden, R. (2007). Large lexicon detection of sign language. In Workshop on human computer interaction.
  16. Cooper, H., & Bowden, R. (2009). Learning signs from subtitles: A weakly supervised approach to sign language recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  17. Cootes, T., Ionita, M., Lindner, C., & Sauer, P. (2012). Robust and accurate shape model fitting using random forest regression voting. In Proceedings of the European conference on computer vision.
  18. Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2), 81–227.
  19. Criminisi, A., Shotton, J., Robertson, D., & Konukoglu, E. (2011). Regression forests for efficient anatomy detection and localization in CT studies. In International conference on medical image computing and computer assisted intervention workshop on probabilistic models for medical image analysis.
  20. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  21. Dantone, M., Gall, J., Fanelli, G., & Van Gool, L. (2012). Real-time facial feature detection using conditional regression forests. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  22. Dreuw, P., Deselaers, T., Rybach, D., Keysers, D., & Ney, H. (2006). Tracking using dynamic programming for appearance-based sign language recognition. In Proceedings of the IEEE conference on automatic face and gesture recognition.
  23. Dreuw, P., Forster, J., & Ney, H. (2012). Tracking benchmark databases for video-based sign language recognition. In Trends and topics in computer vision (pp. 286–297). Berlin: Springer.
  24. Eichner, M., & Ferrari, V. (2009). Better appearance models for pictorial structures. In Proceedings of the British machine vision conference.
  25. Eichner, M., Marin-Jimenez, M., Zisserman, A., & Ferrari, V. (2012). 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision, 1–25.
  26. Fanelli, G., Dantone, M., Gall, J., Fossati, A., & Van Gool, L. (2012). Random forests for real time 3D face analysis. International Journal of Computer Vision, 101(3), 1–22.
  27. Fanelli, G., Gall, J., & Van Gool, L. (2011). Real time head pose estimation with random regression forests. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  28. Farhadi, A., & Forsyth, D. (2006). Aligning ASL for statistical translation using a discriminative word model. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  29. Farhadi, A., Forsyth, D., & White, R. (2007). Transfer learning in sign language. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  30. Felzenszwalb, P., Girshick, R., & McAllester, D. (2010). Cascade object detection with deformable part models. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  31. Felzenszwalb, P., & Huttenlocher, D. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
  32. Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  33. Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  34. Gall, J., & Lempitsky, V. (2009). Class-specific Hough forests for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  35. Geremia, E., Clatz, O., Menze, B., Konukoglu, E., Criminisi, A., & Ayache, N. (2011). Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images. NeuroImage, 57(2), 378–390.
  36. Girshick, R., Shotton, J., Kohli, P., Criminisi, A., & Fitzgibbon, A. (2011). Efficient regression of general-activity human poses from depth images. In Proceedings of the international conference on computer vision.
  37. Hochbaum, D., & Singh, V. (2009). An efficient algorithm for co-segmentation. In Proceedings of the international conference on computer vision.
  38. Jammalamadaka, N., Zisserman, A., Eichner, M., Ferrari, V., & Jawahar, C. V. (2012). Has my algorithm succeeded? An evaluator for human pose estimators. In Proceedings of the European conference on computer vision.
  39. Johnson, S., & Everingham, M. (2009). Combining discriminative appearance and segmentation cues for articulated human pose estimation. In IEEE international workshop on machine learning for vision-based motion analysis.
  40. Jojic, N., & Frey, B. (2001). Learning flexible sprites in video layers. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  41. Joulin, A., Bach, F., & Ponce, J. (2010). Discriminative clustering for image co-segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  42. Kadir, T., Bowden, R., Ong, E., & Zisserman, A. (2004). Minimal training, large lexicon, unconstrained sign language recognition. In Proceedings of the British machine vision conference.
  43. Kadir, T., Zisserman, A., & Brady, J. M. (2004). An affine invariant salient region detector. In Proceedings of the European conference on computer vision.
  44. Kontschieder, P., Bulò, S., Criminisi, A., Kohli, P., Pelillo, M., & Bischof, H. (2012). Context-sensitive decision forests for object detection. In Advances in neural information processing systems.
  45. Kumar, M. P., Torr, P. H. S., & Zisserman, A. (2008). Learning layered motion segmentations of video. International Journal of Computer Vision, 76, 301–319.
  46. Lepetit, V., & Fua, P. (2006). Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1465–1479.
  47. Liu, C., Gong, S., Loy, C., & Lin, X. (2012). Person re-identification: What features are important? In Proceedings of the European conference on computer vision.
  48. Marée, R., Geurts, P., Piater, J., & Wehenkel, L. (2005). Random subwindows for robust image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  49. Moeslund, T. (2011). Visual analysis of humans: Looking at people. Berlin: Springer.
  50. Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., & Kohli, P. (2011). Decision tree fields. In Proceedings of the international conference on computer vision.
  51. Ong, E., & Bowden, R. (2004). A boosted classifier tree for hand shape detection. In Proceedings of the international conference on automatic face and gesture recognition.
  52. Ozuysal, M., Calonder, M., Lepetit, V., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 448–461.
  53. Pfister, T., Charles, J., Everingham, M., & Zisserman, A. (2012). Automatic and efficient long term arm and hand tracking for continuous sign language TV broadcasts. In Proceedings of the British machine vision conference.
  54. Pfister, T., Charles, J., & Zisserman, A. (2013). Large-scale learning of sign language by watching TV (using co-occurrences). In Proceedings of the British machine vision conference.
  55. Ramanan, D. (2006). Learning to parse images of articulated bodies. In Advances in neural information processing systems.
  56. Ramanan, D., Forsyth, D. A., & Zisserman, A. (2007). Tracking people by learning their appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 65–81.
  57. Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: Interactive foreground extraction using iterated graph cuts. In Proceedings of the ACM SIGGRAPH conference on computer graphics.
  58. Rother, C., Minka, T., Blake, A., & Kolmogorov, V. (2006). Cosegmentation of image pairs by histogram matching: Incorporating a global constraint into MRFs. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  59. Santner, J., Leistner, C., Saffari, A., Pock, T., & Bischof, H. (2010). PROST: Parallel robust online simple tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  60. Sapp, B., Jordan, C., & Taskar, B. (2010). Adaptive pose priors for pictorial structures. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  61. Sapp, B., Weiss, D., & Taskar, B. (2011). Parsing human motion with stretchable models. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  62. Sharp, T. (2008). Implementing decision trees and forests on a GPU. In Proceedings of the European conference on computer vision.
  63. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  64. Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  65. Sivic, J., Zitnick, C. L., & Szeliski, R. (2006). Finding people in repeated shots of the same scene. In Proceedings of the British machine vision conference, Edinburgh.
  66. Starner, T., Weaver, J., & Pentland, A. (1998a). Real-time American Sign Language recognition using desk- and wearable computer-based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371–1375.
  67. Starner, T., Weaver, J., & Pentland, A. (1998b). Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371–1375.
  68. Sun, M., Kohli, P., & Shotton, J. (2012). Conditional regression forests for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  69. Szeliski, R., Avidan, S., & Anandan, P. (2000). Layer extraction from multiple images containing reflections and transparency. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  70. Taylor, J., Shotton, J., Sharp, T., & Fitzgibbon, A. (2012). The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  71. Tran, D., & Forsyth, D. (2010). Improved human parsing with a full relational model. In Proceedings of the European conference on computer vision.
  72. Vogler, C., & Metaxas, D. (1998). ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the international conference on computer vision.
  73. Yang, Y., & Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  74. Yin, P., Criminisi, A., Winn, J., & Essa, I. (2007). Tree-based classifiers for bilayer video segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  75. Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition.
  76. Zisserman, A., Winn, J., Fitzgibbon, A., van Gool, L., Sivic, J., Williams, C., & Hogg, D. (2012). In memoriam: Mark Everingham. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2081–2082.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • James Charles (1)
  • Tomas Pfister (2)
  • Mark Everingham (1)
  • Andrew Zisserman (2)

  1. School of Computing, University of Leeds, Leeds, UK
  2. Department of Engineering Science, University of Oxford, Oxford, UK
