Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Abstract

Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state of the art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and have released software and evaluation code. We summarize important conclusions here: (1) Coarse pose estimation appears viable for scenes with isolated hands, but high-precision pose estimation (required for immersive virtual reality) and cluttered scenes (where hands may be interacting with nearby objects and surfaces) remain a challenge. To spur further progress, we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define a consistent evaluation criterion, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems, implying that most systems do not generalize beyond their training sets. This reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.
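The nearest-neighbor baseline and the per-frame evaluation criterion described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function names, the use of raw depth crops as features, and the 20 mm threshold are assumptions chosen for clarity.

```python
import numpy as np

def nn_pose_estimate(train_depths, train_poses, query_depth):
    """Return the pose of the training depth map closest (in L2
    distance over raw pixels) to the query depth map."""
    diffs = train_depths.reshape(len(train_depths), -1) - query_depth.ravel()
    nearest = np.argmin((diffs ** 2).sum(axis=1))
    return train_poses[nearest]

def max_joint_error(pred_pose, true_pose):
    """Maximum Euclidean distance over all joints (e.g. in mm) --
    a strict per-frame error measure."""
    return np.linalg.norm(pred_pose - true_pose, axis=-1).max()

def proportion_correct(pred_poses, true_poses, thresh=20.0):
    """Fraction of frames whose worst joint error falls below a
    distance threshold."""
    errs = np.linalg.norm(pred_poses - true_poses, axis=-1).max(axis=1)
    return float((errs <= thresh).mean())
```

Because the baseline simply memorizes training examples, its accuracy is a direct probe of how well a dataset's training set covers its test set: whenever such a lookup matches a learned model, the model is unlikely to be generalizing beyond its training data.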

Notes

  1.

    http://www.ics.uci.edu/~jsupanci/#HandData.

Acknowledgements

JS and DR were supported by National Science Foundation Grant 0954083, Office of Naval Research MURI Grant N00014-10-1-0933, and the Intel Science and Technology Center for Visual Computing. GR was supported by the European Commission FP7 Marie Curie IOF grant “Egovision4Health” (PIOF-GA-2012-328288).

Author information

Corresponding author

Correspondence to James Steven Supančič III.

Additional information

Communicated by J. Rehg.

About this article

Cite this article

Supančič, J.S., Rogez, G., Yang, Y. et al. Depth-Based Hand Pose Estimation: Methods, Data, and Challenges. Int J Comput Vis 126, 1180–1198 (2018). https://doi.org/10.1007/s11263-018-1081-7

Keywords

  • Hand pose
  • RGB-D sensor
  • Datasets
  • Benchmarking