Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Supančič, James Steven; Rogez, Grégory; Yang, Yi; Shotton, Jamie; Ramanan, Deva

doi:10.1007/s11263-018-1081-7

Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Published: 12 April 2018

Volume 126, pages 1180–1198, (2018)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

James Steven Supančič III ORCID: orcid.org/0000-0003-1631-682X¹,
Grégory Rogez^2,3,
Yi Yang⁴,
Jamie Shotton⁵ &
…
Deva Ramanan⁶

2774 Accesses
46 Citations
1 Altmetric
Explore all metrics

Abstract

Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and have released software and evaluation code. We summarize important conclusions here: (1) Coarse pose estimation appears viable for scenes with isolated hands. However, high precision pose estimation [required for immersive virtual reality and cluttered scenes (where hands may be interacting with nearby objects and surfaces) remain a challenge. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define a consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Computer vision-based hand gesture recognition for human-robot interaction: a review

Article Open access 19 July 2023

Jing Qi, Li Ma, … Yushu Yu

A review of computer vision-based approaches for physical rehabilitation and assessment

Article Open access 19 June 2021

Bappaditya Debnath, Mary O’Brien, … Ardhendu Behera

3D point cloud-based place recognition: a survey

Article Open access 07 March 2024

Kan Luo, Hongshan Yu, … Ajmal Mian

Notes

http://www.ics.uci.edu/~jsupanci/#HandData.

References

Ballan, L., Taneja, A., Gall, J., Gool, L. J. V., & Pollefeys, M. (2012). Motion capture of hands in action using discriminative salient points. In ECCV (6).
Bray, M., Koller-Meier, E., Müller, P., Van Gool, L., & Schraudolph, N. N. (2004). 3D hand tracking by rapid stochastic gradient descent using a skinning model. In 1st European conference on visual media production (CVMP).
Bullock, I. M., Member, S., Zheng, J. Z., Rosa, S. D. L., Guertler, C., & Dollar, A. M. (2013). IEEE transactions on grasp frequency and usage in daily household and machine shop tasks, Haptics.
Camplani, M., & Salgado, L. (2012). Efficient spatio-temporal hole filling strategy for kinect depth maps. In Proceedings of SPIE.
Castellini, C., Tommasi, T., Noceti, N., Odone, F., & Caputo, B. (2011). Using object affordances to improve object recognition. In IEEE transactions on autonomous mental development.
Choi, C., Sinha, A., Hee Choi, J., Jang, S., & Ramani, K. (2015). A collaborative filtering approach to real-time hand pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 2336–2344).
Cooper, H. (2012). Sign language recognition using sub-units. The Journal of Machine Learning Research, 13, 2205.
MATH Google Scholar
Delamarre, Q., & Faugeras, O. (2001). 3D articulated models and multiview tracking with physical forces. Computer Vision and Image Understanding., 81, 328.
Article MATH Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer vision and pattern recognition (CVPR). IEEE.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. In IEEE transactions on pattern analysis and machine intelligence.
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., & Twombly, X. (2007). Vision-based hand pose estimation: A review. Computer Vision and Image Understanding., 108, 52.
Article Google Scholar
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303.
Article Google Scholar
Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. In IEEE transactions on pattern analysis and machine intelligence.
Fathi, A., Ren, X., & Rehg, J. M. (2011). Learning to recognize objects in egocentric activities. In 2011 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 3281–3288). IEEE.
Fei-Fei, L., Fergus, R., & Perona, P. (2007). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106, 59.
Article Google Scholar
Feix, T., Romero, J., Ek, C. H., Schmiedmayer, H., & Kragic, D. (2013). A metric for comparing the anthropomorphic motion capability of artificial hands. In IEEE transactions on robotics.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. In IEEE transactions on pattern analysis and machine intelligence.
Girard, M., & Maciejewski, A. A. (1985). Computational modeling for the computer animation of legged figures. ACM SIGGRAPH Computer Graphics, 19, 263.
Article Google Scholar
Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision (ECCV). Springer.
Intel. (2013). Perceptual computing SDK.
Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M., Saenko, K., et al. (2013). A category-level 3d object dataset: Putting the kinect to work. In Consumer depth cameras for computer vision. Springer, London
Keskin, C., Kıraç, F., Kara, Y. E., & Akarun, L. (2012). Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In European conference on computer vision (ECCV).
Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., & Fitzgibbon, A. (2015). Learning an efficient model of hand shape variation from depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2540–2548).
Li, C., & Kitani, K. M. (2013). Pixel-level hand detection in ego-centric videos. In Computer vision and pattern recognition (CVPR).
Li, P., Ling, H., Li, X., & Liao, C. (2015). 3d hand pose estimation using randomized decision forest with segmentation index points. In Proceedings of the IEEE international conference on computer vision (pp. 819–827).
Martin, D. R., Fowlkes, C. C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color, and texture cues. in IEEE transactions on pattern analysis and machine intelligence.
Melax, S., Keselman, L., & Orsten, S. (2013). Dynamics based 3D skeletal hand tracking. In Proceedings of the ACM SIGGRAPH symposium on interactive 3D graphics and games-I3D ’13.
Mo, Z., & Neumann, U. (2006). Real-time hand pose recognition using low-resolution depth images. In 2006 IEEE computer society conference on computer vision and pattern recognition (vol. 2, pp. 1499–1505). IEEE.
Moore, A. W., Connolly, A. J., Genovese, C., Gray, A., Grone, L., & Kanidoris, N, I. I., et al. (2001). Fast algorithms and efficient statistics: N-point correlation functions. In Mining the Sky. Springer.
Muja, M., & Lowe, D. G. (2014). Scalable nearest neighbor algorithms for high dimensional data. In IEEE transactions on pattern analysis and machine intelligence.
Oberweger, M., Riegler, G., Wohlhart, P., & Lepetit, V. (2016). Efficiently creating 3d training data for fine hand pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4957–4965).
Oberweger, M., Wohlhart, P., & Lepetit, V. (2015a). Hands deep in deep learning for hand pose estimation. In Computer vision winter workshop (CVWW).
Oberweger, M., Wohlhart, P., & Lepetit, V. (2015b). Training a feedback loop for hand pose estimation. In Proceedings of the IEEE international conference on computer vision (pp. 3316–3324).
Ohn-Bar, E., & Trivedi, M. M. (2014a). Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. In IEEE transactions on intelligent transportation systems.
Ohn-Bar, E., & Trivedi, M. M. (2014b). Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. IEEE Transactions on Intelligent Transportation Systems, 15(6), 2368–2377.
Article Google Scholar
Oikonomidis, I., Kyriazis, N., & Argyros, A. (2011). Efficient model-based 3D tracking of hand articulations using kinect. In British machine vision conference (BMVC).
Pang, Y., & Ling, H. (2013). Finding the best from the second bests-inhibiting subjective bias in evaluation of visual tracking algorithms. In International conference on computer vision (ICCV).
Pieropan, A., Salvi, G., Pauwels, K., & Kjellstrom, H. (2014). Audio-visual classification and detection of human manipulation actions. In International conference on intelligent robots and systems (IROS).
Premaratne, P., Nguyen, Q., & Premaratne, M. (2010). Human computer interaction using hand gestures. Berlin: Springer.
Book MATH Google Scholar
PrimeSense. (2013). Nite2 middleware, Version 2.2.
Qian, C., Sun, X., Wei, Y., Tang, X., & Sun, J. (2014). Realtime and robust hand tracking from depth. In Computer vision and pattern recognition (CVPR).
Ren, Z., Yuan, J., & Zhang, Z. (2011). Robust hand gesture recognition based on finger-earth mover’s distance with a commodity depth camera. In Proceedings of the 19th ACM international conference on Multimedia. ACM.
Rogez, G., Khademi, M., Supancic, III, J., Montiel, J. M. M., & Ramanan, D. (2014). 3D hand pose detection in egocentric RGB-D images. CDC4CV workshop, European conference on computer vision (ECCV).
Rogez, G., Supancic, III, J., & Ramanan, D. (2015a). First-person pose recognition using egocentric workspaces. In Computer vision and pattern recognition (CVPR).
Rogez, G., Supancic, J. S., & Ramanan, D. (2015b). Understanding everyday hands in action from RGB-D images. In Proceedings of the IEEE international conference on computer vision (pp. 3889–3897).
Romero, J., Kjellstr, H., & Kragic, D. (2009). Monocular real-time 3D articulated hand pose estimation. In International conference on humanoid robots.
Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: What have we done, and where are we going? In International conference on computer vision (ICCV). IEEE.
Šarić, M. (2011). Libhand: A library for hand articulation, Version 0.9.
Scharstein, D. (2002). A taxonomy and evaluation of dense two-frame stereo. International Journal of Computer Vision, 47, 7.
Article MATH Google Scholar
Shakhnarovich, G., Viola, P., & Darrell, T. (2003). Fast pose estimation with parameter-sensitive hashing. In International conference on computer vision (ICCV). IEEE.
Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., & Izadi, S. (2015). Accurate, robust, and flexible real-time hand tracking. In ACM conference on computer–human interaction.
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., et al. (2013). Real-time human pose recognition in parts from single depth images. Communications of the ACM., 56, 116.
Article Google Scholar
Song, S., & Xiao, J. (2014). Sliding shapes for 3D object detection in depth images. In European conference on computer vision (ECCV).
Sridhar, S., Mueller, F., Oulasvirta, A., & Theobalt, C. (2015). Fast and robust hand tracking using detection-guided optimization. In Computer vision and pattern recognition (CVPR).
Sridhar, S., Oulasvirta, A., & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using RGB and depth data. In International conference on computer vision (ICCV).
Stenger, B., Thayananthan, A., Torr, P. H. S., & Cipolla, R. (2006). Model-based hand tracking using a hierarchical Bayesian filter. In IEEE transactions on pattern analysis and machine intelligence.
Stokoe, W. C. (2005). Sign language structure: An outline of the visual communication systems of the American deaf. Journal of Deaf Studies and Deaf Education, 10, 3.
Article Google Scholar
Sun, X., Wei, Y., Liang, S., Tang, X., & Sun, J. (2015). Cascaded hand pose regression. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 824–832).
Tang, D., Chang, H. J., Tejani, A., & Kim, T.-K. (2014). Latent regression forest: Structured estimation of 3D articulated hand posture. In Computer vision and pattern recognition (CVPR).
Tang, D., Taylor, J., Kohli, P., Keskin, C., Kim, T.-K., & Shotton, J. (2015). Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In Proceedings of the IEEE international conference on computer vision (pp. 3325–3333).
Tang, D., Yu, T.H. & Kim, T.-K. (2013). Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In International conference on computer vision (ICCV).
Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., & Izadi, S., et al. (2014). User-specific hand modeling from monocular depth sequences. In Computer vision and pattern recognition (CVPR). IEEE.
Taylor, J., Bordeaux, L., Cashman, T., Corish, B., Keskin, C., Sharp, T., et al. (2016). Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics (TOG), 35(4), 143.
Article Google Scholar
Tompson, J., Stein, M., Lecun, Y., & Perlin, K. (2014). Real-time continuous pose recovery of human hands using convolutional networks. In ACM Transactions on Graphics.
Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In Computer vision and pattern recognition (CVPR). IEEE.
Tzionas, D., Srikantha, A., Aponte, P., & Gall, J. (2014). Capturing hand motion with an RGB-D sensor, fusing a generative model with salient points. In German Conference on Pattern Recognition (GCPR). Lecture notes in computer science. Springer.
Vezhnevets, V., Sazonov, V., & Andreeva, A. (2003). A survey on pixel-based skin color detection techniques. In Proceedings of the Graphicon, Moscow, Russia.
Wan, C., Yao, A., & Van Gool, L. (2016). Hand pose estimation from local surface normals. In European conference on computer vision (pp. 554–569). Springer.
Wetzler, A., Slossberg, R., & Kimmel, R. (2015). Rule of thumb: Deep derotation for improved fingertip detection. In British machine vision conference (BMVC). BMVA Press.
Xu, C., & Cheng, L. (2013). Efficient hand pose estimation from a single depth image. InInternational conference on computer vision (ICCV).
Yang, Y., & Ramanan, D. (2013). Articulated pose estimation with flexible mixtures-of-parts. In IEEE transactions on pattern analysis and machine intelligence.
Ye, Q., Yuan, S., & Kim, T.-K. (2016). Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In European conference on computer vision (pp. 346–361). Springer.
Zhu, X., Vondrick, C., Ramanan, D., & Fowlkes, C. (2012). Do we need more training data or better models for object detection? British Machine Vision Conference (BMVC), 3, 5.
Google Scholar

Download references

Acknowledgements

National Science Foundation Grant 0954083, Office of Naval Research-MURI Grant N00014-10-1-0933, and the Intel Science and Technology Center-Visual Computing supported JS&DR. The European Commission FP7 Marie Curie IOF grant “Egovision4Health” (PIOF-GA-2012-328288) supported GR.

Author information

Authors and Affiliations

University of California, Irvine, USA
James Steven Supančič III
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP*, LJK, 38000, Grenoble, France
Grégory Rogez
Institute of Engineering Univ., Grenoble Alpes, France
Grégory Rogez
Baidu Institute of Deep Learning, Sunnyvale, USA
Yi Yang
Microsoft Research, Cambridge, UK
Jamie Shotton
Carnegie Mellon University, Pittsburgh, PA, USA
Deva Ramanan

Authors

James Steven Supančič III
View author publications
You can also search for this author in PubMed Google Scholar
Grégory Rogez
View author publications
You can also search for this author in PubMed Google Scholar
Yi Yang
View author publications
You can also search for this author in PubMed Google Scholar
Jamie Shotton
View author publications
You can also search for this author in PubMed Google Scholar
Deva Ramanan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to James Steven Supančič III.

Additional information

Communicated by J. Rehg.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Supančič, J.S., Rogez, G., Yang, Y. et al. Depth-Based Hand Pose Estimation: Methods, Data, and Challenges. Int J Comput Vis 126, 1180–1198 (2018). https://doi.org/10.1007/s11263-018-1081-7

Download citation

Received: 03 December 2015
Accepted: 09 March 2018
Published: 12 April 2018
Issue Date: November 2018
DOI: https://doi.org/10.1007/s11263-018-1081-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

Abstract

Access this article

Similar content being viewed by others

Computer vision-based hand gesture recognition for human-robot interaction: a review

A review of computer vision-based approaches for physical rehabilitation and assessment

3D point cloud-based place recognition: a survey

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Abstract

Access this article

Similar content being viewed by others

Computer vision-based hand gesture recognition for human-robot interaction: a review

A review of computer vision-based approaches for physical rehabilitation and assessment

3D point cloud-based place recognition: a survey

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation