Exploiting Temporal Information for 3D Human Pose Estimation

  • Mir Rayat Imtiaz Hossain
  • James J. Little
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)

Abstract

In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict 3D pose directly from images, the top-performing approaches have shown the effectiveness of dividing the task into two steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from images, and then mapping it into 3D space. They also showed that a low-dimensional representation, such as the 2D locations of a set of joints, can be discriminative enough to estimate 3D pose with high accuracy. However, estimating 3D pose for individual frames leads to temporally incoherent estimates, because the independent error in each frame causes jitter. Therefore, in this work we utilize the temporal information across a sequence of 2D joint locations to estimate a sequence of 3D poses. We designed a sequence-to-sequence network composed of layer-normalized LSTM units, with shortcut connections connecting the input to the output on the decoder side, and imposed a temporal smoothness constraint during training. We found that the knowledge of temporal consistency improves the best reported result on the Human3.6M dataset by approximately \(12.2\%\) and helps our network to recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.
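The abstract does not give the exact form of the temporal smoothness constraint, so the following is only a minimal sketch of one plausible reading: a per-frame squared error plus a penalty on the squared first-order difference between consecutive predicted poses. The function names, the flattened pose representation, and the `smooth_weight` value are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence

# Assumption: each pose is a flat sequence of 3D joint coordinates for one frame.
Pose = Sequence[float]


def mse(pred: Sequence[Pose], target: Sequence[Pose]) -> float:
    """Mean squared error over all frames and coordinates."""
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            total += (a - b) ** 2
            count += 1
    return total / count


def temporal_smoothness(pred: Sequence[Pose]) -> float:
    """Mean squared first-order difference between consecutive frames."""
    if len(pred) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(pred, pred[1:]):
        for a, b in zip(prev, cur):
            total += (b - a) ** 2
            count += 1
    return total / count


def sequence_loss(pred: Sequence[Pose], target: Sequence[Pose],
                  smooth_weight: float = 0.1) -> float:
    """Hypothetical training loss: pose error plus weighted smoothness penalty."""
    return mse(pred, target) + smooth_weight * temporal_smoothness(pred)


# Toy data: a single joint held still for four frames.
target = [[0.0, 0.0, 0.0]] * 4
# Same per-frame error magnitude, but the sign flips every frame (jitter).
jittery = [[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0], [0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]]
# Same per-frame error magnitude, constant over time (no jitter).
steady = [[0.1, 0.0, 0.0]] * 4

print(sequence_loss(jittery, target) > sequence_loss(steady, target))
```

In this toy case both predictions have the same per-frame error, so a purely frame-wise loss cannot tell them apart; only the smoothness term penalizes the jittery sequence, which is the kind of temporal incoherence the abstract attributes to independent per-frame estimation.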

Keywords

3D human pose, Sequence-to-sequence networks, Layer normalized LSTM, Residual connections

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science, University of British Columbia, Vancouver, Canada
