Advertisement

End-to-End 6-DoF Object Pose Estimation Through Differentiable Rasterization

  • Andrea PalazziEmail author
  • Luca Bergamini
  • Simone Calderara
  • Rita Cucchiara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)

Abstract

Here we introduce an approximated differentiable renderer to refine a 6-DoF pose prediction using only 2D alignment information. To this end, a two-branched convolutional encoder network is employed to jointly estimate the object class and its 6-DoF pose in the scene. We then propose a new formulation of an approximated differentiable renderer to re-project the 3D object on the image according to its predicted pose; in this way the alignment error between the observed and the re-projected object silhouette can be measured. Since the renderer is differentiable, it is possible to back-propagate through it to correct the estimated pose at test time in an online learning fashion. Eventually we show how to leverage the classification branch to profitably re-project a representative model of the predicted class (i.e. a medoid) instead. Each object in the scene is processed independently and novel viewpoints in which both objects arrangement and mutual pose are preserved can be rendered.

Differentiable renderer code is available at: https://github.com/ndrplz/tensorflow-mesh-renderer.

Keywords

6-DoF pose estimation Differentiable rendering 

References

  1. 1.
    Agarwal, S., et al.: Building Rome in a day. Commun. ACM 54(10), 105–112 (2011)CrossRefGoogle Scholar
  2. 2.
    Aubry, M., Maturana, D., Efros, A.A., Russell, B.C., Sivic, J.: Seeing 3D chairs: exemplar part-based 2D–3D alignment using a large dataset of CAD models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3769 (2014)Google Scholar
  3. 3.
    Blender Online Community: Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam (2017). http://www.blender.org
  4. 4.
    Boyer, E., Franco, J.S.: A hybrid approach for computing visual hulls of complex objects. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 695–701. IEEE Computer Society Press (2003)Google Scholar
  5. 5.
    Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
  6. 6.
    Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2018)CrossRefGoogle Scholar
  7. 7.
    Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46484-8_38CrossRefGoogle Scholar
  8. 8.
    Collet, A., Berenson, D., Srinivasa, S.S., Ferguson, D.: Object recognition and full pose registration from a single image for robotic manipulation. In: IEEE International Conference on Robotics and Automation, ICRA 2009, pp. 48–55. IEEE (2009)Google Scholar
  9. 9.
    Collet, A., Martinez, M., Srinivasa, S.S.: The moped framework: object recognition and pose estimation for manipulation. Int. J. Robot. Res. 30(10), 1284–1306 (2011)CrossRefGoogle Scholar
  10. 10.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, pp. 248–255. IEEE (2009)Google Scholar
  11. 11.
    Dosovitskiy, A., Springenberg, J.T., Tatarchenko, M., Brox, T.: Learning to generate chairs, tables and cars with convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 692–705 (2017)Google Scholar
  12. 12.
    Du, X., Ang Jr., M.H., Karaman, S., Rus, D.: A general pipeline for 3D detection of vehicles. In: ICRA (2018)Google Scholar
  13. 13.
    Fitzgibbon, A., Zisserman, A.: Automatic 3D model acquisition and generation of new images from video sequences. In: 9th European Signal Processing Conference (EUSIPCO 1998), pp. 1–8. IEEE (1998)Google Scholar
  14. 14.
    Gadelha, M., Maji, S., Wang, R.: 3D shape induction from 2D views of multiple objects. 3D Vision (2017)Google Scholar
  15. 15.
    Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F.: The lumigraph. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 43–54. ACM (1996)Google Scholar
  16. 16.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)Google Scholar
  17. 17.
    Huynh, D.Q.: Metrics for 3D rotations: comparison and analysis. J. Math. Imaging Vis. 35(2), 155–164 (2009)MathSciNetCrossRefGoogle Scholar
  18. 18.
    Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)Google Scholar
  19. 19.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  20. 20.
    Kolev, K., Klodt, M., Brox, T., Cremers, D.: Continuous global optimization in multiview 3D reconstruction. Int. J. Comput. Vis. 84(1), 80–96 (2009)CrossRefGoogle Scholar
  21. 21.
    Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: an accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 81(2), 155 (2009)CrossRefGoogle Scholar
  22. 22.
    Lim, J.J., Khosla, A., Torralba, A.: FPM: fine pose parts-based model with 3D CAD models. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 478–493. Springer, Cham (2014).  https://doi.org/10.1007/978-3-319-10599-4_31CrossRefGoogle Scholar
  23. 23.
    Long, J.L., Zhang, N., Darrell, T.: Do convnets learn correspondence? In: Advances in Neural Information Processing Systems, pp. 1601–1609 (2014)Google Scholar
  24. 24.
    Loper, M.M., Black, M.J.: OpenDR: an approximate differentiable renderer. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 154–169. Springer, Cham (2014).  https://doi.org/10.1007/978-3-319-10584-0_11CrossRefGoogle Scholar
  25. 25.
    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)MathSciNetCrossRefGoogle Scholar
  26. 26.
    Moreno-Noguer, F., Lepetit, V., Fua, P.: Accurate non-iterative O(n) solution to the PnP problem. In: IEEE 11th international conference on Computer vision, ICCV 2007, pp. 1–8. IEEE (2007)Google Scholar
  27. 27.
    Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46484-8_29CrossRefGoogle Scholar
  28. 28.
    Pavlakos, G., Zhou, X., Chan, A., Derpanis, K.G., Daniilidis, K.: 6-DoF object pose from semantic keypoints. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018. IEEE (2017)Google Scholar
  29. 29.
    Pollefeys, M., Koch, R., Vergauwen, M., Van Gool, L.: Metric 3D surface reconstruction from uncalibrated image sequences. In: Koch, R., Van Gool, L. (eds.) SMILE 1998. LNCS, vol. 1506, pp. 139–154. Springer, Heidelberg (1998).  https://doi.org/10.1007/3-540-49437-5_10CrossRefGoogle Scholar
  30. 30.
    Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)Google Scholar
  31. 31.
    Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3D structure from images. In: Advances in Neural Information Processing Systems, pp. 4996–5004 (2016)Google Scholar
  32. 32.
    Saponaro, P., Sorensen, S., Rhein, S., Mahoney, A.R., Kambhamettu, C.: Reconstruction of textureless regions using structure from motion and image-based interpolation. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 1847–1851. IEEE (2014)Google Scholar
  33. 33.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  34. 34.
    Sinha, A., Unmesh, A., Huang, Q., Ramani, K.: SurfNet: generating 3D shape surfaces using deep residual networks. In: Proceedings of CVPR (2017)Google Scholar
  35. 35.
    Starck, J., Hilton, A.: Model-based human shape reconstruction from multiple views. Comput. Vis. Image Underst. 111(2), 179–194 (2008)CrossRefGoogle Scholar
  36. 36.
    Stark, M., Goesele, M., Schiele, B.: Back to the future: learning shape models from 3D CAD data. In: BMVC, vol. 2, p. 5. Citeseer (2010)Google Scholar
  37. 37.
    Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694 (2015)Google Scholar
  38. 38.
    Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 322–337. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46478-7_20CrossRefGoogle Scholar
  39. 39.
    Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660 (2014)Google Scholar
  40. 40.
    Tulsiani, S., Malik, J.: Viewpoints and keypoints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1519 (2015)Google Scholar
  41. 41.
    Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: CVPR, vol. 1, p. 3 (2017)Google Scholar
  42. 42.
    Vogiatzis, G., Esteban, C.H., Torr, P.H., Cipolla, R.: Multiview stereo via volumetric graph-cuts and occlusion robust photo-consistency. IEEE Trans. Pattern Anal. Mach. Intell. 29(12), 2241–2246 (2007)CrossRefGoogle Scholar
  43. 43.
    Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)Google Scholar
  44. 44.
    Wiles, O., Zisserman, A.: SilNet: single-and multi-view reconstruction by learning from silhouettes. In: British Machine Vision Conference (2017)Google Scholar
  45. 45.
    Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 5–32 (1992)zbMATHGoogle Scholar
  46. 46.
    Wu, J., Zhang, C., Xue, T., Freeman, B., Tenenbaum, J.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: Advances in Neural Information Processing Systems, pp. 82–90 (2016)Google Scholar
  47. 47.
    Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In: Advances in Neural Information Processing Systems, pp. 1696–1704 (2016)Google Scholar
  48. 48.
    Yang, J., Reed, S.E., Yang, M.H., Lee, H.: Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In: Advances in Neural Information Processing Systems, pp. 1099–1107 (2015)Google Scholar
  49. 49.
    Zhou, X., Zhu, M., Leonardos, S., Derpanis, K.G., Daniilidis, K.: Sparseness meets deepness: 3D human pose estimation from monocular video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4966–4975 (2016)Google Scholar
  50. 50.
    Zhu, M., Zhou, X., Daniilidis, K.: Single image pop-up from discriminatively learned parts. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 927–935 (2015)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Andrea Palazzi
    • 1
    Email author
  • Luca Bergamini
    • 1
  • Simone Calderara
    • 1
  • Rita Cucchiara
    • 1
  1. 1.University of Modena and Reggio EmiliaModenaItaly

Personalised recommendations