DeepIM: Deep Iterative Matching for 6D Pose Estimation

  • Yi Li
  • Gu Wang
  • Xiangyang Ji
  • Yu Xiang
  • Dieter Fox

Abstract

Estimating the 6D poses of objects from images is an important problem in applications such as robot manipulation and virtual reality. While direct regression from images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimate, our network iteratively refines the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
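
The refinement loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_predict_delta` is a hypothetical stand-in for the network that reads the target pose directly instead of comparing a rendered image with an observed one, rotations are restricted to the z-axis for brevity, and the update rule (rotation composed on the left, translation added independently) is only one assumed way of keeping orientation and location updates separate.

```python
import numpy as np


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def toy_predict_delta(R_cur, t_cur, R_obs, t_obs, gain=0.5):
    """Hypothetical stand-in for the pose-matching network.

    A real system would compare a rendering at (R_cur, t_cur) with the
    observed image. Here we cheat and read the target pose, returning
    only a fraction (`gain`) of the full correction so that several
    iterations are needed.
    """
    R_rel = R_obs @ R_cur.T                       # R_obs = R_rel @ R_cur
    angle = np.arctan2(R_rel[1, 0], R_rel[0, 0])  # valid for z-rotations only
    return rot_z(gain * angle), gain * (t_obs - t_cur)


def refine(R_init, t_init, R_obs, t_obs, iters=4):
    """Iteratively apply predicted relative pose updates."""
    R, t = R_init.copy(), t_init.copy()
    for _ in range(iters):
        R_delta, t_delta = toy_predict_delta(R, t, R_obs, t_obs)
        R = R_delta @ R  # orientation update, independent of translation
        t = t + t_delta  # location update, independent of rotation
    return R, t


if __name__ == "__main__":
    R_obs, t_obs = rot_z(0.7), np.array([0.1, -0.2, 1.0])
    R0, t0 = rot_z(0.3), np.array([0.0, 0.0, 0.8])
    R, t = refine(R0, t0, R_obs, t_obs)
    print("remaining translation error:", t_obs - t)
```

With a gain of 0.5, each pass of this toy predictor halves the remaining error, which is why repeated application converges toward the target pose even though no single step is exact.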

Keywords

3D object recognition, 6D object pose estimation, Object tracking

Notes

Acknowledgements

We thank Lirui Wang at the University of Washington for his contribution to this project. This work was funded in part by a Siemens grant. We would also like to thank NVIDIA for generously providing the DGX station used for this research via the NVIDIA Robotics Lab and the UW NVIDIA AI Lab (NVAIL). This work was also supported by the National Key R&D Program of China 2017YFB1002202, NSFC Projects 61620106005 and 61325003, Beijing Municipal Sci. & Tech. Commission Z181100008918014, and the THU Initiative Scientific Research Program.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Yi Li (1, 2; email author)
  • Gu Wang (2)
  • Xiangyang Ji (2)
  • Yu Xiang (3)
  • Dieter Fox (1, 3)

  1. University of Washington, Seattle, USA
  2. Tsinghua University and BNRist, Beijing, China
  3. NVIDIA, Seattle, USA
