Deep Model-Based 6D Pose Refinement in RGB

  • Fabian Manhardt
  • Wadim Kehl
  • Nassir Navab
  • Federico Tombari
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)


We present a novel approach for model-based 6D pose refinement in color data. Building on the established idea of contour-based pose tracking, we teach a deep neural network to predict a translational and rotational update. At the core, we propose a new visual loss that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model. In contrast to previous work, our method is correspondence-free and segmentation-free, can handle occlusion, and is agnostic to geometrical symmetry as well as visual ambiguities. Additionally, we observe a strong robustness towards rough initialization. The approach can run in real-time and produces pose accuracies that come close to 3D ICP without the need for depth data. Furthermore, our networks are trained from purely synthetic data and will be published together with the refinement code to ensure reproducibility.
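
To make the notion of a "translational and rotational update" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the update is given as a unit quaternion `dq` and a translation offset `dt`, that the rotation update is left-multiplied onto the current hypothesis, and that a pinhole camera `K` is known. All function names are hypothetical. The mean reprojection distance used here is only a simple stand-in for an alignment measure; it is not the visual loss proposed in the paper, and the "ideal" update is constructed by hand rather than predicted by a network.

```python
import numpy as np


def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def apply_update(R, t, dq, dt):
    """Compose a pose hypothesis (R, t) with an update (dq, dt).

    Assumed convention: the rotation update is left-multiplied and the
    translation update is added in camera coordinates.
    """
    t_new = np.asarray(t, dtype=float) + np.asarray(dt, dtype=float)
    return quat_to_rot(dq) @ R, t_new


def project(points, R, t, K):
    """Project Nx3 model points into the image with a pinhole camera K."""
    cam = points @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]


def alignment_error(points, R, t, R_ref, t_ref, K):
    """Mean 2D distance in pixels between projections under two poses."""
    diff = project(points, R, t, K) - project(points, R_ref, t_ref, K)
    return float(np.linalg.norm(diff, axis=1).mean())


# Hypothetical camera intrinsics and a few model points (metres).
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
points = np.array([[0.05, 0.0, 0.0], [0.0, 0.05, 0.0],
                   [0.0, 0.0, 0.05], [-0.05, 0.0, 0.0]])

# Reference pose and a rough initial hypothesis (10 degrees off, shifted).
R_ref, t_ref = np.eye(3), np.array([0.0, 0.0, 0.5])
half = np.deg2rad(10.0) / 2.0
R_init = quat_to_rot([np.cos(half), 0.0, 0.0, np.sin(half)])
t_init = np.array([0.02, -0.01, 0.55])

# Hand-constructed ideal update that exactly undoes the perturbation.
dq = [np.cos(half), 0.0, 0.0, -np.sin(half)]
dt = t_ref - t_init

R_new, t_new = apply_update(R_init, t_init, dq, dt)
before = alignment_error(points, R_init, t_init, R_ref, t_ref, K)
after = alignment_error(points, R_new, t_new, R_ref, t_ref, K)
print(f"error before: {before:.2f} px, after: {after:.2e} px")
```

In a learned refiner, `dq` and `dt` would come from the network, and the update could be applied repeatedly until the alignment measure stops improving.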


Keywords: Pose estimation, Pose refinement, Tracking



We would like to thank Toyota Motor Corporation for funding and supporting this work.




Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Fabian Manhardt (1)
  • Wadim Kehl (2)
  • Nassir Navab (1)
  • Federico Tombari (1)

  1. Technical University of Munich, Garching b. Muenchen, Germany
  2. Toyota Research Institute, Los Altos, USA
