Object Pose Estimation from Monocular Image Using Multi-view Keypoint Correspondence

  • Jogendra Nath Kundu
  • M. V. Rahul
  • Aditya Ganeshan
  • R. Venkatesh Babu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11131)


Abstract

Understanding the geometry and pose of objects in 2D images is a fundamental necessity for a wide range of real-world applications. Driven by deep neural networks, recent methods have brought significant improvements to object pose estimation. However, they suffer from the scarcity of keypoint/pose-annotated real images and hence cannot exploit the object's 3D structural information effectively. In this work, we propose a data-efficient method which utilizes the geometric regularity of intra-class objects for pose estimation. First, we learn pose-invariant local descriptors of object parts from simple 2D RGB images. These descriptors, along with keypoints obtained from renders of a fixed 3D template model, are then used to generate keypoint correspondence maps for a given monocular real image. Finally, a pose estimation network predicts the 3D pose of the object using these correspondence maps. This pipeline is further extended to a multi-view approach, which assimilates keypoint information from correspondence sets generated from multiple views of the 3D template model. Fusion of multi-view information significantly improves the geometric comprehension of the system, which in turn enhances the pose estimation performance. Furthermore, the correspondence framework used to learn pose-invariant keypoint descriptors also allows us to effectively alleviate the data-scarcity problem. This enables our method to achieve state-of-the-art performance on multiple real-image viewpoint estimation datasets, such as Pascal3D+ and ObjectNet3D. To encourage reproducible research, we have released the code for our proposed approach.
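The pipeline described above (pose-invariant part descriptors, correspondence maps against keypoints from rendered template views, then multi-view fusion) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function and variable names are assumptions, and plain cosine similarity plus per-pixel max-pooling stand in for the learned descriptor and fusion networks.

```python
import numpy as np

def correspondence_map(image_desc, keypoint_desc):
    """Dense keypoint-correspondence map for one rendered template view.

    image_desc:    (H, W, D) descriptor field extracted from the real image
    keypoint_desc: (K, D) descriptors of the K template keypoints in this view
    returns:       (K, H, W) cosine-similarity map, one channel per keypoint
    """
    H, W, D = image_desc.shape
    img = image_desc.reshape(-1, D)
    # L2-normalize so the dot product below is a cosine similarity in [-1, 1].
    img = img / (np.linalg.norm(img, axis=1, keepdims=True) + 1e-8)
    kp = keypoint_desc / (np.linalg.norm(keypoint_desc, axis=1, keepdims=True) + 1e-8)
    sim = kp @ img.T                      # (K, H*W)
    return sim.reshape(-1, H, W)

def fuse_multiview(maps_per_view):
    """Fuse correspondence maps from several rendered views of the template
    by taking the per-keypoint, per-pixel maximum response."""
    return np.max(np.stack(maps_per_view, axis=0), axis=0)
```

In the paper, the fused correspondence maps would then be fed to the pose estimation network; here the fusion is a simple maximum, chosen only to show how evidence from multiple template views can be combined into one map per keypoint.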


Keywords: Pose estimation · 3D structure · Keypoint estimation · Correspondence network · Convolutional neural network

Supplementary material

478822_1_En_23_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 1060 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Indian Institute of Science, Bengaluru, India
