Joint Viewpoint and Keypoint Estimation with Real and Synthetic Data

  • Pau Panareda Busto
  • Juergen Gall
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11824)

Abstract

The estimation of viewpoints and keypoints effectively enhances object detection methods by extracting valuable traits of the object instances. While the outputs of the two processes differ, i.e., angles vs. a list of characteristic points, both focus on how the object is placed in the scene, which implies a certain level of correlation between them. Therefore, we propose a convolutional neural network that jointly computes the viewpoint and keypoints for different object categories. By training both tasks together, each task improves the accuracy of the other. Since the labelling of object keypoints is very time consuming for human annotators, we also introduce a new synthetic dataset with automatically generated viewpoint and keypoint annotations. Our proposed network can also be trained on datasets that contain both viewpoint and keypoint annotations or only one of them. The experiments show that the proposed approach successfully exploits this implicit correlation between the tasks and outperforms previous techniques that are trained independently.
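
The abstract does not state the loss used, so the following is only an illustrative sketch of how a jointly trained network could cope with samples that carry viewpoint annotations, keypoint annotations, or both. It assumes the viewpoint is discretised into bins and scored with cross-entropy, and that keypoints are regressed as coordinates and scored with mean squared error. The function names and the kp_weight parameter are hypothetical and are not taken from the paper.

```python
import math
from typing import Optional, Sequence


def viewpoint_loss(logits: Sequence[float], target_bin: int) -> float:
    """Cross-entropy over discretised viewpoint bins (assumed formulation)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_bin]


def keypoint_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over flattened keypoint coordinates (assumed formulation)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def joint_loss(
    view_logits: Sequence[float],
    kp_pred: Sequence[float],
    view_target: Optional[int],
    kp_target: Optional[Sequence[float]],
    kp_weight: float = 1.0,  # hypothetical balancing weight between the tasks
) -> float:
    """Sum the task losses, skipping any task whose annotation is missing (None)."""
    loss = 0.0
    if view_target is not None:
        loss += viewpoint_loss(view_logits, view_target)
    if kp_target is not None:
        loss += kp_weight * keypoint_loss(kp_pred, kp_target)
    return loss
```

With this kind of masking, a sample that has only a viewpoint label contributes nothing to the keypoint term, so datasets with partial annotations can be mixed in the same training run.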

Notes

Acknowledgement

The work has been supported by the ERC Starting Grant ARCA (677650).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Computer Vision Group, University of Bonn, Bonn, Germany
