StarMap for Category-Agnostic Keypoint and Viewpoint Estimation

  • Xingyi Zhou
  • Arjun Karpur
  • Linjie Luo
  • Qixing Huang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)

Abstract

Semantic keypoints provide concise abstractions for a variety of visual understanding tasks. Existing methods define semantic keypoints separately for each category, with a fixed number of semantic labels at fixed indices. As a result, this keypoint representation is infeasible when objects have a varying number of parts, e.g. chairs with varying numbers of legs. We propose a category-agnostic keypoint representation, which combines a multi-peak heatmap (StarMap) for all the keypoints and their corresponding features as 3D locations in the canonical viewpoint (CanViewFeature) defined for each instance. Our intuition is that the 3D locations of the keypoints in canonical object views contain rich semantic and compositional information. Using our flexible representation, we demonstrate competitive performance in keypoint detection and localization compared to category-specific state-of-the-art methods. Moreover, we show that when augmented with an additional depth channel (DepthMap) to lift the 2D keypoints to 3D, our representation can achieve state-of-the-art results in viewpoint estimation. Finally, we show that our category-agnostic keypoint representation can be generalized to novel categories.
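As a rough illustration of how the three outputs described above could be combined at inference time, the sketch below extracts peaks from a single multi-peak heatmap and reads off the canonical 3D feature and depth at each peak location. All names, array shapes, and the detection threshold here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_keypoints(starmap, canviewfeature, depthmap, thresh=0.25):
    """Extract keypoints from a single multi-peak heatmap (StarMap).

    starmap:        (H, W) heatmap with one peak per visible keypoint
    canviewfeature: (3, H, W) per-pixel 3D location in the canonical view
    depthmap:       (H, W) per-pixel depth used to lift 2D points to 3D
    Returns a list of (u, v, canonical_xyz, depth) tuples.
    """
    # A peak is a local maximum (within a 3x3 window) above the threshold.
    local_max = maximum_filter(starmap, size=3) == starmap
    peaks = np.argwhere(local_max & (starmap > thresh))
    keypoints = []
    for v, u in peaks:
        xyz = canviewfeature[:, v, u]  # keypoint identity: its 3D location
                                       # in the canonical view
        d = depthmap[v, u]             # lifts (u, v) to camera-space 3D
        keypoints.append((u, v, xyz, d))
    return keypoints
```

Given such correspondences between canonical 3D locations and lifted camera-space 3D points, the viewpoint could then be recovered by rigid alignment, e.g. a closed-form solution in the style of Horn's absolute-orientation method.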

Keywords

3D vision · Category-agnostic · Keypoint estimation · Viewpoint estimation · Pose estimation

Notes

Acknowledgement

We thank Shubham Tulsiani and Angela Lin for the helpful discussions.

Supplementary material

Supplementary material 1: 474172_1_En_20_MOESM1_ESM.pdf (40.1 MB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. The University of Texas at Austin, Austin, USA
  2. Snap Inc., Venice, USA