ShapeCodes: Self-supervised Feature Learning by Lifting Views to Viewgrids

  • Dinesh Jayaraman
  • Ruohan Gao
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)

Abstract

We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation. The main idea is a self-supervised training objective that, given only a single 2D image, requires all unseen views of the object to be predictable from learned features. We implement this idea as an encoder-decoder convolutional neural network. The network maps an input image of an unknown category and unknown viewpoint to a latent space, from which a deconvolutional decoder can best “lift” the image to its complete viewgrid showing the object from all viewing angles. Our class-agnostic training procedure encourages the representation to capture fundamental shape primitives and semantic regularities in a data-driven manner—without manual semantic labels. Our results on two widely-used shape datasets show (1) our approach successfully learns to perform “mental rotation” even for objects unseen during training, and (2) the learned latent space is a powerful representation for object recognition, outperforming several existing unsupervised feature learning methods.
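Reading the abstract as an architecture description, the pipeline can be sketched in code. The following is a minimal, illustrative PyTorch sketch of the lifting idea, not the paper's actual network: the class name ViewgridLifter, the 32x32 resolution, the layer sizes, and the 32-view grid are all assumptions made for the sake of a small self-contained example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewgridLifter(nn.Module):
    """One-shot encoder-decoder: a single view in, a full viewgrid out.
    All dimensions here are illustrative assumptions, not the paper's."""

    def __init__(self, latent_dim=128, num_views=32):
        super().__init__()
        self.num_views = num_views
        # Encoder: single 2D view -> latent "ShapeCode".
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),   # 32x32 -> 16x16
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),  # 16x16 -> 8x8
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, latent_dim),
        )
        # Decoder: latent code -> one predicted image per viewpoint,
        # stacked along the channel axis to form the viewgrid.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 8 * 8),
            nn.ReLU(inplace=True),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),        # 8x8 -> 16x16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_views, kernel_size=4, stride=2, padding=1), # 16x16 -> 32x32
        )

    def forward(self, x):
        code = self.encoder(x)         # the feature later reused for recognition
        viewgrid = self.decoder(code)  # shape (B, num_views, 32, 32)
        return code, viewgrid

# Self-supervised objective: given one view, reconstruct all views.
model = ViewgridLifter()
single_view = torch.randn(4, 1, 32, 32)    # a batch of observed views
target_grid = torch.randn(4, 32, 32, 32)   # placeholder ground-truth viewgrids
code, pred_grid = model(single_view)
loss = F.mse_loss(pred_grid, target_grid)  # per-view reconstruction loss
loss.backward()
```

After training on this reconstruction objective, the decoder can be discarded and the encoder output used as the single-view feature for downstream recognition, which is how the abstract describes the learned latent space being evaluated.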

Acknowledgements

This research is supported in part by DARPA Lifelong Learning Machines, ONR PECASE N00014-15-1-2291, an IBM Open Collaborative Research Award, and Berkeley DeepDrive.

Supplementary material

Supplementary material 1: 474218_1_En_8_MOESM1_ESM.pdf (PDF, 5.9 MB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Dinesh Jayaraman (1, 2)
  • Ruohan Gao (2)
  • Kristen Grauman (2, 3)
  1. UC Berkeley, Berkeley, USA
  2. UT Austin, Austin, USA
  3. Facebook AI Research, Menlo Park, USA
