ShapeCodes: Self-supervised Feature Learning by Lifting Views to Viewgrids

  • Dinesh Jayaraman
  • Ruohan Gao
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Abstract

We introduce an unsupervised feature learning approach that embeds 3D shape information into a single-view image representation. The main idea is a self-supervised training objective that, given only a single 2D image, requires all unseen views of the object to be predictable from learned features. We implement this idea as an encoder-decoder convolutional neural network. The network maps an input image of an unknown category and unknown viewpoint to a latent space, from which a deconvolutional decoder can best “lift” the image to its complete viewgrid showing the object from all viewing angles. Our class-agnostic training procedure encourages the representation to capture fundamental shape primitives and semantic regularities in a data-driven manner—without manual semantic labels. Our results on two widely-used shape datasets show (1) our approach successfully learns to perform “mental rotation” even for objects unseen during training, and (2) the learned latent space is a powerful representation for object recognition, outperforming several existing unsupervised feature learning methods.
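The training objective described above can be sketched in miniature. Everything below (the dimensions, the plain linear encoder/decoder, and the helper names) is an illustrative stand-in for the paper's convolutional encoder-decoder, not the authors' implementation: a single observed view is mapped to a latent code, the code is decoded into every view in the grid, and the self-supervised loss is the regression error against the object's other views.

```python
import math
import random

# Toy sketch of "lifting a view to a viewgrid": plain linear maps stand in
# for the paper's conv/deconv network. All sizes and names are hypothetical.

random.seed(0)

D = 16      # flattened pixels per view (hypothetical)
Z = 8       # latent "ShapeCode" dimension (hypothetical)
V = 6       # number of views in the viewgrid (hypothetical)

W_enc = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(Z)]
W_dec = [[random.gauss(0, 0.1) for _ in range(Z)] for _ in range(V * D)]

def matvec(W, x):
    """Dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lift_to_viewgrid(view):
    """Encode one observed view to a latent code, then decode all V views."""
    z = [math.tanh(a) for a in matvec(W_enc, view)]   # latent shape code
    flat = matvec(W_dec, z)
    return [flat[i * D:(i + 1) * D] for i in range(V)]

def viewgrid_loss(pred, target):
    """Per-pixel mean squared error over every view in the grid."""
    err = [(p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv)]
    return sum(err) / len(err)

# One self-supervised training example: no manual labels are needed, only
# the other rendered views of the same object as regression targets.
view = [random.gauss(0, 1) for _ in range(D)]
target = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
print("loss on one example:", viewgrid_loss(lift_to_viewgrid(view), target))
```

Minimizing this loss over many objects forces the latent code to carry whatever 3D shape information is needed to predict unseen viewpoints, which is why the code then transfers to recognition.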



This research is supported in part by DARPA Lifelong Learning Machines, ONR PECASE N00014-15-1-2291, an IBM Open Collaborative Research Award, and Berkeley DeepDrive.

Supplementary material

Supplementary material 1: 474218_1_En_8_MOESM1_ESM.pdf (PDF, 5.9 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Dinesh Jayaraman (1, 2)
  • Ruohan Gao (2)
  • Kristen Grauman (2, 3)

  1. UC Berkeley, Berkeley, USA
  2. UT Austin, Austin, USA
  3. Facebook AI Research, Menlo Park, USA
