Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance

  • Zhixin ShuEmail author
  • Mihir Sahasrabudhe
  • Rıza Alp Güler
  • Dimitris Samaras
  • Nikos Paragios
  • Iasonas Kokkinos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11214)


In this work we introduce Deforming Autoencoders, a generative model for images that disentangles shape from appearance in an unsupervised manner. As in the deformable template paradigm, shape is represented as a deformation between a canonical coordinate system (‘template’) and an observed image, while appearance is modeled in deformation-invariant, template coordinates. We introduce novel techniques that allow this approach to be deployed in the setting of autoencoders and show that this method can be used for unsupervised group-wise image alignment. We show experiments with expression morphing in humans, hands, and digits, face manipulation, such as shape and appearance interpolation, as well as unsupervised landmark localization. We also achieve a more powerful form of unsupervised disentangling in template coordinates, that successfully decomposes face images into shading and albedo, allowing us to further manipulate face images.



This work was supported by a gift from Adobe, NSF grants CNS-1718014 and DMS 1737876, the Partner University Fund, and the SUNY2020 Infrastructure Transportation Security Center. Rıza Alp Güler was supported by the European Horizons 2020 grant no 643666 (I-Support).

Supplementary material

474197_1_En_40_MOESM1_ESM.pdf (15.5 mb)
Supplementary material 1 (pdf 15870 KB)


  1. 1.
    Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: NIPS (2016)Google Scholar
  2. 2.
    Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: CVPR (2017)Google Scholar
  3. 3.
    Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Interpretable transformations with encoder-decoder networks. In: CVPR (2017)Google Scholar
  4. 4.
    Sengupta, S., Kanazawa, A., Castillo, C.D., Jacobs, D.: SfSNet: learning shape, reflectance and illuminance of faces in the wild. arXiv preprint arXiv:1712.01261 (2017)
  5. 5.
    Memisevic, R., Hinton, G.E.: Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Comput. 22, 1473–1492 (2010)CrossRefGoogle Scholar
  6. 6.
    Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: deep translation and rotation equivariance (2016)Google Scholar
  7. 7.
    Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 702–711. IEEE (2017)Google Scholar
  8. 8.
    Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.: Fader networks: manipulating images by sliding attributes. CoRR abs/1706.00409 (2017)Google Scholar
  9. 9.
    Edwards, G.J., Cootes, T.F., Taylor, C.J.: Face recognition using active appearance models. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 581–595. Springer, Heidelberg (1998). Scholar
  10. 10.
    Matthews, I., Baker, S.: Active appearance models revisited. IJCV 60, 135–164 (2004)CrossRefGoogle Scholar
  11. 11.
    Learned-Miller, E.G.: Data driven image models through continuous joint alignment. PAMI 28, 236–250 (2006)CrossRefGoogle Scholar
  12. 12.
    Kokkinos, I., Yuille, A.L.: Unsupervised learning of object deformation models. In: ICCV (2007)Google Scholar
  13. 13.
    Frey, B.J., Jojic, N.: Transformation-invariant clustering using the EM algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 25(1), 1–17 (2003)CrossRefGoogle Scholar
  14. 14.
    Jojic, N., Frey, B.J., Kannan, A.: Epitomic analysis of appearance and shape. In: 9th IEEE International Conference on Computer Vision (ICCV 2003), 14–17 October 2003, Nice, France, pp. 34–43 (2003)Google Scholar
  15. 15.
    Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. CoRR abs/1506.02025 (2015)Google Scholar
  16. 16.
    Papandreou, G., Kokkinos, I., Savalle, P.: Modeling local and global deformations in deep learning: epitomic convolution, multiple instance learning, and sliding window detection. In: CVPR (2015)Google Scholar
  17. 17.
    Dai, J., et al.: Deformable convolutional networks. In: ICCV (2017)Google Scholar
  18. 18.
    Neverova, N., Kokkinos, I.: Mass displacement networks. Arxiv (2017)Google Scholar
  19. 19.
    Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, E., Zafeiriou, S.: Mnemonic descent method: a recurrent process applied for end-to-end face alignment. In: Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition (2016)Google Scholar
  20. 20.
    Güler, R.A., Trigeorgis, G., Antonakos, E., Snape, P., Zafeiriou, S., Kokkinos, I.: DenseReg: fully convolutional dense shape regression in-the-wild. In: CVPR (2017)Google Scholar
  21. 21.
    Cole, F., Belanger, D., Krishnan, D., Sarna, A., Mosseri, I., Freeman, W.T.: Face synthesis from facial identity features (2018)Google Scholar
  22. 22.
    Sengupta, S., Kanazawa, A., Castillo, C.D., Jacobs, D.W.: SfSNet : learning shape, reflectance and illuminance of faces in the wild. In: CVPR (2018)Google Scholar
  23. 23.
    Hinton, G.E.: A parallel computation that assigns canonical object-based frames of reference. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, IJCAI 1981, 24–28 August 1981, Vancouver, BC, Canada, pp. 683–685(1981)Google Scholar
  24. 24.
    Olshausen, B.A., Anderson, C.H., Essen, D.C.V.: A multiscale dynamic routing circuit for forming size- and position-invariant object representations. J. Comput. Neurosci. 2(1), 45–62 (1995)CrossRefGoogle Scholar
  25. 25.
    Malsburg, C.: The correlation theory of brain function. Internal Report 81–2. Gottingen Max-Planck-Institute for Biophysical Chemistry (1981)Google Scholar
  26. 26.
    Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 44–51. Springer, Heidelberg (2011). Scholar
  27. 27.
    Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. CoRR abs/1710.09829 (2017)Google Scholar
  28. 28.
    Bristow, H., Valmadre, J., Lucey, S.: Dense semantic correspondence where every pixel is a classifier. In: ICCV (2015)Google Scholar
  29. 29.
    Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., Efros, A.A.: Learning dense correspondence via 3D-guided cycle consistency. In: CVPR (2016)Google Scholar
  30. 30.
    Gaur, U., Manjunath, B.S.: Weakly supervised manifold learning for dense semantic object correspondence. In: ICCV (2017)Google Scholar
  31. 31.
    Thewlis, J., Bilen, H., Vedaldi, A.: Unsupervised object learning from dense equivariant image labelling (2017)Google Scholar
  32. 32.
    Amit, Y., Grenander, U., Piccioni, M.: Structural image restoration through deformable templates. J. Am. Stat. Assoc. 86(414), 376–387 (1991)CrossRefGoogle Scholar
  33. 33.
    Yuille, A.L.: Deformable templates for face recognition. J. Cogn. Neurosci. 3(1), 59–70 (1991)MathSciNetCrossRefGoogle Scholar
  34. 34.
    Blanz, V.T., Vetter, T.: Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1063–1074 (2003)CrossRefGoogle Scholar
  35. 35.
    Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)Google Scholar
  36. 36.
    Afifi, M.: Gender recognition and biometric identification using a large dataset of hand images. CoRR abs/1711.04322 (2017)Google Scholar
  37. 37.
    Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 94–108. Springer, Cham (2014). Scholar
  38. 38.
    Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (2015)Google Scholar
  39. 39.
    Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)Google Scholar
  40. 40.
    Li, C., Wand, M.: Precomputed Real-time texture synthesis with Markovian generative adversarial networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 702–716. Springer, Cham (2016). Scholar
  41. 41.
    Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arxiv (2016)Google Scholar
  42. 42.
    Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning deep representation for face alignment with auxiliary attributes. IEEE Trans. Pattern Anal. Mach. Intell. 38(5), 918–930 (2016)CrossRefGoogle Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zhixin Shu
    • 1
    Email author
  • Mihir Sahasrabudhe
    • 2
  • Rıza Alp Güler
    • 2
    • 3
  • Dimitris Samaras
    • 1
  • Nikos Paragios
    • 2
    • 4
  • Iasonas Kokkinos
    • 5
    • 6
  1. 1.Stony Brook UniversityStony BrookUSA
  2. 2.CentraleSupélec, Université Paris-SaclayGif-sur-YvetteFrance
  3. 3.INRIARocquencourtFrance
  4. 4.TheraPanaceaParisFrance
  5. 5.Univeristy College LondonLondonUK
  6. 6.Facebook AI ResearchParisFrance

Personalised recommendations