X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes

  • Olivia Wiles
  • A. Sophia Koepke
  • Andrew Zisserman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)


The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing.

We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another face in a driving frame, producing a generated frame with the identity of the source face but the pose and expression of the face in the driving frame. Second, we propose a method for training the network in a fully self-supervised manner using a large collection of video data. Third, we show that the generation process can be driven by other modalities, such as audio or pose codes, without any further training of the network.
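The warping idea underlying this kind of architecture can be sketched as two sub-networks: one maps the source frame to an embedded face representation, and a second maps the driving frame to a dense sampling grid used to warp that embedded face into the output. The sketch below is a minimal, hypothetical illustration of that bilinear-sampling pattern in PyTorch; the layer sizes and network names (`TinyGridNet`, `X2FaceSketch`) are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGridNet(nn.Module):
    """Predicts a dense (x, y) sampling grid from an input image.

    Illustrative only: a real model would use a deeper encoder-decoder.
    """
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1), nn.Tanh(),  # coords in [-1, 1]
        )

    def forward(self, x):
        # grid_sample expects a grid of shape (N, H, W, 2)
        return self.conv(x).permute(0, 2, 3, 1)


class X2FaceSketch(nn.Module):
    """Source frame -> embedded face -> output warped by the driving frame."""
    def __init__(self):
        super().__init__()
        self.embedding_net = TinyGridNet()  # warps source to embedded face
        self.driving_net = TinyGridNet()    # grid from the driving frame

    def forward(self, source, driving):
        # Warp the source into a pose-normalised embedded face,
        # then sample from it according to the driving frame's grid.
        embedded = F.grid_sample(source, self.embedding_net(source),
                                 align_corners=False)
        return F.grid_sample(embedded, self.driving_net(driving),
                             align_corners=False)


model = X2FaceSketch()
src = torch.rand(1, 3, 64, 64)   # source frame (identity)
drv = torch.rand(1, 3, 64, 64)   # driving frame (pose/expression)
out = model(src, drv)
print(out.shape)
```

Because the output is produced by sampling pixels from the source rather than generating them from scratch, identity-preserving detail comes along for free; self-supervised training can then simply reconstruct a target frame of the same person from another frame of the same video.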

The generation results for driving a face with another face are compared to state-of-the-art self-supervised/supervised methods. We show that our approach is more robust than other methods, as it makes fewer assumptions about the input data. We also show examples of using our framework for video face editing.



The authors are grateful to Hadar Averbuch-Elor for helpfully running their model on our data and to Vicky Kalogeiton for suggestions/comments. This work was funded by an EPSRC studentship and EPSRC Programme Grant Seebibyte EP/M013774/1.

Supplementary material

Supplementary material 1 (avi 23473 KB)

Supplementary material 2 (mp4 51240 KB)

Supplementary material 3 (pdf 20257 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

Olivia Wiles, A. Sophia Koepke, Andrew Zisserman
Visual Geometry Group, University of Oxford, Oxford, UK
