Geometrical Insights for Implicit Generative Modeling

  • Leon Bottou
  • Martin Arjovsky
  • David Lopez-Paz
  • Maxime Oquab
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11100)


Abstract

Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.
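All three criteria named above compare the data distribution with the model distribution through samples. As an illustrative sketch only (one-dimensional samples, arbitrary distribution parameters, and an assumed Gaussian kernel bandwidth for the MMD; none of these choices come from the chapter itself), the standard empirical estimators can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1-D sample sets standing in for "data" and "model" samples.
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.5, 1.0, size=500)

def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between equal-size 1-D samples.
    In one dimension the optimal transport plan matches sorted samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def energy_distance(a, b):
    """Empirical Energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    d_xy = np.abs(a[:, None] - b[None, :]).mean()
    d_xx = np.abs(a[:, None] - a[None, :]).mean()
    d_yy = np.abs(b[:, None] - b[None, :]).mean()
    return 2.0 * d_xy - d_xx - d_yy

def mmd2_gaussian(a, b, sigma=1.0):
    """Squared MMD with Gaussian kernel k(u,v) = exp(-(u-v)^2 / (2 sigma^2)).
    The bandwidth sigma=1.0 is an arbitrary choice for illustration."""
    k = lambda u, v: np.exp(-((u[:, None] - v[None, :]) ** 2) / (2.0 * sigma**2))
    return k(a, a).mean() + k(b, b).mean() - 2.0 * k(a, b).mean()
```

Each estimator vanishes when both sample sets coincide and grows with the mismatch between the two distributions; the chapter's point is that, although all three are valid discrepancy measures, the geometries they induce on the space of probability measures differ in ways that matter for optimization.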



Acknowledgements

We would like to thank Joan Bruna, Marco Cuturi, Arthur Gretton, Yann Ollivier, and Arthur Szlam for stimulating discussions and also for pointing out numerous related works.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Leon Bottou, Facebook AI Research, New York, USA
  • Martin Arjovsky, New York University, New York, USA
  • David Lopez-Paz, Facebook AI Research, Paris, France
  • Maxime Oquab, Inria, Paris, France
