The Information Geometry of Mirror Descent

  • Garvesh Raskutti
  • Sayan Mukherjee
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9389)


We prove the equivalence of two online learning algorithms, mirror descent and natural gradient descent. Both mirror descent and natural gradient descent are generalizations of online gradient descent when the parameter of interest lies on a non-Euclidean manifold. Natural gradient descent selects the steepest descent direction along a Riemannian manifold by multiplying the standard gradient by the inverse of the metric tensor. Mirror descent induces non-Euclidean structure by solving iterative optimization problems using different proximity functions. In this paper, we prove that mirror descent induced by a Bregman divergence proximity function is equivalent to the natural gradient descent algorithm on the Riemannian manifold in the dual co-ordinate system. We use techniques from convex analysis and connections between Riemannian manifolds, Bregman divergences and convexity to prove this result. This equivalence between natural gradient descent and mirror descent implies that (1) mirror descent follows the steepest descent direction along the Riemannian manifold corresponding to the choice of Bregman divergence and (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound.
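The claimed equivalence can be checked numerically on a small example. The sketch below is not taken from the paper; it assumes a particular convex function G(theta) = sum(theta * log(theta) - theta) on the positive orthant, whose conjugate is H(mu) = sum(exp(mu)), and a hypothetical least-squares loss. The function names and the problem data are illustrative choices. Under these assumptions, the mirror descent iterate and the natural gradient iterate (taken in the dual co-ordinates mu = grad G(theta), with metric tensor given by the Hessian of H) should coincide up to floating-point rounding.

```python
import numpy as np

# Assumed setup (illustrative, not from the paper):
#   G(theta) = sum(theta * log(theta) - theta), so grad G(theta) = log(theta)
#   H(mu)    = sum(exp(mu)) is the conjugate of G, with grad H(mu) = exp(mu)
#   and Hessian of H equal to diag(exp(mu)), used here as the metric tensor.

def grad_f(theta, A, b):
    """Gradient of the hypothetical loss f(theta) = 0.5 * ||A theta - b||^2."""
    return A.T @ (A @ theta - b)

def mirror_descent_step(theta, step, A, b):
    """Mirror step: grad G(theta_next) = grad G(theta) - step * grad f(theta)."""
    return np.exp(np.log(theta) - step * grad_f(theta, A, b))

def natural_gradient_step(mu, step, A, b):
    """Natural gradient step in dual co-ordinates mu with metric Hess H(mu)."""
    theta = np.exp(mu)                      # primal point, theta = grad H(mu)
    metric = np.diag(np.exp(mu))            # metric tensor, Hess H(mu)
    grad_mu = metric @ grad_f(theta, A, b)  # chain rule for mu -> f(grad H(mu))
    # Multiply the standard gradient by the inverse of the metric tensor.
    return mu - step * np.linalg.solve(metric, grad_mu)

def run_both(num_steps=20, step=0.01, seed=0):
    """Run both methods from the same start and return the final primal iterates."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(5, 3))
    b = rng.normal(size=5)
    theta = np.ones(3)        # mirror descent iterate (primal co-ordinates)
    mu = np.log(theta)        # natural gradient iterate (dual co-ordinates)
    for _ in range(num_steps):
        theta = mirror_descent_step(theta, step, A, b)
        mu = natural_gradient_step(mu, step, A, b)
    return theta, np.exp(mu)  # map the dual iterate back to primal co-ordinates

theta_md, theta_ngd = run_both()
print("max abs difference:", np.max(np.abs(theta_md - theta_ngd)))
```

The metric is diagonal for this choice of G, so the linear solve is trivial; it is written as a general solve only to mirror the "inverse of the metric tensor" description, and a non-separable G would make that step do real work.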


Keywords (added by machine, not by the authors): Riemannian Manifold, Gradient Descent, Steep Descent, Exponential Family, Natural Gradient



GR was partially supported by the NSF under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. SM was supported by grants: NIH (Systems Biology): 5P50-GM081883, AFOSR: FA9550-10-1-0436, and NSF CCF-1049290.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Statistics and Computer Science, University of Wisconsin-Madison, Madison, USA
  2. Wisconsin Institute of Discovery, Optimization Group, Madison, USA
  3. Departments of Statistical Science, Computer Science, and Mathematics, Duke University, Durham, USA
  4. Institute for Genome Sciences & Policy, Duke University, Durham, USA
