The Information Geometry of Mirror Descent
We prove the equivalence of two online learning algorithms: mirror descent and natural gradient descent. Both generalize online gradient descent to the setting where the parameter of interest lies on a non-Euclidean manifold. Natural gradient descent selects the steepest descent direction along a Riemannian manifold by multiplying the standard gradient by the inverse of the metric tensor. Mirror descent induces non-Euclidean structure by solving iterative optimization problems with different proximity functions. In this paper, we prove that mirror descent with a Bregman divergence as its proximity function is equivalent to natural gradient descent on the Riemannian manifold in the dual coordinate system. The proof uses techniques from convex analysis together with connections between Riemannian manifolds, Bregman divergences, and convexity. This equivalence implies that (1) mirror descent moves in the steepest descent direction along the Riemannian manifold corresponding to the chosen Bregman divergence, and (2) mirror descent with the log-likelihood loss, applied to parameter estimation in exponential families, asymptotically achieves the classical Cramér-Rao lower bound.
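To make the equivalence concrete, here is a brief worked sketch; the notation ($G$, $H$, $f$, $\alpha_t$) is ours, chosen for illustration, and the paper's precise statement includes regularity conditions on the potential. Let $G$ be a strictly convex, twice-differentiable potential with convex conjugate $H = G^*$, inducing the Bregman divergence
$$B_G(\theta, \theta') = G(\theta) - G(\theta') - \langle \nabla G(\theta'), \theta - \theta' \rangle.$$
Mirror descent on a loss $f$ takes the step
$$\theta_{t+1} = \arg\min_{\theta} \Big\{ \langle \nabla f(\theta_t), \theta \rangle + \tfrac{1}{\alpha_t} B_G(\theta, \theta_t) \Big\},$$
whose first-order optimality condition is $\nabla G(\theta_{t+1}) = \nabla G(\theta_t) - \alpha_t \nabla f(\theta_t)$. In the dual coordinates $\mu = \nabla G(\theta)$, with inverse map $\theta = \nabla H(\mu)$, the chain rule gives $\nabla_\mu f(\theta(\mu)) = \nabla^2 H(\mu)\, \nabla_\theta f(\theta)$, so the same update reads
$$\mu_{t+1} = \mu_t - \alpha_t \big[ \nabla^2 H(\mu_t) \big]^{-1} \nabla_\mu f\big(\theta(\mu_t)\big),$$
which is natural gradient descent with respect to the Riemannian metric $\nabla^2 H(\mu)$. For an exponential family with log-partition function $G$, this metric is the Fisher information in the mean (dual) parametrization, which is the link to the Cramér-Rao bound.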
Keywords: Riemannian Manifold · Gradient Descent · Steepest Descent · Exponential Family · Natural Gradient
GR was partially supported by the NSF under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. SM was supported by grants: NIH (Systems Biology): 5P50-GM081883, AFOSR: FA9550-10-1-0436, and NSF CCF-1049290.