Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem
Abstract
Two geometrical structures have been extensively studied for a manifold of probability distributions. One is based on the Fisher information metric, which is invariant under reversible transformations of random variables, while the other is based on the Wasserstein distance of optimal transportation, which reflects the structure of the distance between underlying random variables. Here, we propose a new information-geometrical theory that provides a unified framework connecting the Wasserstein distance and Kullback–Leibler (KL) divergence. We primarily considered a discrete case consisting of n elements and studied the geometry of the probability simplex \(S_{n-1}\), which is the set of all probability distributions over n elements. The Wasserstein distance was introduced in \(S_{n-1}\) by the optimal transportation of commodities from distribution \({\varvec{p}}\) to distribution \({\varvec{q}}\), where \({\varvec{p}}, {\varvec{q}} \in S_{n-1}\). We relaxed the optimal transportation by using entropy, which was introduced by Cuturi. The optimal solution was called the entropy-relaxed stochastic transportation plan. The entropy-relaxed optimal cost \(C({\varvec{p}}, {\varvec{q}})\) was computationally much less demanding than the original Wasserstein distance but does not define a distance because it is not minimized at \({\varvec{p}}={\varvec{q}}\). To define a proper divergence while retaining the computational advantage, we first introduced a divergence function in the manifold \(S_{n-1} \times S_{n-1}\) composed of all optimal transportation plans. We fully explored the information geometry of the manifold of the optimal transportation plans and subsequently constructed a new one-parameter family of divergences in \(S_{n-1}\) that are related to both the Wasserstein distance and the KL divergence.
Keywords
Wasserstein distance · Kullback–Leibler divergence · Optimal transportation · Information geometry
1 Introduction
Information geometry [1] studies the properties of a manifold of probability distributions and is useful for various applications in statistics, machine learning, signal processing, and optimization. Two geometrical structures have been introduced from two distinct backgrounds. One is based on the invariance principle, where the geometry is invariant under reversible transformations of random variables. The Fisher information matrix, for example, is a unique invariant Riemannian metric from the invariance principle [1, 2, 3]. Moreover, two dually coupled affine connections are used as invariant connections [1, 4], which are useful in various applications.
The other geometrical structure was introduced through the transportation problem, where one distribution of commodities is transported to another distribution. The minimum transportation cost defines a distance between the two distributions, which is called the Wasserstein, Kantorovich or earth-mover distance [5, 6]. This structure provides a tool to study the geometry of distributions by taking the metric of the supporting manifold into account.
Let \(\chi = \left\{ 1, \ldots , n \right\} \) be the support of a probability measure \({\varvec{p}}\). The invariant geometry provides a structure that is invariant under permutations of elements of \(\chi \) and results in an efficient estimator in statistical estimation. On the other hand, when we consider a picture over \(n^2\) pixels \(\chi = \left\{ (ij); i, j=1, \ldots , n \right\} \) and regard it as a distribution over \(\chi \), the pixels have a proper distance structure in \(\chi \). Spatially close pixels tend to take similar values. A permutation of \(\chi \) destroys such a neighboring structure, suggesting that the invariance might not play a useful role. The Wasserstein distance takes such a structure into account and is therefore useful for problems with metric structure in support \(\chi \) (see, e.g., [7, 8, 9]).
Cuturi modified the transportation problem such that the cost is minimized under an entropy constraint [7]. This is called the entropy-relaxed optimal transportation problem and is computationally less demanding than the original transportation problem. In addition to the advantage in computational cost, Cuturi showed that the quasi-distance defined by the entropy-relaxed optimal solution yields superior results in many applications compared to the original Wasserstein distance and information-geometric divergences such as the KL divergence.
We followed the entropy-relaxed framework that Cuturi et al. proposed [7, 8, 9] and introduced a Lagrangian function, which is a linear combination of the transportation cost and entropy. Given a distribution \({\varvec{p}}\) of commodity on the sender's side and \({\varvec{q}}\) on the receiver's side, the constrained optimal transportation plan is the minimizer of the Lagrangian function. The minimum value \(C({\varvec{p}}, {\varvec{q}})\) is a function of \({\varvec{p}}\) and \({\varvec{q}}\), which we called the Cuturi function. However, this does not define the distance between \({\varvec{p}}\) and \({\varvec{q}}\) because it is nonzero at \({\varvec{p}} = {\varvec{q}}\) and is not minimized when \({\varvec{p}} = {\varvec{q}}\).
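Concretely, a minimal sketch of the entropy-relaxed problem reads as follows (this follows the convention of Cuturi [7]; the placement of \(\lambda\) and the sign conventions here are assumptions and may differ in detail from the formulation used later in the paper):

```latex
% Entropy-relaxed transportation: minimize the cost minus a scaled entropy
% over all plans P with prescribed marginals p (rows) and q (columns).
\begin{aligned}
\mathbf{P}^{*}
  &= \arg\min_{\mathbf{P} \ge 0}\;
     \langle \mathbf{M}, \mathbf{P} \rangle - \frac{1}{\lambda}\, H(\mathbf{P}),
  \qquad
  H(\mathbf{P}) = -\sum_{i,j} P_{ij} \log P_{ij}, \\
  &\phantom{=}\ \text{subject to}\quad
  \sum_{j} P_{ij} = p_i, \qquad \sum_{i} P_{ij} = q_j .
\end{aligned}
```

The stationarity conditions of the Lagrangian then force the product form \(P^{*}_{ij} = a_i\, b_j\, e^{-\lambda m_{ij}}\), with \(a_i\), \(b_j\) determined by the two marginal constraints, and \(C({\varvec{p}}, {\varvec{q}})\) is the attained minimum value.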
To define a proper distance-like function in \(S_{n-1}\), we introduced a divergence between \({\varvec{p}}\) and \({\varvec{q}}\) derived from the optimal transportation plan. A divergence is a general metric concept that includes the square of a distance but is more flexible, allowing asymmetry between \({\varvec{p}}\) and \({\varvec{q}}\). A manifold equipped with a divergence yields a Riemannian metric with a pair of dual affine connections. Dually coupled geodesics are defined, which possess remarkable properties, generalizing the Riemannian geometry [1].
We studied the geometry of the entropy-relaxed optimal transportation plans within the framework of information geometry. They form an exponential family of probability distributions defined in the product manifold \(S_{n-1} \times S_{n-1}\). Therefore, a dually flat structure was introduced. The m-flat coordinates are the expectation parameters \(({\varvec{p}}, {\varvec{q}})\) and their dual, e-flat coordinates (canonical parameters), are \(({\varvec{\alpha }}, {\varvec{\beta }})\), which are assigned from the minimax duality of nonlinear optimization problems. We can naturally define a canonical divergence, that is, the KL divergence \(KL[({\varvec{p}}, {\varvec{q}}) : ({\varvec{p}'}, {\varvec{q}'})]\) between the two optimal transportation plans for \(({\varvec{p}}, {\varvec{q}})\) and \(({\varvec{p}'}, {\varvec{q}'})\), sending \({\varvec{p}}\) to \({\varvec{q}}\) and \({\varvec{p}'}\) to \({\varvec{q}'}\), respectively.
To define a divergence from \({\varvec{p}}\) to \({\varvec{q}}\) in \(S_{n-1}\), we used a reference distribution \(\varvec{r}\). Given \(\varvec{r}\), we defined a divergence between \({\varvec{p}}\) and \({\varvec{q}}\) by \(KL[({\varvec{r}}, {\varvec{p}}) : ({\varvec{r}}, {\varvec{q}})]\). There are a number of potential choices for \(\varvec{r}\): one is to use \({\varvec{r}} = {\varvec{p}}\) and another is to use the arithmetic or geometric mean of \({\varvec{p}}\) and \({\varvec{q}}\). These options yield one-parameter families of divergences connecting the Wasserstein distance and KL divergence. Our work uncovers a novel direction for studying the geometry of a manifold of probability distributions by integrating the Wasserstein distance and KL divergence.
2 Entropy-constrained transportation problem
It should be noted that if some components of \(\varvec{p}\) and \(\varvec{q}\) are allowed to be 0, we do not need to treat \(\chi _S\) and \(\chi _R\) separately, i.e., we can consider both \(\chi _S\) and \(\chi _R\) to be equal to \(\chi \). Under such a situation, we simply considered both \(\varvec{p}\) and \(\varvec{q}\) as elements of \(\bar{S}_{n-1}\).
3 Solution to the entropy-constrained problem: Cuturi function
Theorem 1
Theorem 2
The Cuturi function \(C_{\lambda } ({\varvec{p}}, {\varvec{q}})\) is a convex function of \(({\varvec{p}}, {\varvec{q}})\).
Proof
4 Geometry of optimal transportation plans
The transportation problem is related to various problems in information theory such as the rate-distortion theory. We provide detailed studies on the transportation plans in the information-geometric framework in Sect. 7, but here we introduce the manifold of the optimal transportation plans, which are determined by the sender's and receiver's probability distributions \(\varvec{p}\) and \(\varvec{q}\).
Theorem 3
The dual potential \(\varphi _{\lambda }\) is equal to the Cuturi function \(C_{\lambda }\).
Proof
We summarize the Legendre relationship below.
Theorem 4
Theorem 5
Remark 1
The \({\varvec{p}}\)-part and \({\varvec{q}}\)-part of \({\mathbf{G}}^{-1}_{\lambda }\) are equal to the corresponding Fisher information in \(S_{s-1}\) and \(S_{r-1}\) in the e-coordinate systems.
Remark 2
The \({\varvec{p}}\)-part and the \({\varvec{q}}\)-part of \({\mathbf{G}}_{\lambda }\) are not equal to the corresponding Fisher information in the m-coordinate system. This is because the \(({\varvec{p}}, {\varvec{q}})\)-part of \(\mathbf{G}\) is not 0.
5 \(\lambda \)-divergences in \(S_{n-1}\)
5.1 Derivation of \(\lambda \)-divergences
Theorem 6
Proof
1. When \(\lambda \rightarrow \infty \), \(D_{\lambda }\) converges to \(KL[{\varvec{p}}:{\varvec{q}}]\). This is because \(\mathbf{P}^*\) converges to \({\varvec{p}} \otimes {\varvec{q}}\) in the limit and we easily have
$$\begin{aligned} KL[ {\varvec{p}} \otimes {\varvec{p}}: {\varvec{p}} \otimes {\varvec{q}}] = KL[\varvec{p} : \varvec{q}]. \end{aligned}$$
(57)
2. When \(\lambda \rightarrow 0\), \(D_{\lambda }\) with \(\gamma _{\lambda } = \lambda /(1+\lambda )\) converges to 0, because \(KL \left[ \mathbf{P}_{0}^*({\varvec{p}}, {\varvec{p}}): {\mathbf{P}}^*_{0}({\varvec{p}}, {\varvec{q}})\right] \) takes a finite value (see Example 1 in the next section). This suggests that it is preferable to use a scaling factor other than \(\gamma _{\lambda }= \lambda /(1+\lambda )\) when \(\lambda \) is small. When \(\lambda =0\), \(C_\lambda =\varphi _\lambda \) is not differentiable. Hence, we cannot construct the Bregman-like divergence from \(C_0\) [Eq. (52)], as illustrated by a simple example given in Sect. 5.3.
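The identity in Eq. (57) used in the first limit follows from a direct computation, since the first marginal is shared by both product distributions:

```latex
\begin{aligned}
KL[\varvec{p} \otimes \varvec{p} : \varvec{p} \otimes \varvec{q}]
  &= \sum_{i,j} p_i p_j \log \frac{p_i p_j}{p_i q_j}
   = \sum_{i,j} p_i p_j \log \frac{p_j}{q_j} \\
  &= \Big( \sum_i p_i \Big) \sum_j p_j \log \frac{p_j}{q_j}
   = KL[\varvec{p} : \varvec{q}].
\end{aligned}
```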
5.2 Other choices of reference distribution r
5.3 Examples of \(\lambda \)-divergence
Below, we consider the case where \(\varvec{r}=\varvec{p}\). We show two simple examples, where \(D_{\lambda }({\varvec{p}}, {\varvec{q}})\) can be analytically computed.
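As a numerical complement to the analytical examples, \(D_{\lambda }\) with \(\varvec{r}=\varvec{p}\) can also be computed directly as the KL divergence between the two optimal plans. The sketch below is illustrative only: the kernel convention \(K_{ij} = e^{-m_{ij}/\lambda }\) and the omission of the scaling factor \(\gamma _{\lambda }\) are assumptions, not the paper's exact normalization.

```python
import numpy as np

def sinkhorn_plan(M, p, q, lam=1.0, n_iter=500):
    """Entropy-relaxed optimal plan via Sinkhorn scaling (cf. Sect. 7.4)."""
    K = np.exp(-M / lam)                  # unnormalized kernel
    a, b = np.ones_like(p), np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)                   # match row marginals to p
        b = q / (K.T @ a)                 # match column marginals to q
    return a[:, None] * K * b[None, :]

def lambda_divergence(M, p, q, lam=1.0):
    """KL divergence between the optimal plans sending p -> p and p -> q,
    i.e. D_lambda with reference r = p (unscaled sketch)."""
    P_pp = sinkhorn_plan(M, p, p, lam)
    P_pq = sinkhorn_plan(M, p, q, lam)
    return float(np.sum(P_pp * np.log(P_pp / P_pq)))

M = np.array([[0.0, 1.0], [1.0, 0.0]])    # toy cost matrix on 2 elements
p = np.array([0.6, 0.4])
q = np.array([0.3, 0.7])
# D(p, p) = 0 and D(p, q) > 0, as required of a divergence
```

Both plans share the first marginal \(\varvec{p}\), so the quantity vanishes exactly when \(\varvec{q}=\varvec{p}\) and is strictly positive otherwise.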
Example 1
Example 2
6 Applications of \(\lambda \)-divergence
6.1 Cluster center (barycenter)
6.2 Statistical estimation
Theorem 7
6.3 Pattern classifier
7 Information geometry of transportation plans
7.1 Free problem
7.2 Ratedistortion problem
We considered a communication channel in which \({\varvec{p}}\) is a probability distribution on the sender's terminals. The channel is noisy and \(P_{ij}/p_i\) is the probability that \(x_j\) is received when \(x_i\) is sent. The costs \(m_{ij}\) are regarded as the distortion of \(x_i\) changing to \(x_j\). The rate-distortion problem in information theory searches for \(\mathbf{P}\), which minimizes the mutual information of the sender and receiver under the constraint of distortion \(\langle \mathbf{M}, {\mathbf{P}} \rangle \). The problem is formulated by maximizing \(\varphi _{\lambda }(\mathbf{P})\) under the sender's constraint \({\varvec{p}}\), where \({\varvec{q}}\) is free (R. Belavkin, personal communication; see also [16]).
7.3 Transportation problem
Lemma
For any \({\varvec{a}}\), \({\varvec{b}}\), transformation \(T_{{\varvec{a}}{\varvec{b}}}\) does not change the interaction terms \({\varTheta }_{ij}\). Moreover, the e-geodesic connecting \({\mathbf{P}}\) and \(T_{\varvec{a b}}{} \mathbf{P}\) is orthogonal to \(M({\varvec{p, q}})\).
Proof
By calculating the mixed coordinates of \(T_{\varvec{a b}}{} \mathbf{P}\), we easily see that the \({\varTheta }\)-part does not change. Hence, the e-geodesic connecting \(\mathbf{P}\) and \(T_{\varvec{ab}}{} \mathbf{P}\) is given, in terms of the mixed coordinates, by keeping the last part fixed while changing the first part. This is included in \(E\left( {\varTheta }\right) \). Therefore, the geodesic is orthogonal to \(M(\varvec{p},\varvec{q})\). \(\square \)
Since the optimal solution is given by applying \(T_{\varvec{a b}}\) to \(\mathbf{K}\), even when \({\mathbf{{K}}}\) is not normalized, such that the terminal conditions [Eq. (4)] are satisfied, we have the following theorem:
Theorem 8
The optimal solution \(\mathbf{P}^{*}\) is given by e-projecting \(\mathbf{K}\) to \(M({\varvec{p}}, {\varvec{q}})\).
7.4 Iterative algorithm (Sinkhorn algorithm) for obtaining \(\varvec{a}\) and \(\varvec{b}\)
To obtain the optimal transportation plan, we need to calculate \(\varvec{a}\) and \(\varvec{b}\) for given \(\varvec{p}\) and \(\varvec{q}\). The Sinkhorn algorithm is well known for this purpose [5]. It is an iterative algorithm for obtaining the e-projection of \(\mathbf{K}\) to \(M({\varvec{p}}, {\varvec{q}})\).
1. Begin with \(\mathbf{P}_0=\mathbf{K}\).
2. For \(t=0, 1, 2, \ldots \), e-project \(\mathbf{P}_{2t}\) to \(M({\varvec{p}}, \cdot )\) to obtain
$$\begin{aligned} \mathbf{P}_{2t+1} = T_{A \cdot } \mathbf{P}_{2t}. \end{aligned}$$
(121)
3. To obtain \(\mathbf{P}_{2t+2}\), e-project \({\mathbf{P}}_{2t+1}\) to \(M(\cdot , {\varvec{q}})\),
$$\begin{aligned} \mathbf{P}_{2t+2} = T_{\cdot B} \mathbf{P}_{2t+1}. \end{aligned}$$
(122)
4. Repeat until convergence.
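The alternating e-projections above amount to alternately rescaling the rows and columns of \(\mathbf{K}\). A minimal NumPy sketch (the kernel convention \(K_{ij} = e^{-m_{ij}/\lambda }\) and the fixed iteration count are assumptions for illustration):

```python
import numpy as np

def sinkhorn(M, p, q, lam=1.0, n_iter=200):
    """Alternating e-projection of K onto M(p, q).

    M : (n, n) cost matrix; p, q : marginal distributions.
    Rescaling rows to match p is the e-projection onto M(p, .) in Eq. (121);
    rescaling columns to match q is the e-projection onto M(., q) in Eq. (122).
    """
    K = np.exp(-M / lam)          # P_0 = K (unnormalized)
    a = np.ones_like(p)
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)           # step 2: fix row marginals
        b = q / (K.T @ a)         # step 3: fix column marginals
    return a[:, None] * K * b[None, :]

# tiny example: after convergence the plan has marginals p and q
M = np.array([[0.0, 1.0], [1.0, 0.0]])
p = np.array([0.7, 0.3])
q = np.array([0.4, 0.6])
P = sinkhorn(M, p, q)
```

Each half-step solves one marginal constraint exactly while leaving the \({\varTheta }\)-part of \(\mathbf{K}\) untouched, which is why the iteration stays on the e-geodesic family described in the Lemma.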
Figure 6 schematically illustrates the iterative eprojection algorithm for finding the optimal solution \(\mathbf{P}^{*}\).
8 Conclusions and additional remarks

1. Uniqueness of the optimal plan
The original Wasserstein distance is obtained by solving a linear programming problem. Hence, the solution is not unique in some cases and is not necessarily a continuous function of \({\mathbf{M}}\). However, the entropy-constrained solution is unique and continuous with respect to \(\mathbf{M}\) [7]. While \(\varphi _{\lambda }({\varvec{p}}, {\varvec{q}})\) converges to \(\varphi _0({\varvec{p}}, {\varvec{q}})\) as \(\lambda \rightarrow 0\), \(\varphi _0({\varvec{p}}, {\varvec{q}})\) is not necessarily differentiable with respect to \({\varvec{p}}\) and \({\varvec{q}}\).

2. Integrated information theory of consciousness
Given a joint probability distribution \(\mathbf{P}\), the amount of integrated information is measured by the amount of interactions of information among different terminals. We used a disconnected model in which no information is transferred through branches connecting different terminals. The geometric measure of integrated information theory is given by the KL divergence from \(\mathbf{P}\) to the submanifold of disconnected models [13, 14]. However, the Wasserstein divergence can be considered as such a measure when the cost of transferring information through different terminals depends on the physical positions of the terminals [15]. We can use the entropy-constrained divergence \(D_{\lambda }\) to define the amount of information integration.

3. \(f\)-divergence
We used the KL divergence in a dually flat manifold for defining \(D_{\lambda }\). It is possible to use any other divergences, for example, the \(f\)-divergence instead of the KL divergence. We would obtain similar results.

4. \(q\)-entropy
Muzellec et al. used the \(\alpha \)-entropy (Tsallis \(q\)-entropy) instead of the Shannon entropy for regularization [16]. This yields the \(q\)-entropy-relaxed framework.

5. Comparison of \(C_{\lambda }\) and \(D_{\lambda }\)
Although \(D_{\lambda }\) satisfies the criterion of a divergence, it might differ considerably from the original \(C_{\lambda }\). In particular, when \(C_\lambda ({\varvec{p}}, {\varvec{q}})\) includes a piecewise linear term such as \(\sum d_i |p_i - q_i|\) for constant \(d_i\), \(D_\lambda \) defined in Eq. (52) eliminates this term. When this term is important, we can use \(\{ C_\lambda ({\varvec{p}}, {\varvec{q}}) \}^2\) instead of \(C_\lambda ({\varvec{p}}, {\varvec{q}})\) for defining a new divergence \(D_\lambda \) in Eq. (52). In our accompanying paper [17], we define a new type of divergence that retains the properties of \(C_{\lambda }\) and is closer to \(C_{\lambda }\).
References
1. Amari, S.: Information Geometry and Its Applications. Springer, Japan (2016)
2. Chentsov, N.N.: Statistical Decision Rules and Optimal Inference. Nauka (1972) (translated in English, AMS (1982))
3. Rao, C.R.: Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91 (1945)
4. Ay, N., Jost, J., Le, H.V., Schwachhöfer, L.: Information Geometry. Springer, Cham (2017)
5. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhauser, Basel (2015)
6. Villani, C.: Topics in Optimal Transportation. Graduate Studies in Math. AMS, Providence (2013)
7. Cuturi, M.: Sinkhorn distances: light speed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)
8. Cuturi, M., Avis, D.: Ground metric learning. J. Mach. Learn. Res. 15, 533–564 (2014)
9. Cuturi, M., Peyré, G.: A smoothed dual formulation for variational Wasserstein problems. SIAM J. Imaging Sci. 9, 320–343 (2016)
10. Montavon, G., Muller, K., Cuturi, M.: Wasserstein training for Boltzmann machines (2015). arXiv:1507.01972v1
11. Belavkin, R.V.: Optimal measures and Markov transition kernels. J. Glob. Optim. 55, 387–416 (2013)
12. Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 35, 876–879 (1964)
13. Oizumi, M., Tsuchiya, N., Amari, S.: Unified framework for information integration based on information geometry. Proc. Natl. Acad. Sci. 113, 14817–14822 (2016)
14. Amari, S., Tsuchiya, N., Oizumi, M.: Geometry of information integration (2017). arXiv:1709.02050
15. Oizumi, M., Albantakis, L., Tononi, G.: From the phenomenology to the mechanisms of consciousness: integrated information theory 3.0. PLoS Comput. Biol. 10, e1003588 (2014)
16. Muzellec, B., Nock, R., Patrini, G., Nielsen, F.: Tsallis regularized optimal transport and ecological inference (2016). arXiv:1609.04495v1
17. Amari, S., Karakida, R., Oizumi, M., Cuturi, M.: New divergence derived from Cuturi function (in preparation)
Copyright information
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.