Information geometry connecting Wasserstein distance and Kullback–Leibler divergence via the entropy-relaxed transportation problem
Abstract
Two geometrical structures have been extensively studied for a manifold of probability distributions. One is based on the Fisher information metric, which is invariant under reversible transformations of random variables, while the other is based on the Wasserstein distance of optimal transportation, which reflects the distance structure of the underlying sample space. Here, we propose a new information-geometrical theory that provides a unified framework connecting the Wasserstein distance and Kullback–Leibler (KL) divergence. We primarily considered a discrete case consisting of n elements and studied the geometry of the probability simplex \(S_{n-1}\), which is the set of all probability distributions over n elements. The Wasserstein distance was introduced in \(S_{n-1}\) by the optimal transportation of commodities from distribution \({\varvec{p}}\) to distribution \({\varvec{q}}\), where \({\varvec{p}}, {\varvec{q}} \in S_{n-1}\). Following Cuturi, we relaxed the optimal transportation problem by adding an entropy term. The optimal solution is called the entropy-relaxed stochastic transportation plan. The entropy-relaxed optimal cost \(C({\varvec{p}}, {\varvec{q}})\) is computationally much less demanding than the original Wasserstein distance but does not define a distance because it is not minimized at \({\varvec{p}}={\varvec{q}}\). To define a proper divergence while retaining the computational advantage, we first introduced a divergence function in the manifold \(S_{n-1} \times S_{n-1}\) composed of all optimal transportation plans. We fully explored the information geometry of the manifold of optimal transportation plans and subsequently constructed a new one-parameter family of divergences in \(S_{n-1}\) that are related to both the Wasserstein distance and the KL-divergence.
Keywords
Wasserstein distance · Kullback–Leibler divergence · Optimal transportation · Information geometry

1 Introduction
Information geometry [1] studies the properties of a manifold of probability distributions and is useful for various applications in statistics, machine learning, signal processing, and optimization. Two geometrical structures have been introduced from two distinct backgrounds. One is based on the invariance principle, where the geometry is invariant under reversible transformations of random variables. For example, the Fisher information matrix is the unique invariant Riemannian metric that follows from the invariance principle [1, 2, 3]. Moreover, two dually coupled affine connections are used as invariant connections [1, 4], which are useful in various applications.
The other geometrical structure was introduced through the transportation problem, where one distribution of commodities is transported to another distribution. The minimum transportation cost defines a distance between the two distributions, which is called the Wasserstein, Kantorovich or earth-mover distance [5, 6]. This structure provides a tool to study the geometry of distributions by taking the metric of the supporting manifold into account.
Let \(\chi = \left\{ 1, \ldots , n \right\} \) be the support of a probability measure \({\varvec{p}}\). The invariant geometry provides a structure that is invariant under permutations of the elements of \(\chi \) and results in an efficient estimator in statistical estimation. On the other hand, when we consider a picture over \(n^2\) pixels \(\chi = \left\{ (ij);\; i, j=1, \ldots , n \right\} \) and regard it as a distribution over \(\chi \), the pixels have a proper distance structure in \(\chi \), and spatially close pixels tend to take similar values. A permutation of \(\chi \) destroys this neighborhood structure, suggesting that invariance might not play a useful role here. The Wasserstein distance takes such a structure into account and is therefore useful for problems with a metric structure in the support \(\chi \) (see, e.g., [7, 8, 9]), as the small example below illustrates.
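To make this concrete, here is a small illustrative sketch (ours, not from the paper) with point masses on a one-dimensional grid: the transport cost grows with how far the mass must move, whereas any permutation-invariant divergence cannot distinguish a one-pixel shift from a nine-pixel shift.

```python
import numpy as np

# Ground cost on a 1-D grid of n pixels: m_ij = |i - j|.
n = 10
M = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)

# p puts all mass on pixel 0; q shifts it to pixel 1, q2 to pixel 9.
p, q, q2 = np.eye(n)[0], np.eye(n)[1], np.eye(n)[9]

# For point masses the transportation plan is forced (P = p (x) q), so
# the transport cost is the ground distance between the occupied pixels.
print(p @ M @ q, p @ M @ q2)   # 1.0 vs 9.0: the metric of chi matters

# A permutation-invariant divergence (KL, etc.) cannot make this
# distinction: q2 is merely a relabeling (permutation) of q.
```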
Cuturi modified the transportation problem such that the cost is minimized under an entropy constraint [7]. This is called the entropy-relaxed optimal transportation problem and is computationally less demanding than the original transportation problem. In addition to the advantage in computational cost, Cuturi showed that the quasi-distance defined by the entropy-relaxed optimal solution yields superior results in many applications compared to the original Wasserstein distance and information-geometric divergences such as the KL divergence.
We followed the entropy-relaxed framework that Cuturi et al. proposed [7, 8, 9] and introduced a Lagrangian function, which is a linear combination of the transportation cost and entropy. Given a distribution \({\varvec{p}}\) of commodities on the sender's side and \({\varvec{q}}\) on the receiver's side, the constrained optimal transportation plan is the minimizer of the Lagrangian function. The minimum value \(C({\varvec{p}}, {\varvec{q}})\) is a function of \({\varvec{p}}\) and \({\varvec{q}}\), which we called the Cuturi function. However, it does not define a distance between \({\varvec{p}}\) and \({\varvec{q}}\), because it is non-zero at \({\varvec{p}} = {\varvec{q}}\) and is not minimized when \({\varvec{p}} = {\varvec{q}}\).
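For concreteness, in Cuturi's original formulation [7] the relaxed plan minimizes the transportation cost penalized by the negative entropy (the precise weighting by \(\lambda \) used in the present paper may be scaled differently):
$$\begin{aligned} \mathbf{P}^{*} = \mathop {\mathrm {arg\,min}}\limits _{\mathbf{P} \in U({\varvec{p}}, {\varvec{q}})} \left\{ \langle \mathbf{M}, \mathbf{P} \rangle - \frac{1}{\lambda } H(\mathbf{P}) \right\} , \qquad H(\mathbf{P}) = -\sum _{i,j} P_{ij} \log P_{ij}, \end{aligned}$$
where \(U({\varvec{p}}, {\varvec{q}})\) denotes the transportation polytope of joint distributions with marginals \({\varvec{p}}\) and \({\varvec{q}}\), and \(\mathbf{M} = (m_{ij})\) is the cost matrix. The Cuturi function \(C({\varvec{p}}, {\varvec{q}})\) is the optimal value of this objective.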
To define a proper distance-like function in \(S_{n-1}\), we introduced a divergence between \({\varvec{p}}\) and \({\varvec{q}}\) derived from the optimal transportation plan. A divergence is a general metric concept that includes the squared distance but is more flexible, allowing asymmetry between \({\varvec{p}}\) and \({\varvec{q}}\). A manifold equipped with a divergence yields a Riemannian metric together with a pair of dual affine connections. Dually coupled geodesics are then defined, which possess remarkable properties generalizing Riemannian geometry [1].
We studied the geometry of the entropy-relaxed optimal transportation plans within the framework of information geometry. They form an exponential family of probability distributions defined on the product manifold \(S_{n-1} \times S_{n-1}\). Therefore, a dually flat structure is introduced. The m-flat coordinates are the expectation parameters \(({\varvec{p}}, {\varvec{q}})\), and their duals, the e-flat coordinates (canonical parameters), are \(({\varvec{\alpha }}, {\varvec{\beta }})\), which are assigned from the minimax duality of nonlinear optimization problems. A canonical divergence is then naturally defined, namely the KL-divergence \(KL[({\varvec{p}}, {\varvec{q}}) : ({\varvec{p}'}, {\varvec{q}'})]\) between the two optimal transportation plans for \(({\varvec{p}}, {\varvec{q}})\) and \(({\varvec{p}'}, {\varvec{q}'})\), sending \({\varvec{p}}\) to \({\varvec{q}}\) and \({\varvec{p}'}\) to \({\varvec{q}'}\), respectively.
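Written explicitly in terms of the two optimal plans \(\mathbf{P}^{*} = \mathbf{P}^{*}({\varvec{p}}, {\varvec{q}})\) and \(\mathbf{P}'^{*} = \mathbf{P}^{*}({\varvec{p}'}, {\varvec{q}'})\) (our notation), this canonical divergence is the ordinary KL-divergence between the plans regarded as joint distributions:
$$\begin{aligned} KL[({\varvec{p}}, {\varvec{q}}) : ({\varvec{p}'}, {\varvec{q}'})] = \sum _{i,j} P^{*}_{ij} \log \frac{P^{*}_{ij}}{P'^{*}_{ij}}. \end{aligned}$$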
To define a divergence from \({\varvec{p}}\) to \({\varvec{q}}\) in \(S_{n-1}\), we used a reference distribution \(\varvec{r}\). Given \(\varvec{r}\), we defined a divergence between \({\varvec{p}}\) and \({\varvec{q}}\) by \(KL[({\varvec{r}}, {\varvec{p}}) : ({\varvec{r}}, {\varvec{q}})]\). There are a number of potential choices for \(\varvec{r}\): one is to use \({\varvec{r}} = {\varvec{p}}\) and another is to use the arithmetic or geometric mean of \({\varvec{p}}\) and \({\varvec{q}}\). These options yield one-parameter families of divergences connecting the Wasserstein distance and KL-divergence. Our work uncovers a novel direction for studying the geometry of a manifold of probability distributions by integrating the Wasserstein distance and KL divergence.
2 Entropy-constrained transportation problem
Fig. 1 Transportation from the sending terminals \(\chi _S\) to the receiving terminals \(\chi _R\)
It should be noted that if some components of \(\varvec{p}\) and \(\varvec{q}\) are allowed to be 0, we do not need to treat \(\chi _S\) and \(\chi _R\) separately, i.e., we can consider both \(\chi _S\) and \(\chi _R\) to be equal to \(\chi \). Under such a situation, we simply considered both \(\varvec{p}\) and \(\varvec{q}\) as elements of \(\bar{S}_{n-1}\).
3 Solution to the entropy-constrained problem: Cuturi function
Theorem 1
Theorem 2
The Cuturi function \(C_{\lambda } ({\varvec{p}}, {\varvec{q}})\) is a convex function of \(({\varvec{p}}, {\varvec{q}})\).
Proof
4 Geometry of optimal transportation plans
The transportation problem is related to various problems in information theory, such as rate-distortion theory. We provide detailed studies of transportation plans in the information-geometric framework in Sect. 7, but here we introduce the manifold of optimal transportation plans, which are determined by the sender's and receiver's probability distributions \(\varvec{p}\) and \(\varvec{q}\).
Theorem 3
The dual potential \(\varphi _{\lambda }\) is equal to the Cuturi function \(C_{\lambda }\).
Proof
We summarize the Legendre relationship below.
Theorem 4
Theorem 5
Remark 1
The \({\varvec{p}}\)-part and \({\varvec{q}}\)-part of \({\mathbf{G}}^{-1}_{\lambda }\) are equal to the corresponding Fisher information in \(S_{s-1}\) and \(S_{r-1}\) in the e-coordinate systems.
Remark 2
The \({\varvec{p}}\)-part and the \({\varvec{q}}\)-part of \({\mathbf{G}}_{\lambda }\) are not equal to the corresponding Fisher information in the m-coordinate system. This is because the \(({\varvec{p}}, {\varvec{q}})\)-part of \({\mathbf{G}}_{\lambda }\) is not 0.
5 \(\lambda \)-divergences in \(S_{n-1}\)
5.1 Derivation of \(\lambda \)-divergences
Theorem 6
Proof
- 1. When \(\lambda \rightarrow \infty \), \(D_{\lambda }\) converges to \(KL[{\varvec{p}}:{\varvec{q}}]\). This is because \(\mathbf{P}^*\) converges to \({\varvec{p}} \otimes {\varvec{q}}\) in the limit, and we easily have (a one-line verification follows this list)
$$\begin{aligned} KL[ {\varvec{p}} \otimes {\varvec{p}} : {\varvec{p}} \otimes {\varvec{q}}] = KL[{\varvec{p}} : {\varvec{q}}]. \end{aligned}$$
(57)
- 2. When \(\lambda \rightarrow 0\), \(D_{\lambda }\) with \(\gamma _{\lambda } = \lambda /(1+\lambda )\) converges to 0, because \(KL \left[ \mathbf{P}_{0}^*({\varvec{p}}, {\varvec{p}}) : {\mathbf{P}}^*_{0}({\varvec{p}}, {\varvec{q}})\right] \) takes a finite value (see Example 1 in the next section). This suggests that it is preferable to use a scaling factor other than \(\gamma _{\lambda }= \lambda /(1+\lambda )\) when \(\lambda \) is small. When \(\lambda =0\), \(C_\lambda =\varphi _\lambda \) is not differentiable. Hence, we cannot construct the Bregman-like divergence from \(C_0\) [Eq. (52)], as shown in a simple example in Sect. 5.3.
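The identity in Eq. (57) follows by direct computation, since the common factor \(p_i\) cancels inside the logarithm:
$$\begin{aligned} KL[{\varvec{p}} \otimes {\varvec{p}} : {\varvec{p}} \otimes {\varvec{q}}] = \sum _{i,j} p_i p_j \log \frac{p_i p_j}{p_i q_j} = \sum _{j} p_j \log \frac{p_j}{q_j} = KL[{\varvec{p}} : {\varvec{q}}]. \end{aligned}$$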
5.2 Other choices of the reference distribution \({\varvec{r}}\)
5.3 Examples of \(\lambda \)-divergence
Below, we consider the case where \(\varvec{r}=\varvec{p}\) and show two simple examples in which \(D_{\lambda }({\varvec{p}}, {\varvec{q}})\) can be computed analytically.
Example 1
Example 2
6 Applications of \(\lambda \)-divergence
6.1 Cluster center (barycenter)
6.2 Statistical estimation
Theorem 7
6.3 Pattern classifier
\(\lambda \)-separating hyperplane
7 Information geometry of transportation plans
7.1 Free problem
7.2 Rate-distortion problem
e-projection in the rate-distortion problem
We considered a communication channel in which \({\varvec{p}}\) is a probability distribution on the sender's terminals. The channel is noisy, and \(P_{ij}/p_i\) is the probability that \(x_j\) is received when \(x_i\) is sent. The costs \(m_{ij}\) are regarded as the distortion incurred when \(x_i\) is changed to \(x_j\). The rate-distortion problem of information theory searches for the \(\mathbf{P}\) that minimizes the mutual information between sender and receiver under a constraint on the distortion \(\langle \mathbf{M}, {\mathbf{P}} \rangle \). The problem is formulated as maximizing \(\varphi _{\lambda }(\mathbf{P})\) under the sender's constraint \({\varvec{p}}\), where \({\varvec{q}}\) is free (R. Belavkin, personal communication; see also [11]).
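For comparison (this is not the paper's algorithm), the classical Blahut–Arimoto iteration solves exactly this minimum-mutual-information problem by alternating two closed-form updates. A minimal sketch, where the Lagrange multiplier s (our notation) trades rate against distortion:

```python
import numpy as np

def blahut_arimoto(p, M, s, n_iter=500):
    """Minimize mutual information under a distortion constraint <M, P>
    via Blahut-Arimoto: alternate channel and output-marginal updates.
    p : sender distribution p_i;  M : distortion matrix m_ij."""
    n = len(p)
    q = np.full(n, 1.0 / n)            # output marginal, start uniform
    for _ in range(n_iter):
        # conditional channel Q_{j|i} proportional to q_j exp(-s m_ij)
        W = q * np.exp(-s * M)         # rows indexed by i, columns by j
        W /= W.sum(axis=1, keepdims=True)
        q = p @ W                      # update the output marginal
    return p[:, None] * W              # joint plan P_ij = p_i Q_{j|i}

p = np.array([0.5, 0.3, 0.2])
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
P = blahut_arimoto(p, M, s=2.0)
print(P.sum(axis=1))   # row marginals reproduce p; columns remain free
```

Note the structural parallel to the transportation problem: here only the sender's marginal is constrained, while the receiver's marginal \({\varvec{q}}\) adjusts itself during the iteration.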
7.3 Transportation problem
m-flat submanifolds in the transportation problem
Orthogonal foliations of \(S_{n^2-1}\) with the mixed coordinates
Lemma
For any \({\varvec{a}}\), \({\varvec{b}}\), transformation \(T_{{\varvec{a}}{\varvec{b}}}\) does not change the interaction terms \({\varTheta }_{ij}\). Moreover, the e-geodesic connecting \({\mathbf{P}}\) and \(T_{\varvec{a b}}{} \mathbf{P}\) is orthogonal to \(M({\varvec{p, q}})\).
Proof
By calculating the mixed coordinates of \(T_{\varvec{ab}} \mathbf{P}\), we easily see that the \({\varTheta }\)-part does not change. Hence, the e-geodesic connecting \(\mathbf{P}\) and \(T_{\varvec{ab}} \mathbf{P}\) is given, in terms of the mixed coordinates, by keeping the last part fixed while changing the first part. This is included in \(E\left( {\varTheta }\right) \). Therefore, the geodesic is orthogonal to \(M(\varvec{p},\varvec{q})\). \(\square \)
Since the optimal solution is obtained by applying \(T_{\varvec{ab}}\) to \(\mathbf{K}\) (even when \(\mathbf{K}\) is not normalized) so that the terminal conditions [Eq. (4)] are satisfied, we have the following theorem:
Theorem 8
The optimal solution \(\mathbf{P}^{*}\) is given by e-projecting \(\mathbf{K}\) to \(M({\varvec{p}}, {\varvec{q}})\).
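In coordinates this e-projection is a diagonal scaling. A sketch under the standard assumption, consistent with the scalings \(T_{A \cdot }\) and \(T_{\cdot B}\) below, that \(T_{\varvec{ab}}\) multiplies rows by \(a_i\) and columns by \(b_j\):
$$\begin{aligned} P^{*}_{ij} = c\, a_i b_j K_{ij}, \end{aligned}$$
where \({\varvec{a}}\) and \({\varvec{b}}\) are chosen so that the row sums of \(\mathbf{P}^{*}\) equal \({\varvec{p}}\) and the column sums equal \({\varvec{q}}\), and \(c\) normalizes the total mass. Such row and column rescalings leave the interaction part \({\varTheta }\) unchanged, which is why the projection moves along an e-geodesic orthogonal to \(M({\varvec{p}}, {\varvec{q}})\).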
7.4 Iterative algorithm (Sinkhorn algorithm) for obtaining \({\varvec{a}}\) and \({\varvec{b}}\)
To obtain the optimal transportation plan, we need to calculate \(\varvec{a}\) and \(\varvec{b}\) from the given \(\varvec{p}\) and \(\varvec{q}\). The Sinkhorn algorithm is well known for this purpose [5]. It is an iterative algorithm that computes the e-projection of \(\mathbf{K}\) to \(M({\varvec{p}}, {\varvec{q}})\); a runnable sketch follows the steps below.
- 1. Begin with \(\mathbf{P}_0=\mathbf{K}\).
- 2. For \(t=0, 1, 2, \ldots \), e-project \(\mathbf{P}_{2t}\) to \(M({\varvec{p}}, \cdot )\) to obtain
$$\begin{aligned} \mathbf{P}_{2t+1} = T_{A \cdot } \mathbf{P}_{2t}. \end{aligned}$$
(121)
- 3. To obtain \(\mathbf{P}_{2t+2}\), e-project \({\mathbf{P}}_{2t+1}\) to \(M(\cdot , {\varvec{q}})\),
$$\begin{aligned} \mathbf{P}_{2t+2} = T_{\cdot B} \mathbf{P}_{2t+1}. \end{aligned}$$
(122)
- 4. Repeat until convergence.
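A minimal runnable sketch of these alternating e-projections (ours, not the paper's code), assuming the Gibbs-kernel form \(K_{ij} \propto \exp (-\lambda m_{ij})\); the sign and scale conventions for \(\lambda \) may differ from those in the paper:

```python
import numpy as np

def sinkhorn(p, q, M, lam, n_iter=1000):
    """Alternating e-projections (Sinkhorn iterations): start from the
    kernel K and rescale rows/columns until the marginals match p, q."""
    K = np.exp(-lam * M)    # assumed kernel; conventions for lam vary
    P = K / K.sum()         # step 1: begin with (normalized) K
    for _ in range(n_iter):
        P *= (p / P.sum(axis=1))[:, None]   # T_{A.}: e-project to M(p, .)
        P *= (q / P.sum(axis=0))[None, :]   # T_{.B}: e-project to M(., q)
    return P

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
P = sinkhorn(p, q, M, lam=5.0)
print(P.sum(axis=1), P.sum(axis=0))   # both converge to p and q
```

In practice the updates are usually carried out on \(\varvec{a}\) and \(\varvec{b}\) directly, keeping \(\mathbf{P} = c\, \mathrm{diag}({\varvec{a}})\, \mathbf{K}\, \mathrm{diag}({\varvec{b}})\), often in the log domain for numerical stability.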
Sinkhorn algorithm as iterative e-projections
Figure 6 schematically illustrates the iterative e-projection algorithm for finding the optimal solution \(\mathbf{P}^{*}\).
8 Conclusions and additional remarks
- 1. Uniqueness of the optimal plan
The original Wasserstein distance is obtained by solving a linear programming problem. Hence, the solution is not unique in some cases and is not necessarily a continuous function of \({\mathbf{M}}\). However, the entropy-constrained solution is unique and continuous with respect to \(\mathbf{M}\) [7]. While \(\varphi _{\lambda }({\varvec{p}}, {\varvec{q}})\) converges to \(\varphi _0({\varvec{p}}, {\varvec{q}})\) as \(\lambda \rightarrow 0\), \(\varphi _0({\varvec{p}}, {\varvec{q}})\) is not necessarily differentiable with respect to \({\varvec{p}}\) and \({\varvec{q}}\).
- 2. Integrated information theory of consciousness
Given a joint probability distribution \(\mathbf{P}\), the amount of integrated information is measured by the amount of interactions of information among different terminals. We used a disconnected model in which no information is transferred through branches connecting different terminals. The geometric measure of integrated information theory is given by the KL-divergence from \(\mathbf{P}\) to the submanifold of disconnected models [13, 14]. However, the Wasserstein divergence can be considered as such a measure when the cost of transferring information through different terminals depends on the physical positions of the terminals [15]. We can use the entropy-constrained divergence \(D_{\lambda }\) to define the amount of information integration.
- 3. f-divergence
We used the KL-divergence in a dually flat manifold to define \(D_{\lambda }\). It is possible to use other divergences instead, for example the f-divergence; we would obtain similar results.
- 4. q-entropy
Muzellec et al. used the \(\alpha \)-entropy (Tsallis q-entropy) instead of the Shannon entropy for regularization [16]. This yields the q-entropy-relaxed framework.
- 5. Comparison of \(C_{\lambda }\) and \(D_{\lambda }\)
Although \(D_{\lambda }\) satisfies the criterion of a divergence, it might differ considerably from the original \(C_{\lambda }\). In particular, when \(C_\lambda ({\varvec{p}}, {\varvec{q}})\) includes a piecewise linear term such as \(\sum d_i|p_i - q_i|\) for constant \(d_i\), \(D_\lambda \) defined in Eq. (52) eliminates this term. When this term is important, we can use \(\{ C_\lambda ({\varvec{p}}, {\varvec{q}}) \}^2\) instead of \(C_\lambda ({\varvec{p}}, {\varvec{q}})\) for defining a new divergence \(D_\lambda \) in Eq. (52). In our accompanying paper [17], we define a new type of divergence that retains the properties of \(C_{\lambda }\) and is closer to \(C_{\lambda }\).
References
- 1. Amari, S.: Information Geometry and Its Applications. Springer, Japan (2016)
- 2. Chentsov, N.N.: Statistical Decision Rules and Optimal Inference. Nauka (1972); English translation: AMS (1982)
- 3. Rao, C.R.: Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91 (1945)
- 4. Ay, N., Jost, J., Le, H.V., Schwachhöfer, L.: Information Geometry. Springer, Cham (2017)
- 5. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser, Basel (2015)
- 6. Villani, C.: Topics in Optimal Transportation. Graduate Studies in Mathematics. AMS, Providence (2013)
- 7. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)
- 8. Cuturi, M., Avis, D.: Ground metric learning. J. Mach. Learn. Res. 15, 533–564 (2014)
- 9. Cuturi, M., Peyré, G.: A smoothed dual approach for variational Wasserstein problems. SIAM J. Imaging Sci. 9, 320–343 (2016)
- 10. Montavon, G., Müller, K.-R., Cuturi, M.: Wasserstein training for Boltzmann machines (2015). arXiv:1507.01972v1
- 11. Belavkin, R.V.: Optimal measures and Markov transition kernels. J. Glob. Optim. 55, 387–416 (2013)
- 12. Sinkhorn, R.: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 35, 876–879 (1964)
- 13. Oizumi, M., Tsuchiya, N., Amari, S.: Unified framework for information integration based on information geometry. Proc. Natl. Acad. Sci. 113, 14817–14822 (2016)
- 14. Amari, S., Tsuchiya, N., Oizumi, M.: Geometry of information integration (2017). arXiv:1709.02050
- 15. Oizumi, M., Albantakis, L., Tononi, G.: From the phenomenology to the mechanisms of consciousness: integrated information theory 3.0. PLoS Comput. Biol. 10, e1003588 (2014)
- 16. Muzellec, B., Nock, R., Patrini, G., Nielsen, F.: Tsallis regularized optimal transport and ecological inference (2016). arXiv:1609.04495v1
- 17. Amari, S., Karakida, R., Oizumi, M., Cuturi, M.: New divergence derived from Cuturi function (in preparation)
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.