# The Wasserstein Space

## Abstract

The Kantorovich problem described in the previous chapter gives rise to a metric structure, the *Wasserstein distance*, in the space of probability measures \(P(\mathcal X)\) on a space \(\mathcal X\). The resulting metric space, a subspace of \(P(\mathcal X)\), is commonly known as the *Wasserstein space* \(\mathcal W\) (although, as Villani [125, pages 118–119] puts it, this terminology is “very questionable”; see also Bobkov and Ledoux [25, page 4]). In Chap. 4, we shall see that this metric is in a sense canonical when dealing with warpings, that is, deformations of the space \(\mathcal X\) (for example, in Theorem 4.2.4). In this chapter, we give the fundamental properties of the Wasserstein space. After some basic definitions, we describe the topological properties of that space in Sect. 2.2. It is then explained in Sect. 2.3 how \(\mathcal W\) can be endowed with a sort of infinite-dimensional Riemannian structure. Measurability issues are dealt with in the somewhat technical Sect. 2.4.

## 2.1 Definition, Notation, and Basic Properties

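Let \(\mathcal X\) be a separable Banach space. The *p-Wasserstein space* on \(\mathcal X\) is defined as

$$\displaystyle \begin{aligned} \mathcal W_p(\mathcal X) =\left\{\mu\in P(\mathcal X): \int_{\mathcal X}\|x\|{}^p\,\mathrm{d}\mu(x)<\infty\right\}, \qquad p\ge 1. \end{aligned}$$

We sometimes abbreviate \(\mathcal W_p(\mathcal X)\) to \(\mathcal W_p\). Recall that if \(\mu,\nu\in P(\mathcal X)\), then *Π*(*μ*, *ν*) is defined to be the set of measures \(\pi \in P(\mathcal X^2)\) having *μ* and *ν* as marginals in the sense of (1.2). The *p-Wasserstein distance* between *μ* and *ν* is defined as the minimal total transportation cost between *μ* and *ν* in the Kantorovich problem with respect to the cost function *c*_{p}(*x*, *y*) = ∥*x* − *y*∥^{p}:

$$\displaystyle \begin{aligned} W_p(\mu,\nu) =\left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal X\times\mathcal X}\|x-y\|{}^p\,\mathrm{d}\pi(x,y)\right)^{1/p}. \end{aligned}$$

The Wasserstein distance between *μ* and *ν* is finite when both measures are in \(\mathcal W_p(\mathcal X)\), because

$$\displaystyle \begin{aligned} \|x-y\|{}^p\le 2^{p-1}\|x\|{}^p+2^{p-1}\|y\|{}^p. \end{aligned}$$

Thus *W*_{p} is finite on \([\mathcal W_p(\mathcal X)]^2=\mathcal W_p(\mathcal X)\times \mathcal W_p(\mathcal X)\); it is nonnegative and symmetric and it is easy to see that *W*_{p}(*μ*, *ν*) = 0 if and only if *μ* = *ν*. A proof that *W*_{p} is a metric (satisfies the triangle inequality) on \(\mathcal W_p\) can be found in Villani [124, Chapter 7].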

The aforementioned setting is by no means the most general one can consider. Firstly, one can define *W*_{p} and \(\mathcal W_p\) for 0 < *p* < 1 by removing the power 1∕*p* from the infimum and the limit case *p* = 0 yields the total variation distance. Another limit case can be defined as *W*_{∞}(*μ*, *ν*) =lim_{p→∞}*W*_{p}(*μ*, *ν*). Moreover, *W*_{p} and \(\mathcal W_p\) can be defined whenever \(\mathcal X\) is a complete and separable metric space (or even only separable; see Clément and Desch [36]): one fixes some *x*_{0} in \(\mathcal X\) and replaces ∥*x*∥ by *d*(*x*, *x*_{0}). Although the topological properties below still hold at that level of generality (except when *p* = 0 or *p* = *∞*), for the sake of simplifying the notation we restrict the discussion to Banach spaces. It will always be assumed without explicit mention that 1 ≤ *p* < *∞*.

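The space \(\mathcal W_p(\mathcal X)\) is precisely the collection of measures *μ* such that *W*_{p}(*μ*, *δ*_{0}) < *∞*, with *δ*_{x} being a Dirac measure at *x*. Of course, *W*_{p}(*μ*, *ν*) can be finite even if \(\mu ,\nu \notin \mathcal W_p(\mathcal X)\). But if \(\mu \in \mathcal W_p(\mathcal X)\) and \(\nu \notin \mathcal W_p(\mathcal X)\), then *W*_{p}(*μ*, *ν*) is always infinite. This can be seen from the triangle inequality

$$\displaystyle \begin{aligned} W_p(\nu,\delta_0)\le W_p(\nu,\mu)+W_p(\mu,\delta_0). \end{aligned}$$

The Wasserstein spaces are ordered in the sense that if *q* ≥ *p*, then \(\mathcal W_q(\mathcal X)\subseteq \mathcal W_p(\mathcal X)\). This property extends to the distances in the form:

$$\displaystyle \begin{aligned} q\ge p\ge 1 \quad \Longrightarrow\quad W_q(\mu,\nu)\ge W_p(\mu,\nu). \end{aligned}$$

To see this, let *π* ∈ *Π*(*μ*, *ν*) be optimal with respect to *q*. Jensen’s inequality for the convex function *z*↦*z*^{q∕p} gives

$$\displaystyle \begin{aligned} W_q^q(\mu,\nu) =\int_{\mathcal X^2}\|x-y\|{}^q\,\mathrm{d}\pi(x,y) \ge\left(\int_{\mathcal X^2}\|x-y\|{}^p\,\mathrm{d}\pi(x,y)\right)^{q/p} \ge W_p^q(\mu,\nu). \end{aligned}$$

The converse inequality fails in general, since it is possible that *W*_{p} be finite while *W*_{q} is infinite. A converse can be established, however, if *μ* and *ν* are bounded:

$$\displaystyle \begin{aligned} q\ge p\ge 1,\quad \mu(K)=\nu(K)=1 \quad \Longrightarrow\quad W_q(\mu,\nu)\le d_K^{1-p/q}\,W_p^{p/q}(\mu,\nu), \qquad d_K=\sup_{x,y\in K}\|x-y\|. \end{aligned}$$

Indeed, denote the supremum by *d*_{K} and let *π* be now optimal with respect to *p*; then *π*(*K* × *K*) = 1 and

$$\displaystyle \begin{aligned} W_q^q(\mu,\nu) \le\int_{K\times K}\|x-y\|{}^q\,\mathrm{d}\pi(x,y) \le d_K^{q-p}\int_{K\times K}\|x-y\|{}^p\,\mathrm{d}\pi(x,y) =d_K^{q-p}\,W_p^p(\mu,\nu). \end{aligned}$$

Another useful property of the Wasserstein distance is the upper bound

$$\displaystyle \begin{aligned} W_p(\mathbf t\#\mu,\mathbf s\#\mu) \le\left(\int_{\mathcal X}\|\mathbf t(x)-\mathbf s(x)\|{}^p\,\mathrm{d}\mu(x)\right)^{1/p} \end{aligned} $$(2.3)

for any pair of measurable functions \(\mathbf t,\mathbf s:\mathcal X\to\mathcal X\). Situations where this inequality holds as equality and **t** and **s** are optimal maps are related to *compatibility* of the measures *μ*, *ν* = **t**#*μ* and *ρ* = **s**#*μ* (see Sect. 2.3.2) and will be of conceptual importance in the context of Fréchet means (see Sect. 3.1).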

We also recall the notation *B*_{R}(*x*_{0}) = {*x* : ∥*x* − *x*_{0}∥ < *R*} and \(\overline B_R(x_0)=\{x:\|x-x_0\|\le R\}\) for open and closed balls in \(\mathcal X\).
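The ordering of the distances in *p* and its bounded-support converse can be checked numerically on the real line. The following is a minimal sketch, assuming two equal-size samples with uniform weights, for which the sorted (monotone) coupling is optimal for every *p* ≥ 1; the helper name `wasserstein_1d` and the sample choices are illustrative and not part of the text.

```python
import random


def wasserstein_1d(xs, ys, p):
    """W_p between the empirical measures of two equal-size samples on the real line."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In one dimension the monotone (sorted) coupling is optimal for every p >= 1.
    pairs = zip(sorted(xs), sorted(ys))
    return (sum(abs(x - y) ** p for x, y in pairs) / len(xs)) ** (1.0 / p)


rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(500)]
ys = [rng.uniform(0.0, 1.0) ** 2 for _ in range(500)]

w1 = wasserstein_1d(xs, ys, 1)
w2 = wasserstein_1d(xs, ys, 2)
w4 = wasserstein_1d(xs, ys, 4)
d_K = 1.0  # both samples live in K = [0, 1], whose diameter is 1

print("W_1 <= W_2 <= W_4:", w1, w2, w4)
print("bound on W_4 from W_1:", d_K ** (1 - 1 / 4) * w1 ** (1 / 4))
```

The printed values should be nondecreasing in *p*, and *W*_{4} should not exceed \(d_K^{1-p/q}W_p^{p/q}\) with *p* = 1 and *q* = 4.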

## 2.2 Topological Properties

### 2.2.1 Convergence, Compact Subsets

The topology of a space is determined by the collection of its closed sets. Since \(\mathcal W_p(\mathcal X)\) is a metric space, whether a set is closed or not depends on which sequences in \(\mathcal W_p(\mathcal X)\) converge. The following characterisation from Villani [124, Theorem 7.12] will be very useful.

### Theorem 2.2.1 (Convergence in Wasserstein Space)

*Let* \(\mu ,\mu _n\in \mathcal W_p(\mathcal X)\)*. Then the following are equivalent:*

1. *W*_{p}(*μ*_{n}, *μ*) → 0 *as n* → *∞;*
2. *μ*_{n} → *μ weakly and* \({\int _{\mathcal {X}}{\|x\|{ }^p} \mathrm {d}{\mu _n(x)}} \to \int _{\mathcal {X}}{\|x\|{ }^p}\mathrm {d}{\mu (x)}\)*;*
3. *μ*_{n} → *μ weakly and*

   $$\displaystyle \begin{aligned} \sup_n{\int_{\{x:\|x\|>R\}}^{} \! \|x\|{}^p \, \mathrm{d}\mu_n(x)} \to 0, \qquad R\to\infty; \end{aligned} $$(2.4)
4. *for any C* > 0 *and any continuous* \(f:\mathcal X\to \mathbb {R}\) *such that* |*f*(*x*)| ≤ *C*(1 + ∥*x*∥^{p}) *for all x,*

   $$\displaystyle \begin{aligned} {\int_{\mathcal{X}}{f(x)}\mathrm{d}{\mu_n(x)}} \to{\int_{\mathcal{X}}{f(x)}\mathrm{d}{\mu(x)}}; \end{aligned}$$
5. *(Le Gouic and Loubes [*87*, Lemma 14]) μ*_{n} → *μ weakly and there exists* \(\nu \in \mathcal W_p(\mathcal X)\) *such that W*_{p}(*μ*_{n}, *ν*) → *W*_{p}(*μ*, *ν*).

Consequently, the Wasserstein topology is finer than the weak topology induced on \(\mathcal W_p(\mathcal X)\) from \(P(\mathcal X)\). Indeed, let \(\mathcal A\subseteq \mathcal W_p(\mathcal X)\) be weakly closed. If \(\mu _n\in \mathcal A\) converge to *μ* in \(\mathcal W_p(\mathcal X)\), then *μ*_{n} → *μ* weakly, so \(\mu \in \mathcal A\). In other words, the Wasserstein topology has more closed sets than the induced weak topology. Moreover, each \(\mathcal W_p(\mathcal X)\) is a weakly closed subset of \(P(\mathcal X)\) by the same arguments that lead to ( 1.3). In view of Theorem 2.2.1, a common strategy to establish Wasserstein convergence is to first show tightness and obtain weak convergence, hence a candidate limit, and then show that the stronger Wasserstein convergence actually holds. In some situations, the last part is automatic:

### Corollary 2.2.2

*Let* \(K\subset \mathcal X\) *be a bounded set and suppose that μ*_{n}(*K*) = 1 *for all n *≥ 1*. Then W*_{p}(*μ*_{n}, *μ*) → 0 *if and only if μ*_{n} →* μ weakly.*

### Proof

This is immediate from (2.4).

If *μ*_{n} → *μ* and *ν*_{n} → *ν* in \(\mathcal W_p(\mathcal X)\), then it is obvious that *W*_{p}(*μ*_{n}, *ν*_{n}) → *W*_{p}(*μ*, *ν*). But if the convergence is only weak, then the Wasserstein distance is still lower semicontinuous:

$$\displaystyle \begin{aligned} \liminf_{n\to\infty}W_p(\mu_n,\nu_n)\ge W_p(\mu,\nu). \end{aligned}$$

Before giving some examples, it will be convenient to formulate Theorem 2.2.1 in probabilistic terms. Let *X*, *X*_{n} be random elements on \(\mathcal X\) with laws \(\mu ,\mu _n\in \mathcal W_p(\mathcal X)\). Assume without loss of generality that *X*, *X*_{n} are defined on the same probability space \((\varOmega ,\mathcal F,\mathbb {P})\) and write *W*_{p}(*X*_{n}, *X*) to denote *W*_{p}(*μ*_{n}, *μ*). Then *W*_{p}(*X*_{n}, *X*) → 0 if and only if *X*_{n} → *X* weakly and \(\mathbb {E} \|X_n\|{ }^p\to \mathbb {E} \|X\|{ }^p\).
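The moment condition cannot be dropped. A standard illustration (not taken from the text) is the sequence *μ*_{n} = (1 − 1∕*n*)*δ*_{0} + (1∕*n*)*δ*_{n}, which converges weakly to *δ*_{0} because the mass sitting at *n* vanishes, while its *p*-th moments do not converge to zero. The sketch below, with hypothetical helper names, evaluates *W*_{p}(*μ*_{n}, *δ*_{0}) exactly, using the fact that the only coupling with a Dirac measure is the product measure.

```python
def wp_to_dirac0(atoms, weights, p):
    """W_p(mu, delta_0) for mu = sum_i weights[i] * delta_{atoms[i]} on the real line.

    The only coupling with a Dirac measure is the product measure, so the
    distance is the p-th root of the p-th absolute moment of mu.
    """
    return sum(w * abs(x) ** p for x, w in zip(atoms, weights)) ** (1.0 / p)


def mu_n(n):
    """Atoms and weights of mu_n = (1 - 1/n) delta_0 + (1/n) delta_n."""
    return [0.0, float(n)], [1.0 - 1.0 / n, 1.0 / n]


for n in (10, 100, 1000):
    atoms, weights = mu_n(n)
    # Mass away from 0 shrinks, yet W_1 stays at 1 and W_2 grows like sqrt(n).
    print(n, weights[1], wp_to_dirac0(atoms, weights, 1), wp_to_dirac0(atoms, weights, 2))
```

Here weak convergence holds but *W*_{1}(*μ*_{n}, *δ*_{0}) = 1 for all *n* and *W*_{2}(*μ*_{n}, *δ*_{0}) = √*n*, so there is no convergence in any \(\mathcal W_p\).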

An early example of the use of Wasserstein metric in statistics is due to Bickel and Freedman [21]. Let *X*_{n} be independent and identically distributed random variables with mean zero and variance 1 and let *Z* be a standard normal random variable. Then \(Z_n=\sum _{i=1}^nX_i/\sqrt {n}\) converge weakly to *Z* by the central limit theorem. But \(\mathbb {E} Z_n^2=1=\mathbb {E} Z^2\), so *W*_{2}(*Z*_{n}, *Z*) → 0. Let \(Z_n^*\) be a bootstrapped version of *Z*_{n} constructed by resampling the *X*_{n}’s. If \(W_2(Z_n^*,Z_n)\to 0\), then \(W_2(Z_n^*,Z)\to 0\) and in particular \(Z_n^*\) has the same asymptotic distribution as *Z*_{n}.

… *p* ≤ 2 by the last condition of the theorem. If in addition \(\mathbb {E} X_1^4<\infty \), then *W*_{4}(*Z*_{n}, *Z*) → 0 and all moments up to order 4 converge.
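The convergence *W*_{2}(*Z*_{n}, *Z*) → 0 in the central limit example can be observed by simulation. The following is a rough Monte Carlo sketch under assumptions that are not in the text: the *X*_{i} are taken to be Rademacher signs, the law of *Z*_{n} is replaced by an empirical sample of size `m`, and the standard normal law is discretised by its quantiles at midpoints. The reported numbers therefore include sampling and discretisation error.

```python
import random
import statistics


def w2_sorted(xs, ys):
    """W_2 between two equal-size empirical measures on the line (sorted coupling)."""
    pairs = zip(sorted(xs), sorted(ys))
    return (sum((x - y) ** 2 for x, y in pairs) / len(xs)) ** 0.5


def w2_to_normal(n, m=2000, seed=0):
    """Monte Carlo estimate of W_2(Z_n, Z), Z_n a normalised sum of n Rademacher signs."""
    rng = random.Random(seed)
    z_n = [sum(rng.choice((-1.0, 1.0)) for _ in range(n)) / n ** 0.5 for _ in range(m)]
    # Discretise N(0, 1) by its quantiles at the midpoints (i + 1/2) / m.
    normal = statistics.NormalDist()
    z = [normal.inv_cdf((i + 0.5) / m) for i in range(m)]
    return w2_sorted(z_n, z)


for n in (1, 4, 16, 64):
    print(n, round(w2_to_normal(n), 3))
```

The estimates should decrease as *n* grows, levelling off at the resolution allowed by the sample size `m`.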

Condition (2.4) is called *uniform integrability* of the function *x*↦∥*x*∥^{p} with respect to the collection (*μ*_{n}). Of course, it holds for a single measure \(\mu \in \mathcal W_p(\mathcal X)\) by the dominated convergence theorem. This condition allows us to characterise compact sets in the Wasserstein space. One should beware that when \(\mathcal X\) is infinite-dimensional, (2.4) alone is not sufficient in order to conclude that *μ*_{n} has a convergent subsequence: take *μ*_{n} to be Dirac measures at *e*_{n} with (*e*_{n}) an orthonormal basis of a Hilbert space \(\mathcal X\) (or any sequence with ∥*e*_{n}∥ = 1 that has no convergent subsequence, if \(\mathcal X\) is a Banach space). The uniform integrability (2.4) must be accompanied with tightness, which is a consequence of (2.4) only when \(\mathcal X=\mathbb {R}^d\).

### Proposition 2.2.3 (Compact Sets in \(\mathcal W_p\))

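*A weakly tight set* \(\mathcal K\subseteq \mathcal W_p\) *is Wasserstein-tight (has a compact closure in* \(\mathcal W_p\)*) if and only if*

$$\displaystyle \begin{aligned} \lim_{R\to\infty}\,\sup_{\mu\in\mathcal K}\int_{\{x:\|x\|>R\}}\|x\|{}^p\,\mathrm{d}\mu(x)=0. \end{aligned} $$(2.6)

*Moreover,* (2.6) *is equivalent to the existence of a monotonically divergent function* \(g:\mathbb R_+\to \mathbb R_+\) *such that*

$$\displaystyle \begin{aligned} \sup_{\mu\in\mathcal K}\int_{\mathcal X}\|x\|{}^p\,g(\|x\|)\,\mathrm{d}\mu(x)<\infty. \end{aligned}$$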

The proof is on page 41 of the supplement.

### Remark 2.2.4

*For any sequence* (*μ*_{n}) *in* \(\mathcal W_p\) *(tight or not) there exists a monotonically divergent g with* \({\int _{\mathcal X} \|x\|{ }^p g(\|x\|) \, \mathrm {d}\mu _n(x)}<\infty \) *for all n.*

### Corollary 2.2.5 (Measures with Common Support)

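*Let* \(K\subseteq \mathcal X\) *be a compact set. Then*

$$\displaystyle \begin{aligned} \mathcal K=\mathcal W_p(K)=\{\mu\in P(\mathcal X):\mu(K)=1\} \end{aligned}$$

*is compact.*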

### Proof

This is immediate, since \(\mathcal K\) is weakly tight and the supremum in (2.6) vanishes when *R* is larger than the finite quantity sup_{x ∈ K}∥*x*∥. Finally, *K* is closed, so \(\mathcal K\) is weakly closed, hence Wasserstein closed, by the portmanteau Lemma 1.7.1.

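Uniform integrability (2.4) also implies *uniform absolute continuity*:

$$\displaystyle \begin{aligned} \forall \epsilon>0\ \ \exists \delta>0\ \ \forall n\ \ \forall A\subseteq\mathcal X \text{ Borel}: \qquad \mu_n(A)\le\delta \quad\Longrightarrow\quad \int_A\|x\|{}^p\,\mathrm{d}\mu_n(x)<\epsilon. \end{aligned}$$

Indeed, given *𝜖* > 0, choose *R* = *R*_{𝜖} > 0 such that the supremum in (2.4) is smaller than *𝜖*∕2, and set *δ* = *𝜖*∕(2*R*^{p}). If *μ*_{n}(*A*) ≤ *δ*, then

$$\displaystyle \begin{aligned} \int_A\|x\|{}^p\,\mathrm{d}\mu_n(x) \le\int_{A\cap\{x:\|x\|\le R\}}\|x\|{}^p\,\mathrm{d}\mu_n(x) +\int_{\{x:\|x\|>R\}}\|x\|{}^p\,\mathrm{d}\mu_n(x) <\delta R^p+\epsilon/2=\epsilon. \end{aligned}$$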

### 2.2.2 Dense Subsets and Completeness

If we identify a measure \(\mu \in \mathcal W_p(\mathcal X)\) with a random variable *X* (having distribution *μ*), then *X* has a finite *p*-th moment in the sense that the real-valued random variable ∥*X*∥ is in *L*_{p}. In view of that, it should not come as a surprise that \(\mathcal W_p(\mathcal X)\) enjoys topological properties similar to *L*_{p} spaces. In this subsection, we give some examples of useful dense subsets of \(\mathcal W_p(\mathcal X)\) and then “show” that like \(\mathcal X\) itself, it is a complete separable metric space. In the next subsection, we describe some of the negative properties that \(\mathcal W_p(\mathcal X)\) has, again in similarity with *L*_{p} spaces.

We first show that \(\mathcal W_p(\mathcal X)\) is separable. The core idea of the proof is the feasibility of approximating any measure with discrete measures as follows.

Let *μ* be a probability measure on \(\mathcal X\), and let *X*_{1}, *X*_{2}, … be a sequence of independent random elements in \(\mathcal X\) with probability distribution *μ*. Then the *empirical measure* *μ*_{n} is defined as the random measure \((1/n)\sum _{i=1}^n\delta \{X_i\}\). The law of large numbers shows that for any (measurable) bounded or nonnegative \(f:\mathcal X\to \mathbb {R}\), almost surely

$$\displaystyle \begin{aligned} \int_{\mathcal X}f(x)\,\mathrm{d}\mu_n(x)=\frac 1n\sum_{i=1}^n f(X_i)\to \mathbb{E} f(X_1)=\int_{\mathcal X}f(x)\,\mathrm{d}\mu(x). \end{aligned}$$

In particular, when *f*(*x*) = ∥*x*∥^{p}, we obtain convergence of moments of order *p*. Hence by Theorem 2.2.1, if \(\mu \in \mathcal W_p(\mathcal X)\), then *μ*_{n} → *μ* in \(\mathcal W_p(\mathcal X)\) if and only if *μ*_{n} → *μ* weakly. We know that integrals of bounded functions converge with probability one, but the null set where convergence fails may depend on the chosen function and there are uncountably many such functions. When \(\mathcal X=\mathbb {R}^d\), by the portmanteau Lemma 1.7.1 we can replace the collection \(C_b(\mathcal X)\) by indicator functions of rectangles of the form (−*∞*, *a*_{1}] ×⋯ × (−*∞*, *a*_{d}] for \(a=(a_1,\dots ,a_d)\in \mathbb {R}^d\). It turns out that the countable collection provided by rational vectors *a* suffices (see the proof of Theorem 4.4.1 where this is done in a more complicated setting). For more general spaces \(\mathcal X\), we need to find another countable collection {*f*_{j}} such that convergence of the integrals of *f*_{j} for all *j* suffices for weak convergence. Such a collection exists, by using bounded Lipschitz functions (Dudley [47, Theorem 11.4.1]); an alternative construction can be found in Ambrosio et al. [12, Section 5.1]. Thus:

### Proposition 2.2.6 (Empirical Measures in \(\mathcal W_p\))

*For any* \(\mu \in P(\mathcal X)\) *and the corresponding sequence of empirical measures μ*_{n}*, W*_{p}(*μ*_{n}, *μ*) → 0 *almost surely if and only if* \(\mu \in \mathcal W_p(\mathcal X)\).

Indeed, if \(\mu \notin \mathcal W_p(\mathcal X)\), then *W*_{p}(*μ*_{n}, *μ*) is infinite for all *n*, since *μ*_{n} is compactly supported, hence in \(\mathcal W_p(\mathcal X)\).
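On the real line the convergence of empirical measures can be made concrete. The sketch below assumes *μ* is the uniform distribution on [0, 1] (an illustrative choice) and uses the one-dimensional quantile representation of *W*_{2}: the empirical quantile function equals the *i*-th order statistic on ((*i* − 1)∕*n*, *i*∕*n*], whereas the uniform quantile function is the identity, so the distance has a closed form. The function name is hypothetical.

```python
import random


def w2_empirical_vs_uniform(sample):
    """Exact W_2 between the empirical measure of `sample` and Uniform[0, 1].

    Integrates (u - x_(i))^2 over ((i - 1)/n, i/n] for each order statistic x_(i).
    """
    n = len(sample)
    total = 0.0
    for i, x in enumerate(sorted(sample), start=1):
        lo, hi = (i - 1) / n, i / n
        # Closed form of the integral of (u - x)^2 over [lo, hi].
        total += ((hi - x) ** 3 - (lo - x) ** 3) / 3.0
    return total ** 0.5


rng = random.Random(0)
for n in (10, 100, 1000, 10000):
    print(n, w2_empirical_vs_uniform([rng.random() for _ in range(n)]))
```

The printed distances should shrink as *n* grows, in line with Proposition 2.2.6; the rate observed in such an experiment is specific to this example.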

Proposition 2.2.6 is the basis for constructing dense subsets of the Wasserstein space.

### Theorem 2.2.7 (Dense Subsets of \(\mathcal W_p\))

*The following collections of measures are dense in* \(\mathcal W_p(\mathcal X)\)*:*

1. *finitely supported measures with rational weights;*
2. *compactly supported measures;*
3. *finitely supported measures with rational weights on a dense subset* \(A\subseteq \mathcal X\)*;*
4. *if* \(\mathcal X=\mathbb {R}^d\)*, the collection of absolutely continuous and compactly supported measures;*
5. *if* \(\mathcal X=\mathbb {R}^d\)*, the collection of absolutely continuous measures with strictly positive and bounded analytic densities.*

*In particular,* \(\mathcal W_p\) *is separable (the third set is countable as* \(\mathcal X\) *is separable).*

This is a simple consequence of Proposition 2.2.6 and approximations, and the details are given on page 43 in the supplement.

### Proposition 2.2.8 (Completeness)

*The Wasserstein space* \(\mathcal W_p(\mathcal X)\) *is complete.*

One may find two different proofs in Villani [125, Theorem 6.18] and Ambrosio et al. [12, Proposition 7.1.5]. On page 43 of the supplement, we sketch an alternative argument based on completeness of the weak topology.

### 2.2.3 Negative Topological Properties

In the previous subsection, we have shown that \(\mathcal W_p(\mathcal X)\) is separable and complete like *L*_{p} spaces. Just like them, however, the Wasserstein space is neither locally compact nor *σ*-compact. For this reason, existence proofs of Fréchet means in \(\mathcal W_p(\mathcal X)\) require tools that are more specific to this space, and do not rely upon local compactness (see Sect. 3.1).

### Proposition 2.2.9 (\(\mathcal W_p\) is Not Locally Compact)

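*Let* \(\mu \in \mathcal W_p(\mathcal X)\) *and let 𝜖* > 0*. Then the Wasserstein ball*

$$\displaystyle \begin{aligned} \overline B_\epsilon(\mu)=\{\nu\in\mathcal W_p(\mathcal X): W_p(\mu,\nu)\le\epsilon\} \end{aligned}$$

*is not compact.*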

Ambrosio et al. [12, Remark 7.1.9] show this when *μ* is a Dirac measure, and we extend their argument on page 43 of the supplement.

From this, we deduce:

### Corollary 2.2.10

*The Wasserstein space* \(\mathcal W_p(\mathcal X)\) *is not σ-compact.*

### Proof

If \(\mathcal K\) is a compact set in \(\mathcal W_p(\mathcal X)\), then its interior is empty by Proposition 2.2.9. A countable union of compact sets has an empty interior (hence cannot equal the entire space \(\mathcal W_p(\mathcal X)\)) by the Baire property, which holds on the complete metric space \(\mathcal W_p(\mathcal X)\) by the Baire category theorem (Dudley [47, Theorem 2.5.2]).

### 2.2.4 Covering Numbers

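Let \(\mathcal K\subseteq \mathcal W_p(\mathbb {R}^d)\) be compact. Then for any *𝜖* > 0 the covering number

$$\displaystyle \begin{aligned} N(\epsilon;\mathcal K)=\min\left\{n:\ \exists\,\mu_1,\dots,\mu_n\in\mathcal W_p(\mathbb{R}^d) \text{ such that } \mathcal K\subseteq\bigcup_{i=1}^n\{\mu:W_p(\mu,\mu_i)<\epsilon\}\right\} \end{aligned}$$

is finite. By Proposition 2.2.3, there exists a monotonically divergent function *f* : [0, *∞*) → [0, *∞*] such that

$$\displaystyle \begin{aligned} \sup_{\mu\in\mathcal K}\int_{\mathbb{R}^d}\|x\|{}^p f(\|x\|)\,\mathrm{d}\mu(x)\le 1. \end{aligned}$$

The function *f* provides a certain measure of how compact \(\mathcal K\) is. If \(\mathcal K=\mathcal W_p(K)\) is the set of measures supported on a compact \(K\subseteq \mathbb {R}^d\), then *f*(*L*) can be taken infinite for *L* large, and *L* can be treated as a constant in the theorem. Otherwise *L* increases as *𝜖* → 0, at a speed that depends on *f*: the faster *f* diverges, the slower *L* grows with decreasing *𝜖* and the better the bound becomes.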

### Theorem 2.2.11

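*Let 𝜖* > 0 *and L* = *f*^{−1}(1∕*𝜖*^{p})*. If d𝜖* ≤ *L, then*

$$\displaystyle \begin{aligned} \log N(\epsilon;\mathcal K)\le C_1(d)\left(\frac L\epsilon\right)^d\left[(p+d)\log\frac L\epsilon+C_2(d,p)\right] \end{aligned}$$

*with C*_{1}(*d*) = 3^{d}*eθ*_{d}*,* \(C_2(d,p)=(p+d)\log 3+ (p+2)\log 2 + \log \theta _d\) *and* \(\theta _d=d[5+\log d+\log \log d]\).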

Since *𝜖* > 0 is small and *L* increases as *𝜖* decreases, the restriction that *d𝜖* ≤ *L* is typically not binding. We provide some examples before giving the proof.

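Example 1: if all the measures in \(\mathcal K\) are supported on the *d*-dimensional unit ball, then *L* can be taken equal to one, independently of *𝜖*. We obtain […]

Example 2: if the measures in \(\mathcal K\) have uniformly bounded exponential moments, then *f*(*L*) = *e*^{L} and \(\widetilde N(\epsilon )\) is a constant times *𝜖*^{−d}[log 1∕*𝜖*]^{d}. The exponent *p* appears only in the constant.

Example 3: suppose that the measures in \(\mathcal K\) have uniformly bounded moments of order *p* + *δ*, that is, *f*(*L*) = *L*^{δ}. Then *L* ∼ *𝜖*^{−p∕δ} and […] In this case (*δ* < *∞*) the behaviour of \(\widetilde N(\epsilon )\) depends more strongly upon *p*: if *p′* < *p*, then we can replace *δ* by *δ′* = *δ* + *p* − *p′* > *δ*, leading to a smaller magnitude of \(\widetilde N(\epsilon )\).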

Example 4: if *f*(*L*) is only \(\log L\), then \(\widetilde N\) behaves like \(\epsilon ^{-(d+p)}\exp (\epsilon ^{-pd})\), so *p* has a very dominant effect.
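As a rough worked illustration of how the truncation level *L* = *f*^{−1}(1∕*𝜖*^{p}) of Theorem 2.2.11 scales for the three choices of *f* appearing in the examples (the numerical values *p* = 2, *𝜖* = 0.1 and *δ* = 1 are picked here purely for illustration), inverting *f* gives

$$\displaystyle \begin{aligned} f(L)=e^{L}: &\quad L=p\log(1/\epsilon)=2\log 10\approx 4.6;\\ f(L)=L^{\delta}: &\quad L=\epsilon^{-p/\delta}=100;\\ f(L)=\log L: &\quad L=\exp(\epsilon^{-p})=e^{100}\approx 2.7\times 10^{43}. \end{aligned}$$

The slower *f* diverges, the larger the ball that has to be covered, which is what drives the deterioration of the bound from one example to the next.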

### Proof

The proof is divided into four steps.

**Step 1: Compact support.** Let \(P_L:\mathbb {R}^d\to \mathbb {R}^d\) be the projection onto \(\overline B_L(0)=\{x\in \mathbb {R}^d:\|x\|\le L\}\) and let \(\mu \in \mathcal K\). Then, by (2.3),

$$\displaystyle \begin{aligned} W_p^p(\mu,P_L\#\mu)\le\int_{\mathbb{R}^d}\|x-P_L(x)\|{}^p\,\mathrm{d}\mu(x)\le\int_{\{x:\|x\|>L\}}\|x\|{}^p\,\mathrm{d}\mu(x), \end{aligned}$$

and the right-hand side vanishes as *L* → *∞*.

**Step 2: *n*-Point measures.** Let *n* = *N*(*𝜖*; *B*_{L}(0)) be the covering number of the Euclidean ball in \(\mathbb {R}^d\). There exists a set \(x_1,\dots ,x_n\in \mathbb {R}^d\) such that *B*_{L}(0) ⊆∪*B*_{𝜖}(*x*_{i}). If \(\mu \in \mathcal W_p(B_L(0))\), there exists a measure *μ*_{n} supported on the *x*_{i}’s and such that *W*_{p}(*μ*, *μ*_{n}) ≤ *𝜖*. Indeed, let *C*_{1} = *B*_{𝜖}(*x*_{1}), *C*_{i} = *B*_{𝜖}(*x*_{i}) ∖∪_{j<i}*B*_{𝜖}(*x*_{j}) and define *μ*_{n}({*x*_{i}}) = *μ*(*C*_{i}). The transport map defined by **t**(*x*) = *x*_{i} for *x* ∈ *C*_{i} pushes *μ* forward to *μ*_{n} and

$$\displaystyle \begin{aligned} W_p^p(\mu,\mu_n)\le\int\|x-\mathbf t(x)\|{}^p\,\mathrm{d}\mu(x)\le\epsilon^p. \end{aligned}$$

[…] *𝜖* ≤ *L*∕*d*.

**Step 3: Common weights.** If \(\mu =\sum a_k\delta _{x_k}\) and \(\nu =\sum b_k\delta _{x_k}\), then \(W_p^p(\mu ,\nu )\le \sum _k |a_k - b_k|\sup _{i,j}\|x_i - x_j\|{ }^p\). Let […] *δ*)^{n−1} elements, and any measure supported on {*x*_{1}, …, *x*_{n}} can be approximated by a measure in *μ*_{n,𝜖,δ} with error 2*L*(*nδ*)^{1∕p}.

**Step 4: Conclusion.** Let *L* = *f*^{−1}(1∕*𝜖*^{p}), *n* = *N*(*𝜖*; *B*_{L}(0)) and *δ* = [*𝜖*∕(2*L*)]^{p}∕*n*. Combining the previous three steps, we obtain in the case *L* ≥ *𝜖d* that […] *L*∕*𝜖* ≥ 1 and *θ*_{d} ≥ 5. Conclude that […]

## 2.3 The Tangent Bundle

Although the Wasserstein space \(\mathcal W_p(\mathcal X)\) is non-linear in terms of measures, it *is* linear in terms of maps. Indeed, if \(\mu \in \mathcal W_p(\mathcal X)\) and \(T_i:\mathcal X\to \mathcal X\) are such that ∥*T*_{i}∥∈ *L*_{p}(*μ*), then \((\alpha T_1 + \beta T_2)\#\mu \in \mathcal W_p(\mathcal X)\) for all \(\alpha ,\beta \in \mathbb {R}\). Later, in Sect. 2.4, we shall see that \(\mathcal W_p(\mathcal X)\) is in fact homeomorphic to a subset of the space of such functions. The goal of this section is to exploit the linearity of the latter in order to define the tangent bundle of \(\mathcal W_p\). This in particular will be used for deriving differentiability properties of the Wasserstein distance in Sect. 3.1.6. However, the latter can be understood at a purely analytic level, and readers uncomfortable with differential geometry can access most of the rest of the monograph without reference to this section.

We assume here that \(\mathcal X\) is a Hilbert space and that *p* = 2; the results extend to any *p* > 1. Absolutely continuous measures are assumed to be so with respect to Lebesgue measure if \(\mathcal X=\mathbb {R}^d\) and otherwise refer to Definition 1.6.4.

### 2.3.1 Geodesics, the Log Map and the Exponential Map in \(\mathcal W_2(\mathcal X)\)

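Let \(\gamma \in \mathcal W_2(\mathcal X)\) be absolutely continuous and let \(\mu \in \mathcal W_2(\mathcal X)\) be arbitrary, so that there is an optimal map \({\mathbf {t}}_{\gamma }^{\mu }\) from *γ* to *μ*. With \(\mathbf i:\mathcal X\to \mathcal X\) denoting the identity map, define the curve

$$\displaystyle \begin{aligned} \gamma_t=\left[\mathbf i+t({\mathbf{t}}_{\gamma}^{\mu}-\mathbf i)\right]\#\gamma,\qquad t\in[0,1]. \end{aligned}$$

Then *γ*_{0} = *γ*, *γ*_{1} = *μ* and from (2.3),

$$\displaystyle \begin{aligned} W_2(\gamma_t,\gamma_s)\le|t-s|\left(\int_{\mathcal X}\|{\mathbf{t}}_{\gamma}^{\mu}(x)-x\|{}^2\,\mathrm{d}\gamma(x)\right)^{1/2}=|t-s|\,W_2(\gamma,\mu),\qquad s,t\in[0,1]. \end{aligned}$$

By the triangle inequality applied to *γ*, *γ*_{s}, *γ*_{t} and *μ*, these inequalities must hold as equalities. In other words, this curve (McCann’s interpolant) is a *constant-speed geodesic* in \(\mathcal W_2(\mathcal X)\).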

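In view of this, one defines the *tangent space* of \(\mathcal W_2(\mathcal X)\) at *μ* as (Ambrosio et al. [12, Definition 8.5.1])

$$\displaystyle \begin{aligned} \mathrm{Tan}_\mu=\overline{\left\{t(\mathbf s-\mathbf i):\ \mathbf s={\mathbf{t}}_{\mu}^{\nu}\text{ for some }\nu\in\mathcal W_2(\mathcal X);\ t>0\right\}}^{L_2(\mu)}. \end{aligned}$$

It follows from the definition that Tan_{μ} ⊆ *L*_{2}(*μ*). (Strictly speaking, Tan_{μ} is a subset of the space of functions \(f:\mathcal X\to \mathcal X\) such that ∥*f*∥∈ *L*_{2}(*μ*) rather than *L*_{2}(*μ*) itself, as in Definition 2.4.3, but we will write *L*_{2} for simplicity.)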

[…] *L*_{2}(*μ*), and for **t** Lipschitz the negative of a tangent element […] **s** can be seen to belong to the subgradient of a convex function by definition of *s*. This also shows that Tan_{μ} can be seen to be the *L*_{2}(*μ*)-closure of all gradients of \(C_c^\infty \) functions. We refer to [12, Definition 8.4.1 and Theorem 8.5.1] for the proof and extensions to other values of *p* > 1 and to infinite dimensions, using cylindrical functions that depend on finitely many coordinates [12, Definition 5.1.11]. The alternative definition highlights that it is essentially the inner product in Tan_{μ}, but not the elements themselves, that depends on *μ*.

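The exponential map at \(\gamma \in \mathcal W_2(\mathcal X)\) is the restriction to Tan_{γ} of the mapping that sends **r** ∈ *L*_{2}(*γ*) to \([\mathbf r+\mathbf i]\#\gamma \in \mathcal W_2(\mathcal X)\). More explicitly, \({\exp }_\gamma :\mathrm {Tan}_\gamma \to \mathcal W_2 \) takes the form

$$\displaystyle \begin{aligned} {\exp}_\gamma(t(\mathbf s-\mathbf i))=\left[t\mathbf s+(1-t)\mathbf i\right]\#\gamma. \end{aligned}$$

When *γ* is absolutely continuous, exp_{γ} is surjective, as can be seen from its right inverse, the log map

$$\displaystyle \begin{aligned} {\log}_\gamma:\mathcal W_2\to\mathrm{Tan}_\gamma,\qquad {\log}_\gamma(\mu)={\mathbf{t}}_{\gamma}^{\mu}-\mathbf i. \end{aligned}$$

The geodesic above can also be constructed as

$$\displaystyle \begin{aligned} \left[t\,p_2+(1-t)\,p_1\right]\#\pi,\qquad p_1(x,y)=x,\quad p_2(x,y)=y,\quad t\in[0,1], \end{aligned}$$

where *π* is any optimal transport plan between *γ* and *μ*. This is defined for arbitrary measures \(\gamma ,\mu \in \mathcal W_2\), and reduces to the previous definition if *γ* is absolutely continuous. It is shown in Ambrosio et al. [12, Chapter 7] or Santambrogio [119, Proposition 5.32] that these are the only constant-speed geodesics in \(\mathcal W_2\).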
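The constant-speed property is easy to verify for discrete measures on the real line. The sketch below assumes two uniform measures on four points each (values chosen arbitrarily) and uses the plan-based form of the interpolant, since discrete measures are not absolutely continuous: the optimal plan pairs the sorted support points. The helper names are illustrative.

```python
def w2_sorted(xs, ys):
    """W_2 between two equal-size uniform discrete measures on the line."""
    pairs = zip(sorted(xs), sorted(ys))
    return (sum((x - y) ** 2 for x, y in pairs) / len(xs)) ** 0.5


def interpolant(xs, ys, t):
    """Support points of [t p_2 + (1 - t) p_1]#pi for the sorted (optimal) plan pi."""
    return [(1.0 - t) * x + t * y for x, y in zip(sorted(xs), sorted(ys))]


gamma = [0.0, 1.0, 2.0, 5.0]
mu = [3.0, 3.5, 10.0, 4.0]
total = w2_sorted(gamma, mu)

for s, t in [(0.0, 1.0), (0.25, 0.75), (0.1, 0.4)]:
    dist = w2_sorted(interpolant(gamma, mu, s), interpolant(gamma, mu, t))
    # Constant speed: the distance along the curve is |t - s| times the total length.
    print(s, t, dist, abs(t - s) * total)
```

Each printed pair of numbers should agree up to rounding, since a convex combination of two sorted lists stays sorted and the sorted coupling therefore remains optimal along the curve.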

### 2.3.2 Curvature and Compatibility of Measures

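Let \(\gamma ,\mu ,\nu \in \mathcal W_2(\mathcal X)\) with *γ* absolutely continuous. Then by (2.3)

$$\displaystyle \begin{aligned} W_2^2(\mu,\nu)\le\int_{\mathcal X}\|{\mathbf{t}}_{\gamma}^{\mu}(x)-{\mathbf{t}}_{\gamma}^{\nu}(x)\|{}^2\,\mathrm{d}\gamma(x)=\|{\log}_\gamma(\mu)-{\log}_\gamma(\nu)\|{}^2. \end{aligned}$$

In other words, the distance between *μ* and *ν* is smaller in \(\mathcal W_2(\mathcal X)\) than the distance between the corresponding vectors log_{γ}(*μ*) and log_{γ}(*ν*) in the tangent space Tan_{γ}. In the terminology of differential geometry, this means that the Wasserstein space has *nonnegative sectional curvature* at any absolutely continuous *γ*.

It is instructive to see when equality holds. If *ν* is also absolutely continuous, a change of variables rewrites the integral above as \(\int \|{\mathbf {t}}_{\gamma }^{\mu }({\mathbf {t}}_{\nu }^{\gamma }(x))-x\|{ }^2\,\mathrm {d}\nu (x)\). Since \({\mathbf {t}}_{\gamma }^{\mu }\circ {\mathbf {t}}_{\nu }^{\gamma }\) pushes forward *ν* to *μ*, equality holds if and only if \({\mathbf {t}}_{\gamma }^{\mu }\circ {\mathbf {t}}_{\nu }^{\gamma }={\mathbf {t}}_{\nu }^{\mu }\). This motivates the following definition.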

### Definition 2.3.1 (Compatible Measures)

*A collection of absolutely continuous measures* \(\mathcal C\subseteq \mathcal W_2(\mathcal X)\) *is* compatible *if for all* \(\gamma ,\mu ,\nu \in \mathcal C\)*, we have* \({\mathbf {t}}_{\gamma }^{\mu }\circ {\mathbf {t}}_{\nu }^{\gamma }={\mathbf {t}}_{\nu }^{\mu }\) *(in L*_{2}(*ν)).*

### Remark 2.3.2

*The absolute continuity is not necessary and was introduced for notational simplicity. A more general definition that applies to general measures is the following: every finite subcollection of* \(\mathcal C\) *admits an optimal multicoupling whose relevant projections are simultaneously pairwise optimal; see the paragraph preceding Theorem* *3.1.9*.

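When \(\mathcal X=\mathbb {R}\), every collection of absolutely continuous measures is compatible: one can transport *μ* to *ν* via the uniform distribution Leb|_{[0,1]} (see Sect. 1.5). Letting \(F_\mu ^{-1}\) and \(F_\nu ^{-1}\) denote the quantile functions, we have seen that

$$\displaystyle \begin{aligned} W_2(\mu,\nu)=\|F_\mu^{-1}-F_\nu^{-1}\|{}_{L_2(0,1)}. \end{aligned}$$

(In this specific case, the analogous equality holds for all *p* ≥ 1 and not only for *p* = 2.) In other words, \(\mu \mapsto F_\mu ^{-1}\) is an *isometry* from \(\mathcal W_2(\mathbb {R})\) to the subset of *L*_{2}(0, 1) formed by (equivalence classes of) left-continuous nondecreasing functions on (0, 1). Since this is a convex subset of a Hilbert space, this property provides a very simple way to evaluate Fréchet means in \(\mathcal W_2(\mathbb {R})\) (see Sect. 3.1). If *γ* = Leb|_{[0,1]}, then \(F_\mu ^{-1}={\mathbf {t}}_{\gamma }^{\mu }\) for all *μ*, so we can write the above equality as

$$\displaystyle \begin{aligned} W_2^2(\mu,\nu)=\|{\mathbf{t}}_{\gamma}^{\mu}-{\mathbf{t}}_{\gamma}^{\nu}\|{}^2_{L_2(\gamma)}=\|{\log}_\gamma(\mu)-{\log}_\gamma(\nu)\|{}^2, \end{aligned}$$

so that \(\mathcal W_2(\mathbb {R})\) is essentially *flat* (has zero sectional curvature).

The same identity holds on any compatible collection \(\mathcal C\): fixing \(\gamma \in \mathcal C\), we have \(W_2(\mu ,\nu )=\|{\mathbf {t}}_{\gamma }^{\mu }-{\mathbf {t}}_{\gamma }^{\nu }\|{ }_{L_2(\gamma )}\) for all \(\mu ,\nu \in \mathcal C\), so that \(\mathcal C\) can be identified with a subset of *L*_{2}(*γ*). Generalising the one-dimensional case, we shall see that this allows for easy calculations of Fréchet means by means of averaging transport maps (Theorem 3.1.9).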
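The quantile isometry on the real line lends itself to a quick numerical check. The sketch below takes two Gaussian measures (parameters chosen arbitrarily) and approximates the *L*_{2}(0, 1) distance between their quantile functions by a midpoint rule. For Gaussians on the line the quantile function is *m* + *σz*(*u*), with *z* the standard normal quantile, so the exact value is the square root of (*m*_{1} − *m*_{2})² + (*σ*_{1} − *σ*_{2})². The function name and grid size are assumptions of this sketch.

```python
from statistics import NormalDist


def w2_via_quantiles(mu, nu, grid=20000):
    """Midpoint-rule approximation of the L_2(0, 1) distance between quantile functions."""
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) / grid
        total += (mu.inv_cdf(u) - nu.inv_cdf(u)) ** 2
    return (total / grid) ** 0.5


mu = NormalDist(mu=0.0, sigma=1.0)
nu = NormalDist(mu=3.0, sigma=2.0)

approx = w2_via_quantiles(mu, nu)
exact = ((0.0 - 3.0) ** 2 + (1.0 - 2.0) ** 2) ** 0.5  # closed form for Gaussians on the line

print(approx, exact)
```

The two numbers should agree to a few decimal places; the small discrepancy comes from the quadrature near the endpoints of (0, 1), where the quantile functions are unbounded.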

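*Example: Gaussian compatible measures*. The Gaussian case presented in Sect. 1.6.3 is helpful in shedding light on the structure imposed by the compatibility condition. Let \(\gamma \in \mathcal W_2(\mathbb {R}^d)\) be a standard Gaussian distribution with identity covariance matrix. Let *Σ*_{μ} denote the covariance matrix of a measure \(\mu \in \mathcal W_2(\mathbb {R}^d)\). When *μ* and *ν* are centred nondegenerate Gaussian measures,

$$\displaystyle \begin{aligned} {\mathbf{t}}_{\gamma}^{\mu}=\varSigma_\mu^{1/2},\qquad {\mathbf{t}}_{\gamma}^{\nu}=\varSigma_\nu^{1/2},\qquad {\mathbf{t}}_{\mu}^{\nu}=\varSigma_\mu^{-1/2}\left[\varSigma_\mu^{1/2}\varSigma_\nu\varSigma_\mu^{1/2}\right]^{1/2}\varSigma_\mu^{-1/2}, \end{aligned}$$

so that *γ*, *μ*, and *ν* are compatible if and only if

$$\displaystyle \begin{aligned} {\mathbf{t}}_{\mu}^{\nu}={\mathbf{t}}_{\gamma}^{\nu}\circ{\mathbf{t}}_{\mu}^{\gamma}=\varSigma_\nu^{1/2}\varSigma_\mu^{-1/2}. \end{aligned}$$

Since the matrix on the left-hand side is symmetric, this holds if and only if \(\varSigma _\nu ^{1/2}\) and \(\varSigma _\mu ^{-1/2}\) commute (if *A* and *B* are symmetric, then *AB* is symmetric if and only if *AB* = *BA*), or equivalently, if and only if *Σ*_{ν} and *Σ*_{μ} commute. We see that a collection \(\mathcal C\) of Gaussian measures on \(\mathbb {R}^d\) that includes the standard Gaussian distribution is compatible if and only if all the covariance matrices of the measures in \(\mathcal C\) are *simultaneously diagonalisable*. In other words, there exists an orthogonal matrix *U* such that *D*_{μ} = *UΣ*_{μ}*U*^{t} is diagonal for all \(\mu \in \mathcal C\). In that case, formula (1.6) […] *x*↦*ax* on \(\mathbb {R}\), the optimal maps take the “orthogonal separable” form […] *d* nondecreasing real-valued functions. This is yet another crystallisation of the one-dimensional-like structure of compatible measures.

More generally, suppose that the optimal maps are differentiable. Differentiating the compatibility relation \({\mathbf {t}}_{\gamma }^{\nu }\circ {\mathbf {t}}_{\mu }^{\gamma }={\mathbf {t}}_{\mu }^{\nu }\) gives

$$\displaystyle \begin{aligned} \nabla{\mathbf{t}}_{\gamma}^{\nu}({\mathbf{t}}_{\mu}^{\gamma}(x))\,\nabla{\mathbf{t}}_{\mu}^{\gamma}(x)=\nabla{\mathbf{t}}_{\mu}^{\nu}(x). \end{aligned}$$

Optimal maps are gradients of convex functions, so these matrices are symmetric and positive semidefinite, and a product of two such matrices is symmetric if and only if they commute. In this differentiable setting, compatibility is thus equivalent to commutativity of the matrices \(\nabla {\mathbf {t}}_{\gamma }^{\nu }({\mathbf {t}}_{\mu }^{\gamma }(x))\) and \(\nabla {\mathbf {t}}_{\mu }^{\gamma }(x)\) for *μ*-almost all *x*. In the Gaussian case, the optimal maps are linear functions, so *x* does not appear in the matrices.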
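The commutation criterion for Gaussian measures can be tested directly on covariance matrices. The following sketch works with 2 × 2 matrices stored as nested lists; the particular matrices and the helper names are illustrative. Two covariances built from the same orthogonal matrix *U* commute, whereas a generic third covariance does not commute with them.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def commute(a, b, tol=1e-9):
    """True if ab = ba up to the tolerance tol."""
    ab, ba = matmul(a, b), matmul(b, a)
    return all(abs(ab[i][j] - ba[i][j]) <= tol for i in range(2) for j in range(2))


def rotate(diag, c, s):
    """U^t D U for the rotation U = [[c, s], [-s, c]] (requires c^2 + s^2 = 1)."""
    u = [[c, s], [-s, c]]
    ut = [[c, -s], [s, c]]
    d = [[diag[0], 0.0], [0.0, diag[1]]]
    return matmul(matmul(ut, d), u)


c, s = 0.6, 0.8
sigma_mu = rotate((1.0, 4.0), c, s)    # same eigenvectors ...
sigma_nu = rotate((9.0, 0.25), c, s)   # ... different eigenvalues
sigma_rho = [[2.0, 1.0], [1.0, 3.0]]   # covariance with other eigenvectors

print(commute(sigma_mu, sigma_nu))   # simultaneously diagonalisable
print(commute(sigma_mu, sigma_rho))  # not simultaneously diagonalisable
```

In the terminology of the example, the centred Gaussians with covariances `sigma_mu` and `sigma_nu` are compatible with the standard Gaussian, while adding the one with covariance `sigma_rho` breaks compatibility.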

Here are some examples of compatible measures. It will be convenient to describe them using the optimal maps from a reference measure \(\gamma \in \mathcal W_2(\mathbb {R}^d)\). Define \(\mathcal C=\mathbf {t}\#\gamma \) with **t** belonging to one of the following families. The first imposes the one-dimensional structure by varying only the behaviour of the norm of *x*, while the second allows for separation of variables that splits the *d*-dimensional problem into *d* one-dimensional ones.

*Radial transformations.* Consider the collection of functions \(\mathbf {t}:\mathbb {R}^d\to \mathbb {R}^d\) of the form **t**(*x*) = *xG*(∥*x*∥) with \(G:\mathbb {R}_+\to \mathbb {R}\) differentiable. Then a straightforward calculation shows that

$$\displaystyle \begin{aligned} \nabla\mathbf t(x)=G(\|x\|)\,I+\frac{G'(\|x\|)}{\|x\|}\,xx^t. \end{aligned}$$

Since both *I* and *xx*^{t} are positive semidefinite, the above matrix is so if both *G* and *G′* are nonnegative. If **s**(*x*) = *xH*(∥*x*∥) is a function of the same form, then **s**(**t**(*x*)) = *xG*(∥*x*∥)*H*(∥*x*∥*G*(∥*x*∥)) which belongs to that family of functions (since *G* is nonnegative). Clearly

$$\displaystyle \begin{aligned} \nabla\mathbf s(\mathbf t(x))=H(\|x\|G(\|x\|))\,I+\frac{H'(\|x\|G(\|x\|))\,G(\|x\|)}{\|x\|}\,xx^t \end{aligned}$$

commutes with ∇**t**(*x*), since both matrices are of the form *aI* + *bxx*^{t} with *a*, *b* scalars (that depend on *x*). In order to be able to change the base measure *γ*, we need to check that the inverses belong to the family. But if *y* = **t**(*x*), then *x* = *ay* for some scalar *a* that solves the equation

$$\displaystyle \begin{aligned} a\,G(a\|y\|)=1. \end{aligned}$$

Such *a* is guaranteed to be unique if *a*↦*aG*(*a*) is strictly increasing and it will exist (for *y* in the range of **t**) if it is continuous. As a matter of fact, since the eigenvalues of ∇**t**(*x*) are *G*(*a*) and

$$\displaystyle \begin{aligned} G(a)+aG'(a)=(aG(a))',\qquad a=\|x\|, \end{aligned}$$

the condition that *a*↦*aG*(*a*) is strictly increasing is sufficient (this is weaker than *G* itself increasing). Finally, differentiability of *G* is not required, so it is enough if *G* is continuous and *aG*(*a*) is strictly increasing.
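The gradient of a radial map can be checked by finite differences. The sketch below works in the plane, with the hypothetical choice *G*(*r*) = 1 + *r*², and compares a central-difference Jacobian of **t**(*x*) = *xG*(∥*x*∥) with the closed form *G*(∥*x*∥)*I* + [*G′*(∥*x*∥)∕∥*x*∥]*xx*^{t} obtained from the product and chain rules. All function names are illustrative.

```python
def radial_map(x, G):
    """t(x) = x G(|x|) for x in R^2."""
    r = (x[0] ** 2 + x[1] ** 2) ** 0.5
    return [x[0] * G(r), x[1] * G(r)]


def jacobian_formula(x, G, dG):
    """G(r) I + (G'(r) / r) x x^t for x in R^2 with r = |x| > 0."""
    r = (x[0] ** 2 + x[1] ** 2) ** 0.5
    return [[G(r) * (i == j) + dG(r) / r * x[i] * x[j] for j in range(2)] for i in range(2)]


def jacobian_numeric(x, G, h=1e-6):
    """Central finite differences of radial_map, column by column."""
    jac = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        tp, tm = radial_map(xp, G), radial_map(xm, G)
        for i in range(2):
            jac[i][j] = (tp[i] - tm[i]) / (2 * h)
    return jac


G = lambda r: 1.0 + r ** 2   # hypothetical choice with G >= 0 and G' >= 0
dG = lambda r: 2.0 * r
x = [0.3, -1.2]

print(jacobian_numeric(x, G))
print(jacobian_formula(x, G, dG))
```

Both matrices should be symmetric and agree up to discretisation error, and with this choice of *G* they are positive definite, as expected for the gradient of an optimal map.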

*Separable variables.* Consider the collection of functions \(\mathbf {t}:\mathbb {R}^d\to \mathbb {R}^d\) of the form

$$\displaystyle \begin{aligned} \mathbf t(x_1,\dots,x_d)=(T_1(x_1),\dots,T_d(x_d)), \end{aligned} $$(2.8)

with *T*_{i} continuous and strictly increasing. This is a generalisation of the compatible Gaussian case discussed above in which all the *T*_{i}’s were linear. Here, it is obvious that elements in this family are optimal maps and that the family is closed under inverses and composition, so that compatibility follows immediately.

Measures related by such separable maps share a *common dependence structure*. More precisely, we say that *C* : [0, 1]^{d} → [0, 1] is a *copula* if *C* is (the restriction of) a distribution function of a random vector having uniform margins. In other words, if there is a random vector *V* = (*V*_{1}, …, *V*_{d}) with \(\mathbb {P}(V_j\le a)=a\) for all *a* ∈ [0, 1] and all *j* = 1, …, *d*, and

$$\displaystyle \begin{aligned} \mathbb{P}(V_1\le v_1,\dots,V_d\le v_d)=C(v_1,\dots,v_d),\qquad v_j\in[0,1]. \end{aligned}$$

To any *d*-dimensional probability measure *μ*, one can assign a copula *C* = *C*_{μ} in terms of the distribution function *G* of *μ* and its marginals *G*_{j} as

$$\displaystyle \begin{aligned} G(a_1,\dots,a_d)=C(G_1(a_1),\dots,G_d(a_d)). \end{aligned}$$

If each *G*_{j} is surjective on (0, 1), which is equivalent to it being continuous, then this equation defines *C* uniquely on (0, 1)^{d}, and consequently on [0, 1]^{d}. If some marginal *G*_{j} is not continuous, then uniqueness is lost, but *C* still exists [97, Chapter 2]. The connection of copulae to compatibility becomes clear in the following lemma, proven on page 51 in the supplement.

### Lemma 2.3.3 (Compatibility and Copulae)

*The copulae associated with absolutely continuous measures* \(\mu ,\nu \in \mathcal W_2(\mathbb {R}^d)\) *are equal if and only if* \({\mathbf {t}}_{\mu }^{\nu }\) *takes the separable form* (2.8).
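One direction of the lemma has a simple empirical counterpart: the empirical copula of a sample depends on the data only through the coordinate-wise ranks, and strictly increasing coordinate-wise maps leave these ranks unchanged. The sketch below illustrates this on a simulated bivariate sample; the data-generating choices, the maps *T*_{1}, *T*_{2} and the helper `ranks` are assumptions of the example.

```python
import math
import random


def ranks(values):
    """Rank (1..n) of each entry; ties are not expected for continuous data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


rng = random.Random(1)
x1 = [rng.gauss(0.0, 1.0) for _ in range(200)]
x2 = [0.5 * a + rng.gauss(0.0, 1.0) for a in x1]  # dependent second coordinate

# Separable map t(x1, x2) = (T1(x1), T2(x2)) with T1, T2 strictly increasing.
y1 = [math.exp(a) for a in x1]
y2 = [b ** 3 + b for b in x2]

print(ranks(x1) == ranks(y1), ranks(x2) == ranks(y2))
```

Both comparisons should be true, so the original and the transformed samples have the same empirical copula even though their marginals differ.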

*Composition with linear functions.* If \(\phi :\mathbb {R}^d\to \mathbb {R}\) is convex with gradient **t** and *A* is a *d* × *d* matrix, then the gradient of the convex function *x*↦*ϕ*(*Ax*) at *x* is **t**_{A} = *A*^{t}**t**(*Ax*). Suppose *ψ* is another convex function with gradient **s** and that compatibility holds, i.e., ∇**s**(**t**(*x*)) commutes with ∇**t**(*x*) for all *x*. Then in order for

$$\displaystyle \begin{aligned} \nabla{\mathbf s}_A({\mathbf t}_A(x))=A^t\,\nabla\mathbf s(AA^t\mathbf t(Ax))\,A \qquad \text{and}\qquad \nabla{\mathbf t}_A(x)=A^t\,\nabla\mathbf t(Ax)\,A \end{aligned}$$

to commute, it suffices that *AA*^{t} = *I*, i.e., that *A* be orthogonal. Consequently, if {**t**#*μ*}_{**t** ∈ T} are compatible, then so are {**t**_{U}#*μ*}_{**t** ∈ T} for any orthogonal matrix *U*.

## 2.4 Random Measures in Wasserstein Space

Let *μ* be a fixed absolutely continuous probability measure in \(\mathcal W_2(\mathcal X)\). If \(\varLambda \in \mathcal W_2(\mathcal X)\) is another probability measure, then the transport map \({\mathbf {t}}_{\mu }^{\varLambda }\) and the convex potential are functions of *Λ*. If *Λ* is now random, then we would like to be able to make probability statements about them. To this end, it needs to be shown that \({\mathbf {t}}_{\mu }^{\varLambda }\) and the convex potential are *measurable* functions of *Λ*. The goal of this section is to develop a rigorous mathematical framework that justifies such probability statements. We show that all the relevant quantities are indeed measurable, and in particular establish a Fubini-type result in Proposition 2.4.9. This technical section may be skipped at first reading.

Here is an example of a measurability result (Villani [125, Corollary 5.22]). Recall that \(P(\mathcal X)\) is the space of Borel probability measures on \(\mathcal X\), endowed with the topology of weak convergence that makes it a metric space. Let \(\mathcal X\) be a complete separable metric space and \(c:\mathcal X^2\to \mathbb {R}_+\) a continuous cost function. Let \((\varOmega ,\mathcal F,\mathbb {P})\) be a probability space and \(\varLambda ,\kappa :\varOmega \to P(\mathcal X)\) be measurable maps. Then there exists a *measurable selection* of optimal transference plans. That is, a measurable \(\pi :\varOmega \to P(\mathcal X^2)\) such that *π*(*ω*) ∈ *Π*(*Λ*(*ω*), *κ*(*ω*)) is optimal for all *ω* ∈ *Ω*.

Although this result is very general, it only provides information about *π*. If *π* is induced from a map *T*, it is not obvious how to construct *T* from *π* in a measurable way; we will therefore follow a different path. In order to (almost) have a self-contained exposition, we work in a somewhat simplified setting that nevertheless suffices for the sequel. At least in the Euclidean case \(\mathcal X=\mathbb {R}^d\), more general measurability results in the flavour of this section can be found in Fontbona et al. [53]. On the other hand, we will not need to appeal to abstract measurable selection theorems as in [53, 125].

### 2.4.1 Measurability of Measures and of Optimal Maps

Let \(\mathcal X\) be a separable Banach space. (Most of the results below hold for any complete separable metric space but we will avoid this generality for brevity and simpler notation). The Wasserstein space \(\mathcal W_p(\mathcal X)\) is a metric space for any *p* ≥ 1. We can thus define:

### Definition 2.4.1 (Random Measure)

*A random measure Λ is any measurable map from a probability space* \((\varOmega ,\mathcal F,\mathbb {P})\) *to* \(\mathcal W_p(\mathcal X)\) *, endowed with its Borel σ-algebra.*

In what follows, whenever we call something random, we mean that it is measurable as a map from some generic unspecified probability space.

### Lemma 2.4.2

*A random measure Λ is measurable if and only if it is measurable with respect to the induced weak topology.*

Since both topologies are Polish, this follows from abstract measure-theoretic results (Fremlin [57, Paragraph 423F]). We give an elementary proof on page 53 of the supplement.

Optimal maps are functions from \(\mathcal X\) to itself. In order to define random optimal maps, we need to define a topology and a *σ*-algebra on the space of such functions.

### Definition 2.4.3 (The Space \(\mathcal L_p(\mu )\))

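*Let* \(\mathcal X\) *be a Banach space and μ a Borel measure on* \(\mathcal X\)*. Then the space* \(\mathcal L_p(\mu )\) *is the space of measurable functions* \(f:\mathcal X\to \mathcal X\) *such that*

\[\|f\|_{\mathcal L_p(\mu )} = \left(\int_{\mathcal X} \|f(x)\|_{\mathcal X}^p \,\mathrm{d}\mu (x)\right)^{1/p} < \infty .\]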

When \(\mathcal X\) is separable, \(\mathcal L_p(\mu )\) is an example of a *Bochner space*, though we will not use this terminology.

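The norm of *f* in \(\mathcal L_p(\mu )\) is the usual \(L_p\) norm of the map \(x\mapsto \|f(x)\|{ }_{\mathcal X}\) from \(\mathcal X\) to \(\mathbb {R}\):

\[\|f\|_{\mathcal L_p(\mu )} = \big \| \, \|f(\cdot )\|_{\mathcal X} \, \big \|_{L_p(\mu )} .\]

The space \(\mathcal L_p(\mu )\) shares a fundamental property of the usual \(L_p\) spaces, namely completeness: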

### Theorem 2.4.4 (Riesz–Fischer)

*The space* \(\mathcal L_p(\mu )\) *is a Banach space.*

The proof, a simple variant of the classical one, is given on page 53 of the supplement.
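For a discrete measure the norm of Definition 2.4.3 reduces to a weighted sum, \(\|f\|_{\mathcal L_p(\mu )}^p = \sum _i w_i \|f(x_i)\|^p\) when \(\mu = \sum _i w_i\delta _{x_i}\). The sketch below computes it under that assumption, with \(\mathcal X = \mathbb {R}^d\) and the Euclidean norm; the function name `lp_mu_norm` and the sample atoms are hypothetical.

```python
import numpy as np

def lp_mu_norm(f, atoms, weights, p=2.0):
    """Norm of f in L_p(mu) for the discrete measure mu = sum_i weights[i] * delta_{atoms[i]}.

    f maps an (n, d) array of points to an (n, d) array of images.
    """
    values = f(atoms)                               # f(x_i), shape (n, d)
    pointwise = np.linalg.norm(values, axis=1)      # real-valued map x -> ||f(x)||
    return float(np.sum(weights * pointwise ** p) ** (1.0 / p))

# Hypothetical discrete measure on R^2 with three atoms.
atoms = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
weights = np.array([0.5, 0.25, 0.25])

def identity(x):
    return x

def shift(x):
    return x + 1.0

def combined(x):
    return identity(x) + shift(x)

identity_norm = lp_mu_norm(identity, atoms, weights, p=2.0)
shift_norm = lp_mu_norm(shift, atoms, weights, p=2.0)
sum_norm = lp_mu_norm(combined, atoms, weights, p=2.0)
```

The norm of the identity map is the square root of the second moment of *μ*, and the triangle inequality `sum_norm <= identity_norm + shift_norm` holds as it must in a normed space.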

Random maps lead naturally to random measures:

### Lemma 2.4.5 (Push-Forward with Random Maps)

*Let* \(\mu \in \mathcal W_p(\mathcal X)\) *and let* **t** *be a random map in* \(\mathcal L_p(\mu )\)*. Then* \({\mathbf t}\mapsto {\mathbf t}\#\mu \) *is a continuous mapping from* \(\mathcal L_p(\mu )\) *to* \(\mathcal W_p(\mathcal X)\)*, hence* \(\varLambda = {\mathbf t}\#\mu \) *is a random measure.*

### Proof

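That *Λ* takes values in \(\mathcal W_p\) follows from a change of variables:

\[\int_{\mathcal X} \|x\|^p \,\mathrm{d}\varLambda (x) = \int_{\mathcal X} \|{\mathbf t}(x)\|^p \,\mathrm{d}\mu (x) = \|{\mathbf t}\|_{\mathcal L_p(\mu )}^p < \infty .\]

Moreover, if **s** is another map in \(\mathcal L_p(\mu )\), then \(({\mathbf s},{\mathbf t})\#\mu \) is a coupling of \({\mathbf s}\#\mu \) and \({\mathbf t}\#\mu \), so that \(W_p({\mathbf s}\#\mu ,{\mathbf t}\#\mu )\le \|{\mathbf s}-{\mathbf t}\|_{\mathcal L_p(\mu )}\). Hence *Λ* is a continuous (in fact, 1-Lipschitz) function of **t**.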
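The 1-Lipschitz property can be observed numerically on the real line. The sketch below assumes that *μ* is uniform on finitely many atoms, in which case both push-forwards are uniform empirical measures and their Wasserstein distance is given by the sorted (monotone) coupling. The maps `s` and `t` and both function names are illustrative.

```python
import numpy as np

def wasserstein_1d_uniform(a, b, p=2.0):
    """W_p between two uniform empirical measures on the real line with equally many atoms.

    In one dimension the monotone (sorted) coupling is optimal for p >= 1.
    """
    a_sorted, b_sorted = np.sort(a), np.sort(b)
    return float(np.mean(np.abs(a_sorted - b_sorted) ** p) ** (1.0 / p))

def lp_distance(s_vals, t_vals, p=2.0):
    """||s - t|| in L_p(mu) when mu is uniform on the atoms where s and t were evaluated."""
    return float(np.mean(np.abs(s_vals - t_vals) ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
x = rng.standard_normal(500)      # atoms of mu, each with weight 1/500
s_vals = np.sin(3.0 * x)          # a non-monotone map s evaluated at the atoms
t_vals = x ** 3 - x               # another map t evaluated at the atoms

w = wasserstein_1d_uniform(s_vals, t_vals)   # W_2(s#mu, t#mu)
d = lp_distance(s_vals, t_vals)              # ||s - t|| in L_2(mu)
```

Here `w <= d` always holds, because pairing **s**(*x*_i) with **t**(*x*_i) is one admissible coupling while the sorted pairing is optimal.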

Conversely, **t** is a continuous function of *Λ*:

### Lemma 2.4.6 (Measurability of Transport Maps)

*Let Λ be a random measure in* \(\mathcal W_p(\mathcal X)\) *and let* \(\mu \in \mathcal W_p(\mathcal X)\) *such that* \((\mathbf i, {\mathbf {t}}_{\mu }^{\varLambda })\#\mu \) *is the unique optimal coupling of μ and Λ. Then* \(\varLambda \mapsto {\mathbf {t}}_{\mu }^{\varLambda }\) *is a continuous mapping from* \(\mathcal W_p(\mathcal X)\) *to* \(\mathcal L_p(\mu )\)*, so* \({\mathbf {t}}_{\mu }^{\varLambda }\) *is a random element in* \(\mathcal L_p(\mu )\)*. In particular, the result holds if* \(\mathcal X\) *is a separable Hilbert space, p *> 1*, and μ is absolutely continuous.*

### Proof

This result is more subtle than Lemma 2.4.5, since \(\varLambda \mapsto {\mathbf {t}}_{\mu }^{\varLambda }\) is not necessarily Lipschitz. We give here a self-contained proof for the Euclidean case with quadratic cost and *μ* absolutely continuous. The general case builds on Villani [125, Corollary 5.23] and is given on page 54 of the supplement.

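Let \(\varLambda _n\to \varLambda \) in \(\mathcal W_2(\mathbb {R}^d)\) and fix *𝜖* > 0. For any \(S\subseteq \mathbb {R}^d\),

\[\|{\mathbf t}_{\mu }^{\varLambda _n} - {\mathbf t}_{\mu }^{\varLambda }\|_{\mathcal L_2(\mu )}^2 = \int_S \|{\mathbf t}_{\mu }^{\varLambda _n} - {\mathbf t}_{\mu }^{\varLambda }\|^2 \,\mathrm{d}\mu + \int_{\mathbb {R}^d\setminus S} \|{\mathbf t}_{\mu }^{\varLambda _n} - {\mathbf t}_{\mu }^{\varLambda }\|^2 \,\mathrm{d}\mu .\]

Since \(\|a-b\|^p \le 2^p\|a\|^p + 2^p\|b\|^p\), the last integral is no larger than

\[4\int_{\mathbb {R}^d\setminus S} \|{\mathbf t}_{\mu }^{\varLambda _n}\|^2 \,\mathrm{d}\mu + 4\int_{\mathbb {R}^d\setminus S} \|{\mathbf t}_{\mu }^{\varLambda }\|^2 \,\mathrm{d}\mu = 4\int_{{\mathbf t}_{\mu }^{\varLambda _n}(\mathbb {R}^d\setminus S)} \|x\|^2 \,\mathrm{d}\varLambda _n(x) + 4\int_{{\mathbf t}_{\mu }^{\varLambda }(\mathbb {R}^d\setminus S)} \|x\|^2 \,\mathrm{d}\varLambda (x).\]

Since \((\varLambda _n)\) and *Λ* are tight in the Wasserstein space, they must satisfy the absolute uniform continuity (2.7). Let \(\delta = \delta _{\epsilon }\) be as in (2.7), and notice that by the measure preserving property of the optimal maps, the last two integrals are taken on sets of measure 1 − *μ*(*S*). Since *μ* is absolutely continuous, we can find a compact set *S* of *μ*-measure at least 1 − *δ* and on which Proposition 1.7.11 applies (see Corollary 1.7.12), yielding

\[\int_S \|{\mathbf t}_{\mu }^{\varLambda _n} - {\mathbf t}_{\mu }^{\varLambda }\|^2 \,\mathrm{d}\mu \to 0, \qquad n\to \infty .\]

Consequently, \(\limsup _{n\to \infty }\|{\mathbf t}_{\mu }^{\varLambda _n} - {\mathbf t}_{\mu }^{\varLambda }\|_{\mathcal L_2(\mu )}^2\) is bounded by a constant multiple of *𝜖*, and the proof is completed by letting *𝜖* → 0.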
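As a worked illustration of this continuity (a sketch in the simplest case, not part of the proof), take *d* = 1, \(\mu = N(0,1)\), and Gaussian targets \(\varLambda = N(m,\sigma ^2)\) and \(\varLambda ' = N(m',\sigma '^2)\). The optimal maps from *μ* are affine, \({\mathbf t}_{\mu }^{\varLambda }(x) = m + \sigma x\), so that

\[\|{\mathbf t}_{\mu }^{\varLambda '} - {\mathbf t}_{\mu }^{\varLambda }\|_{\mathcal L_2(\mu )}^2 = \int_{\mathbb {R}} \big ((m'-m) + (\sigma '-\sigma )x\big )^2 \,\mathrm{d}\mu (x) = (m'-m)^2 + (\sigma '-\sigma )^2 = W_2^2(\varLambda ,\varLambda ').\]

The cross term vanishes because *μ* has mean zero, and the last equality is the known formula for the 2-Wasserstein distance between Gaussian measures on the real line. Within this family the map \(\varLambda \mapsto {\mathbf t}_{\mu }^{\varLambda }\) is therefore an isometry, which is stronger than the continuity asserted in the lemma.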

In Proposition 5.3.7, we show under some conditions that \(\|{\mathbf {t}}_{\mu }^{\varLambda }\|{ }_{\mathcal L_2(\mu )}\) is a continuous function of *μ*.

### 2.4.2 Random Optimal Maps and Fubini’s Theorem

From now on, we assume that \(\mathcal X\) is a separable Hilbert space and that *p* = 2. The results can most likely be generalised to all *p* > 1 (see Ambrosio et al. [12, Section 10.2]), but we restrict to the quadratic case for simplicity.

Random elements in this subsection take values in \(\mathcal X\) or in \(\mathcal L_2(\theta _0)\), both of which are Banach spaces. There are several (nonequivalent) definitions for integrals in such spaces (Hildebrandt [69]); the one that will be most convenient for our needs is the Bochner integral.

### Definition 2.4.7 (Bochner Integral)

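*Let B be a Banach space and let* \(f:(\varOmega ,\mathcal F, \mathbb {P})\to B\) *be a simple random element taking values in B:*

\[f(\omega ) = \sum_{j=1}^n f_j \mathbf 1\{\omega \in \varOmega _j\}, \qquad \varOmega _j\in \mathcal F,\quad f_j\in B.\]

*Then the Bochner integral (or expectation) of f is defined by*

\[\mathbb {E} f = \sum_{j=1}^n \mathbb {P}(\varOmega _j) f_j \in B.\]

*If f is measurable and there exists a sequence* \(f_n\) *of simple random elements such that* \(\|f_n - f\|\to 0\) *almost surely and* \(\mathbb {E}\|f_n - f\|\to 0\)*, then the Bochner integral of f is defined as the limit*

\[\mathbb {E} f = \lim_{n\to \infty } \mathbb {E} f_n.\]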
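For a simple random element the Bochner expectation is a finite probability-weighted sum of elements of *B*. The sketch below computes it with \(B = \mathbb {R}^d\) as a finite-dimensional stand-in; the function name and the sample values are hypothetical. It also evaluates \(\mathbb {E}\|f\|\), which bounds \(\|\mathbb {E} f\|\) from above by the triangle inequality.

```python
import numpy as np

def bochner_expectation_simple(values, probs):
    """Expectation of a simple random element: sum_j P(Omega_j) * f_j.

    values: array of shape (k, d), the k values f_j in B = R^d.
    probs:  array of shape (k,), the probabilities P(Omega_j), summing to one.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if not np.isclose(probs.sum(), 1.0):
        raise ValueError("probabilities must sum to one")
    return probs @ values

# Hypothetical simple random element with three values in R^2.
values = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
probs = np.array([0.5, 0.3, 0.2])

mean = bochner_expectation_simple(values, probs)                # an element of B
mean_of_norms = float(probs @ np.linalg.norm(values, axis=1))   # E||f||, a real number
```

The result `mean` is again a vector in *B*, in line with the remark that a Bochner integral is always an element of the space, while `mean_of_norms` is an ordinary real expectation.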

The space of functions for which the Bochner integral is defined is the *Bochner space* *L*_{1}(*Ω*;*B*), but we will use neither this terminology nor the notation. It is not difficult to see that Bochner integrals are well-defined: the expectations do not depend on the representation of the simple functions nor on the approximating sequence, and the limit exists in *B* (because it is complete). More on Bochner integrals can be found in Hsing and Eubank [71, Section 2.6] or Dunford et al. [48, Chapter III.6]. A major difference from the real case is that there is no clear notion of “infinity” here: the Bochner integral is always an element of *B*, whereas expectations of real-valued random variables can be defined in \(\mathbb {R}\cup \{\pm \infty \}\). It turns out that separability is quite important in this setting:

### Lemma 2.4.8 (Approximation of Separable Functions)

*Let f *:* Ω *→* B be measurable. Then there exists a sequence of simple functions f*_{n} *such that* ∥*f*_{n}(*ω*) −* f*(*ω*)∥→ 0 *for almost all ω if and only if* \(f(\varOmega \setminus \mathcal N)\) *is separable for some* \(\mathcal N\subseteq \varOmega \) *of probability zero. In that case, f*_{n} *can be chosen so that* ∥*f*_{n}(*ω*)∥≤ 2∥*f*(*ω*)∥ *for all ω *∈* Ω.*

A proof can be found in [48, Lemma III.6.9], or on page 55 of the supplement. Functions satisfying this approximation condition are sometimes called *strongly measurable* or *Bochner measurable*. In view of the lemma, we will call them *separably valued*, since separability of the range is the condition that will need to be checked in order to define their integrals.

Two remarks are in order. Firstly, if *B* itself is separable, then *f*(*Ω*) will obviously be separable. Secondly, the set \(\mathcal N'\subset \varOmega \setminus \mathcal N\) on which the approximating sequence of simple functions does not converge to *f* may fail to be measurable, but must have outer probability zero (it is included in a measurable set of measure zero) [48, Lemma III.6.9]. This can be remedied by assuming that the probability space \((\varOmega ,\mathcal F,\mathbb {P})\) is complete. It will not, however, be necessary to do so, since this measurability issue will not alter the Bochner expectation of *f*.
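One common way to build the approximating sequence of Lemma 2.4.8 is to map *f*(*ω*) to its nearest point among the first *n* elements of a countable dense set whose first element is the origin. The sketch below illustrates this idea for \(B = \mathbb {R}^2\); it is an assumption that this mirrors the cited proofs, and the random points standing in for a dense sequence, as well as the function name, are hypothetical.

```python
import numpy as np

def simple_approximation(f_values, dense_points, n):
    """Approximate each f(omega) by its nearest point among the first n dense points.

    f_values:     (m, d) array, the values f(omega) at m sample outcomes.
    dense_points: (N, d) array whose first row is the origin.
    Returns the (m, d) array of values f_n(omega), taking at most n distinct values.
    """
    candidates = dense_points[:n]
    # Pairwise distances between f(omega) and the candidates, shape (m, n).
    dists = np.linalg.norm(f_values[:, None, :] - candidates[None, :, :], axis=2)
    # argmin returns the first minimiser, which fixes the tie-breaking rule.
    return candidates[np.argmin(dists, axis=1)]

rng = np.random.default_rng(2)
# Hypothetical stand-in for a countable dense subset of R^2, origin first.
dense_points = np.vstack([np.zeros((1, 2)),
                          rng.uniform(-4.0, 4.0, size=(2000, 2))])
f_values = rng.standard_normal((200, 2))    # values f(omega) at 200 outcomes
f_norms = np.linalg.norm(f_values, axis=1)

errors = []
bound_holds = True
for n in (1, 10, 100, 2001):
    f_n = simple_approximation(f_values, dense_points, n)
    errors.append(float(np.max(np.linalg.norm(f_n - f_values, axis=1))))
    # The origin is always a candidate, so ||f_n - f|| <= ||f|| and ||f_n|| <= 2||f||.
    bound_holds &= bool(np.all(np.linalg.norm(f_n, axis=1) <= 2.0 * f_norms + 1e-12))
```

Including the origin among the candidates is what yields the bound \(\|f_n(\omega )\|\le 2\|f(\omega )\|\), and the worst-case error cannot increase with *n* because the candidate sets are nested.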

### Proposition 2.4.9 (Fubini for Optimal Maps)

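*Let Λ be a random measure in* \(\mathcal W_2(\mathcal X)\) *such that* \(\mathbb {E} W_2(\delta _0,\varLambda )<\infty \) *and let* \(\theta _0,\theta \in \mathcal W_2(\mathcal X)\) *such that* \({\mathbf {t}}_{\theta _0}^{\varLambda }\) *and* \({\mathbf {t}}_{\theta _0}^{\theta }\) *exist (and are unique) with probability one. (For example, if* \(\theta _0\) *is absolutely continuous.) Then*

\[\mathbb {E}\left \langle {\mathbf {t}}_{\theta _0}^{\varLambda } - \mathbf i, {\mathbf {t}}_{\theta _0}^{\theta } - \mathbf i\right \rangle _{\mathcal L_2(\theta _0)} = \left \langle \mathbb {E}\,{\mathbf {t}}_{\theta _0}^{\varLambda } - \mathbf i, {\mathbf {t}}_{\theta _0}^{\theta } - \mathbf i\right \rangle _{\mathcal L_2(\theta _0)}.\]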

This holds by linearity when *Λ* is a simple random measure. The general case follows by approximation: the Wasserstein space is separable and so is the space of optimal maps, by Lemma 2.4.6, so we may apply Lemma 2.4.8 and approximate \({\mathbf {t}}_{\theta _0}^{\varLambda }\) by simple maps for which the equality holds by linearity. On page 56 of the supplement, we show that these simple maps can be assumed optimal, and give the full details.
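The linearity argument for simple random measures can be illustrated numerically. The sketch below assumes the Fubini identity takes the form \(\mathbb {E}\langle {\mathbf t}_{\theta _0}^{\varLambda } - \mathbf i, {\mathbf t}_{\theta _0}^{\theta } - \mathbf i\rangle = \langle \mathbb {E}\,{\mathbf t}_{\theta _0}^{\varLambda } - \mathbf i, {\mathbf t}_{\theta _0}^{\theta } - \mathbf i\rangle \) in \(\mathcal L_2(\theta _0)\), and works on the real line with \(\theta _0 = N(0,1)\) and Gaussian targets, for which the optimal maps are affine. All parameter values and function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes and weights for theta_0 = N(0, 1), normalised to sum to one.
nodes, weights = hermegauss(20)
weights = weights / np.sqrt(2.0 * np.pi)

def inner_product(g, h):
    """Inner product of g and h in L_2(theta_0), computed by quadrature."""
    return float(np.sum(weights * g(nodes) * h(nodes)))

def displacement(m, sigma):
    """x -> t(x) - x for the optimal map t(x) = m + sigma * x from N(0, 1) to N(m, sigma^2)."""
    return lambda x: m + sigma * x - x

# A simple random measure: Lambda = N(m_j, sigma_j^2) with probability p_j.
params = [(0.0, 1.0), (2.0, 0.5), (-1.0, 3.0)]
probs = [0.2, 0.5, 0.3]
theta_displacement = displacement(1.0, 2.0)      # fixed measure theta = N(1, 4)

# Left-hand side: expectation of the inner products.
lhs = sum(p * inner_product(displacement(m, s), theta_displacement)
          for p, (m, s) in zip(probs, params))

# Right-hand side: inner product with the mean map, which is again affine here.
mean_m = sum(p * m for p, (m, s) in zip(probs, params))
mean_s = sum(p * s for p, (m, s) in zip(probs, params))
rhs = inner_product(displacement(mean_m, mean_s), theta_displacement)
```

Both sides agree up to rounding, as linearity of the inner product dictates. The general statement requires the approximation argument described above and is not verified by this finite example.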

## 2.5 Bibliographical Notes

Our proof of Theorem 2.2.11 borrows heavily from Bolley et al. [29]. A similar result was obtained by Kloeckner [81], who also provides a lower bound of a similar order.

The origins of Sect. 2.3 can be traced back to the seminal work of Jordan et al. [74], who interpret the Fokker–Planck equation as a gradient flow (where functionals defined on \(\mathcal W_2\) can be differentiated) with respect to the 2-Wasserstein metric. The Riemannian interpretation was (formally) introduced by Otto [99], and rigorously established by Ambrosio et al. [12] and others; see Villani [125, Chapter 15] for further bibliography and more details.

Compatible measures (Definition 2.3.1) were implicitly introduced by Boissard et al. [28] in the context of *admissible optimal maps* where one defines families of gradients of convex functions (*T*_{i}) such that \(T_j^{-1}\circ T_i\) is a gradient of a convex function for any *i* and *j*. For (any) fixed measure \(\gamma \in \mathcal C\), compatibility of \(\mathcal C\) is then equivalent to admissibility of the collection of maps \(\{{\mathbf {t}}_{\gamma }^{\mu }\}_{\mu \in \mathcal { C}}\). The examples we gave are also taken from [28].

Lemma 2.3.3 is from Cuesta-Albertos et al. [38, Theorem 2.9] (see also Zemel and Panaretos [135]).

## References

- 12. L. Ambrosio, N. Gigli, G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics. ETH Zürich, 2nd edn. (Springer, Berlin, 2008)
- 21. P.J. Bickel, D.A. Freedman, Some asymptotic theory for the bootstrap. Ann. Stat. **9**(6), 1196–1217 (1981)
- 25. S. Bobkov, M. Ledoux, One-Dimensional Empirical Measures, Order Statistics and Kantorovich Transport Distances, vol. 261, no. 1259 (Memoirs of the American Mathematical Society, Providence, 2019). https://doi.org/10.1090/memo/1259
- 28. E. Boissard, T. Le Gouic, J.-M. Loubes, Distribution’s template estimate with Wasserstein metrics. Bernoulli **21**(2), 740–759 (2015)
- 29. F. Bolley, A. Guillin, C. Villani, Quantitative concentration inequalities for empirical measures on non-compact spaces. Prob. Theory Rel. Fields **137**, 541–593 (2007)
- 36. P. Clément, W. Desch, An elementary proof of the triangle inequality for the Wasserstein metric. Proc. Amer. Math. Soc. **136**(1), 333–339 (2008)
- 38. J.A. Cuesta-Albertos, L. Rüschendorf, A. Tuero-Diaz, Optimal coupling of multivariate distributions and stochastic processes. J. Multivar. Anal. **46**(2), 335–361 (1993)
- 47. R.M. Dudley, Real Analysis and Probability, vol. 74 (Cambridge University Press, Cambridge, 2002)
- 48. N. Dunford, J.T. Schwartz, W.G. Bade, R.G. Bartle, Linear Operators (Wiley-Interscience, New York, 1971)
- 49. R. Durrett, Probability: Theory and Examples (Cambridge University Press, Cambridge, 2010)
- 53. J. Fontbona, H. Guérin, S. Méléard, Measurability of optimal transportation and strong coupling of martingale measures. Electron. Commun. Probab. **15**, 124–133 (2010)
- 57. D.H. Fremlin, Measure Theory, Vol. 4: Topological Measure Theory (Torres Fremlin, Colchester, 2003)
- 69. T. Hildebrandt, Integration in abstract spaces. Bull. Amer. Math. Soc. **59**(2), 111–139 (1953)
- 71. T. Hsing, R. Eubank, Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators (Wiley, Hoboken, 2015)
- 74. R. Jordan, D. Kinderlehrer, F. Otto, The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. **29**(1), 1–17 (1998)
- 81. B. Kloeckner, A generalization of Hausdorff dimension applied to Hilbert cubes and Wasserstein spaces. J. Topol. Anal. **4**(2), 203–235 (2012)
- 87. T. Le Gouic, J.-M. Loubes, Existence and consistency of Wasserstein barycenters. Prob. Theory Relat. Fields **168**(3–4), 901–917 (2017)
- 93. R.J. McCann, A convexity principle for interacting gases. Adv. Math. **128**(1), 153–179 (1997)
- 97. R.B. Nelsen, An Introduction to Copulas (Springer, Berlin, 2013)
- 99. F. Otto, The geometry of dissipative evolution equations: the porous medium equation. Comm. Part. Differ. Equ. **26**, 101–174 (2001)
- 114. C.A. Rogers, Covering a sphere with spheres. Mathematika **10**(2), 157–164 (1963)
- 119. F. Santambrogio, Optimal Transport for Applied Mathematicians, vol. 87 (Springer, Berlin, 2015)
- 124. C. Villani, Topics in Optimal Transportation (American Mathematical Society, Providence, 2003)
- 125. C. Villani, Optimal Transport: Old and New (Springer, Berlin, 2008)
- 126. M.J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge University Press, Cambridge, 2019)
- 135. Y. Zemel, V.M. Panaretos, Supplement to “Fréchet means and Procrustes analysis in Wasserstein space” (2019)

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.