# Phase Variation and Fréchet Means


## Abstract


Why is it relevant to construct the Fréchet mean of a collection of measures with respect to the Wasserstein metric? A simple answer is that this kind of average will often express a more natural notion of “typical” realisation of a random probability distribution than an arithmetic average.^{1} Much more can be said, however, in that the Wasserstein–Fréchet mean and the closely related notion of an optimal multicoupling arise canonically as the appropriate framework for the formulation and solution to the problem of *separation of amplitude and phase variation of a point process*. It would almost seem that Wasserstein–Fréchet means were “made” for precisely this problem.

Given a random function *Y* = {*Y*(*x*) : *x* ∈ *K*} over a convex compact domain *K*, it can be broadly said that one may distinguish two layers of variation:

- *Amplitude variation*. This is the “classical” variation that one would also encounter in multivariate analysis, and refers to the stochastic fluctuations around a mean level, usually encoded in the covariance kernel, at least up to second order. In short, this is variation “in the *y*-axis” (ordinate).
- *Phase variation*. This is a second layer of non-linear variation peculiar to continuous domain stochastic processes, and is rarely—if ever—encountered in multivariate analysis. It arises as the result of random changes (or deformations) in the time scale (or the spatial domain) of definition of the process. It can be conceptualised as a composition of the stochastic process with a random transformation (warp map) acting on its domain. This is variation “in the *x*-axis” (abscissa).

The separation of these two layers of variation is also known as *registration* (Ramsay and Li [108]), *synchronisation* (Wang and Gasser [129]), or *multireference alignment* (Bandeira et al. [16]), though in some cases these terms refer to a simpler problem where there is no amplitude variation at all.

Phase variation naturally arises in the study of random phenomena where there is no absolute notion of time or space, but every realisation of the phenomenon evolves according to a time scale that is intrinsic to the phenomenon itself, and (unfortunately) unobservable. Processes related to physiological measurements, such as *growth curves* and *neuronal signals*, are usual suspects. Growth curves can be modelled as *continuous random functions (functional data)*, whereas neuronal signals are better modelled as *discrete random measures (point processes)*. We first describe amplitude/phase variation in the former^{2} case, as that is easier to appreciate, before moving on to the latter case, which is the main subject of this chapter.

## 4.1 Amplitude and Phase Variation

### 4.1.1 The Functional Case

Let *K* denote the unit cube \([0,1]^d\subset \mathbb {R}^d\). A real random function *Y* = (*Y* (*x*) : *x* ∈ *K*) can, broadly speaking, have two types of variation. The first, *amplitude variation*, results from *Y* (*x*) being a random variable for every *x* and describes its fluctuations around the mean level \(m(x)=\mathbb {E} Y(x)\), usually encoded by the variance var*Y* (*x*). For this reason, it can be referred to as “variation in the *y*-axis”. More generally, for any finite set *x*_{1}, …, *x*_{n}, the *n* × *n* covariance matrix with entries *κ*(*x*_{i}, *x*_{j}) = cov[*Y* (*x*_{i}), *Y* (*x*_{j})] encapsulates (up to second order) the stochastic deviations of the random vector (*Y* (*x*_{1}), …, *Y* (*x*_{n})) from its mean, in analogy with the multivariate case. Heuristically, one then views amplitude variation as the collection *κ*(*x*, *y*) for *x*, *y* ∈ *K* in a sense we discuss next.

One views *Y* as a random element in the separable Hilbert space *L*_{2}(*K*), assumed to have \(\mathbb {E}\|Y\|{ }^2<\infty \) and continuous sample paths, so that in particular *Y*(*x*) is a random variable for all *x* ∈ *K*. Then the *mean function*

$$\displaystyle \begin{aligned} m(x)=\mathbb{E} Y(x),\qquad x\in K, \end{aligned}$$

and the *covariance kernel*

$$\displaystyle \begin{aligned} \kappa(x,y)=\mathrm{cov}[Y(x),Y(y)],\qquad x,y\in K, \end{aligned}$$

are well defined and continuous, continuity of *κ* amounting to *Y* being *mean-square continuous*: \(\mathbb {E}[Y(y)-Y(x)]^2\to 0\) as *y* → *x*. The kernel *κ* gives rise to the *covariance operator* \(\mathcal R:L_2(K)\to L_2(K)\), defined by

$$\displaystyle \begin{aligned} (\mathcal Rf)(x)={\int_{K} \! {\kappa(x,y)f(y)} \, \mathrm{d}{y}},\qquad f\in L_2(K), \end{aligned}$$

a self-adjoint, nonnegative operator on *L*_{2}(*K*). The justification for this terminology is the observation that when *m* = 0, for all bounded *f*, *g* ∈ *L*_{2}(*K*),

$$\displaystyle \begin{aligned} \langle \mathcal Rf,g\rangle = \mathrm{cov}\big[\langle Y,f\rangle,\langle Y,g\rangle\big]. \end{aligned}$$

Assume for simplicity that *m* = 0, and let \(\mathcal R\phi_k=r_k\phi_k\), *r*_{1} ≥ *r*_{2} ≥⋯ ≥ 0, where (*ϕ*_{k}) is an orthonormal basis of *L*_{2}(*K*). One then has the celebrated *Karhunen–Loève expansion*

$$\displaystyle \begin{aligned} Y(x)=\sum_{k=1}^\infty \xi_k\phi_k(x),\qquad \xi_k=\langle Y,\phi_k\rangle, \end{aligned}$$

which separates the functional part of *Y* from the stochastic one: the functions *ϕ*_{k}(*x*) are deterministic; the random variables *ξ*_{k} are scalars. This separation actually holds for any orthonormal basis; the role of choosing the eigenbasis of \(\mathcal R\) is making the *ξ*_{k} *uncorrelated*: \(\mathbb {E}[\xi_k\xi_l]\) vanishes when *k* ≠ *l* and equals *r*_{k} otherwise. For this reason, it is not surprising that using as *ϕ*_{k} the eigenfunctions yields the optimal representation of *Y*. Here, optimality is with respect to truncations: for any other basis (*ψ*_{k}) and any *M*,

$$\displaystyle \begin{aligned} \mathbb{E}\Big\|Y-\sum_{k=1}^M \langle Y,\phi_k\rangle \phi_k\Big\|{}^2 \le \mathbb{E}\Big\|Y-\sum_{k=1}^M \langle Y,\psi_k\rangle \psi_k\Big\|{}^2, \end{aligned}$$

so that (*ϕ*_{k}) provides the best finite-dimensional approximation to *Y*. The approximation error on the left-hand side equals \(\sum_{k=M+1}^\infty r_k\). One estimates *m* and *κ* on the basis of a sample *Y*_{1}, …, *Y*_{n} by their empirical counterparts

$$\displaystyle \begin{aligned} \widehat m(x)=\frac1n\sum_{i=1}^n Y_i(x),\qquad \widehat\kappa(x,y)=\frac1n\sum_{i=1}^n [Y_i(x)-\widehat m(x)][Y_i(y)-\widehat m(y)]. \end{aligned}$$
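The mechanics above can be sketched numerically. The following Python snippet (assuming `numpy` is available; the two-term model, grid, and eigenvalues are purely illustrative) simulates a Karhunen–Loève model, estimates the covariance kernel, and checks that the empirical eigenbasis recovers the eigenvalues and decorrelates the scores.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)                 # grid on K = [0, 1]
dx = x[1] - x[0]
n = 2000                                   # sample size

# Two-term Karhunen-Loeve model: Y = xi_1*phi_1 + xi_2*phi_2, mean m = 0
phi1 = np.sqrt(2) * np.sin(2 * np.pi * x)  # orthonormal in L_2[0, 1]
phi2 = np.sqrt(2) * np.cos(2 * np.pi * x)
xi = rng.normal(size=(n, 2)) * np.array([2.0, 0.5])   # r_1 = 4, r_2 = 0.25
Y = xi[:, :1] * phi1 + xi[:, 1:] * phi2

# Empirical mean function and covariance kernel kappa(x, y)
m_hat = Y.mean(axis=0)
Yc = Y - m_hat
kappa_hat = Yc.T @ Yc / n

# Eigendecomposition of the discretised covariance operator (quadrature weight dx)
evals, evecs = np.linalg.eigh(kappa_hat * dx)
evals, evecs = evals[::-1], evecs[:, ::-1]            # descending order
r_hat = evals[:2]                                     # estimates of r_1, r_2

# Scores <Y, phi_k> in the estimated eigenbasis are (empirically) uncorrelated
scores = Yc @ evecs[:, :2] * np.sqrt(dx)
corr = np.corrcoef(scores.T)[0, 1]
```

The scores in any other basis would generally be correlated; in the eigenbasis their empirical covariance is diagonal by construction.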

The second layer is the *phase variation*, which is non-linear and does not have an obvious finite-dimensional analogue. It arises when, in addition to the randomness in the values *Y*(*x*) themselves, an extra layer of stochasticity is present in the domain of definition. In mathematical terms, there is a random invertible *warp function* (sometimes called *deformation* or *warping*) *T* : *K* → *K*, and instead of *Y*(*x*), one observes realisations from

$$\displaystyle \begin{aligned} \tilde Y(x)=Y(T^{-1}(x)),\qquad x\in K, \end{aligned}$$

that is, variation “in the *x*-axis”. When *d* = 1, the set *K* is usually interpreted as a time interval, and then the model stipulates that each individual has its own time scale. Typically, the warp function is assumed to be a homeomorphism of *K* independent of *Y*, and often some additional smoothness is imposed, say *T* ∈ *C*^{2}. One of the classical examples is growth curves of children, of which a dataset from the Berkeley growth study (Jones and Bayley [73]) is shown in Fig. 4.1. The curves are the derivatives of the height of a sample of ten girls as a function of time, from birth until age 18. One clearly notices the presence of the two types of variation in the figure. The velocity for all children is the highest immediately or shortly after birth, and in most cases decreases sharply during the first 2 years. Then follows a period of acceleration for another year or so, and so on. Despite presenting qualitatively similar behaviour, the curves differ substantially not only in the magnitude of the peaks but also in their location. For instance, one red curve has a local minimum at the age of three, while a green one has a local maximum at almost that same time point. It is apparent that if one tries to estimate the mean function by averaging the curves at each time *x*, the shape of the resulting estimate would look very different from each of the curves. Thus, this pointwise averaging (known as the *cross-sectional mean*) fails to represent the typical behaviour. This phenomenon is seen more explicitly in the next example.

Let *A* and *B* be symmetric random variables and consider the random function

$$\displaystyle \begin{aligned} \tilde Y(x)=A\sin\big(8\pi(x+B)\big). \qquad\qquad (4.1) \end{aligned}$$

(The map *x* ↦ *x* + *B* is not from [0, 1] to itself; for illustration purposes, we assume in this example that \(K=\mathbb {R}\).) The random variable *A* generates the amplitude variation, while *B* represents the phase variation. In Fig. 4.2, we plot four realisations and the resulting empirical means for the two extreme scenarios where *B* = 0 (no phase variation) or *A* = 1 (no amplitude variation). In the left panel of the figure, we see that the sample mean (in thick blue) lies between the observations and has a similar form, so can be viewed as the curve representing the typical realisation of the random curve. This is in contrast to the right panel, where the mean is qualitatively different from all curves in the sample: though periodicity is still present, the peaks and troughs have been flattened, and the sample mean is much more diffuse than any of the observations.

Indeed, when *A* = 1 we have

$$\displaystyle \begin{aligned} \mathbb{E}\tilde Y(x)=\mathbb{E}\sin\big(8\pi(x+B)\big) =\sin(8\pi x)\,\mathbb{E}\cos(8\pi B)+\cos(8\pi x)\,\mathbb{E}\sin(8\pi B). \end{aligned}$$

Since *B* is symmetric the second term vanishes, and unless *B* is trivial the expectation of the cosine is smaller than one in absolute value. Consequently, the expectation of \(\tilde Y(x)\) is the original function \(\sin 8\pi x\) multiplied by a constant of magnitude strictly less than one, resulting in peaks of smaller magnitude.

More generally, since *Y* and *T* are independent, we have

$$\displaystyle \begin{aligned} \tilde\mu(x)=\mathbb{E}[\tilde Y(x)\mid T]=m(T^{-1}(x)), \end{aligned}$$

the conditional expectation of \(\tilde Y(x)\) given *T*. Then the value of the mean function itself, \(\mathbb {E} \tilde \mu \), at a point *x*_{0} is determined not by a single point, say *m*(*x*_{0}), but rather by all the values of *m* at the possible outcomes of *T*^{−1}(*x*_{0}). In particular, if *x*_{0} was a local maximum for *m*, then \(\mathbb {E} [\tilde \mu (x_0)]\) will typically be strictly smaller than *m*(*x*_{0}); the phase variation results in smearing *m*.

If it is *m*, the mean of *Y*, that is of interest, then the confounding of the amplitude and phase variation will lead to inconsistency. This can also be seen from the formula

$$\displaystyle \begin{aligned} \tilde Y(x)=Y(T^{-1}(x))=\sum_{k=1}^\infty \xi_k\,\phi_k(T^{-1}(x)), \end{aligned}$$

which is *not* the Karhunen–Loève expansion of \(\tilde Y\); the simplest way to notice this is the observation that each summand *ϕ*_{k}(*T*^{−1}(*x*)) includes both the functional component *ϕ*_{k} and the random component *T*^{−1}(*x*). The true Karhunen–Loève expansion of \(\tilde Y\) will in general be qualitatively very different from that of *Y*, not only in terms of the mean function but also in terms of the covariance operator and, consequently, its eigenfunctions and eigenvalues. As illustrated in the trigonometric example, the typical situation is that the mean \(\mathbb {E} \tilde Y\) is more diffuse than *m*, and the decay of the eigenvalues \(\tilde r_k\) of the covariance operator is slower than that of *r*_{k}; as a result, one needs to truncate the sum at a high threshold in order to capture a substantial enough part of the variability. In the toy example (4.1), the Karhunen–Loève expansion has a single term besides the mean if *B* = 0, while having two terms if *A* = 1.

When the objects of interest are the mean *m* and the covariance *κ*, the random function *T* pertaining to the phase variation is a nuisance parameter. Given a sample \(\tilde Y_i=Y_i\circ T_i^{-1}\), *i* = 1, …, *n*, there is no point in taking pointwise means of \(\tilde Y_i\), because the curves are *misaligned*; \(\tilde Y_1(x)=Y_1(T_1^{-1}(x))\) should not be compared with \(\tilde Y_2(x)\), but rather with \(Y_2(T_1^{-1}(x))=\tilde Y_2(T_2(T_1^{-1}(x)))\). To overcome this difficulty, one seeks estimators \(\widehat {T_i}\) such that

$$\displaystyle \begin{aligned} \widehat{Y_i}(x)=\tilde Y_i(\widehat{T_i}(x))\approx Y_i(x). \end{aligned}$$

In other words, one tries to align the curves in the sample to have a common time scale. Such a procedure is called *curve registration*. Once registration has been carried out, one proceeds with the analysis of \(\widehat {Y_i}(x)\) assuming only amplitude variation is now present: estimate the mean *m* by the cross-sectional mean of the \(\widehat{Y_i}\) and the covariance *κ* by its analogous counterpart. Put differently, registering the curves amounts to *separating the two types of variation*. This step is crucial regardless of whether the warp functions are considered as nuisance or an analysis of the warp functions is of interest in the particular application.
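In the simplest toy setting, registration can be carried out by elementary means. The sketch below (assuming `numpy`; the periodic-shift warp model and the cross-correlation estimator are illustrative stand-ins, not the methods of this chapter) aligns circularly shifted copies of a common shape and shows that the cross-sectional mean of the registered curves recovers the shape.

```python
import numpy as np

rng = np.random.default_rng(2)
J = 256
x = np.arange(J) / J
template = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)    # common shape Y

# Warped curves: circular shifts, a toy stand-in for warp maps T_i(x) = x + b_i
shifts = rng.integers(-40, 41, size=10)
curves = np.stack([np.roll(template, s) for s in shifts])

# Register every curve to the first one by maximising circular cross-correlation
ref = curves[0]
est, aligned = [], []
for c in curves:
    corr = np.fft.ifft(np.fft.fft(ref) * np.conj(np.fft.fft(c))).real
    s_hat = int(np.argmax(corr))                     # estimated relative shift
    est.append(s_hat)
    aligned.append(np.roll(c, s_hat))
aligned = np.stack(aligned)

mean_raw = curves.mean(axis=0)      # cross-sectional mean of misaligned curves
mean_reg = aligned.mean(axis=0)     # cross-sectional mean after registration
```

The misaligned mean smears the peak across the locations of the individual peaks, whereas the registered mean reproduces the template.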

There is an obvious identifiability problem in the model \(\tilde Y=Y\circ T^{-1}\). If *S* is any (deterministic) invertible function, then the model with (*Y*, *T*) is statistically indistinguishable from the model with (*Y* ∘ *S*, *T* ∘ *S*). It is therefore often assumed that \(\mathbb {E} T=\mathbf i\), the identity map, and in addition, in nearly all applications, that *T* is monotonically increasing (if *d* = 1).

*Discretely observed data*. One cannot measure the height of a person at every single instant of her life. In other words, it is rare in practice that one has access to the entire curve. A far more common situation is that one observes the curves *discretely*, i.e., at a finite number of points. The conceptually simplest setting is that one has access to a grid *x*_{1}, …, *x*_{J} ∈ *K*, and the data come in the form \(\tilde Y_i(x_j)\), *i* = 1, …, *n*; *j* = 1, …, *J*, possibly contaminated by measurement error. The goal remains the estimation of the warp functions *T*_{i} and of the original, aligned functions *Y*_{i}.

In the bibliographic notes, we review some methods for carrying out this separation of amplitude and phase variation. It is fair to say that no single registration method arises as the canonical solution to the functional registration problem. Indeed, most need to make additional structural and/or smoothness assumptions on the warp maps, further to the basic identifiability conditions requiring that *T* be increasing and that \(\mathbb {E} T\) equal the identity. We will eventually see that, in contrast, the case of point processes (viewed as discretely observed random measures) admits a canonical framework, without needing additional assumptions.

### 4.1.2 The Point Process Case

A point process is the mathematical object that represents the intuitive notion of a random collection of points in a space \(\mathcal X\). It is formally defined as a measurable map *Π* from a generic probability space into the space of (possibly infinite) Borel integer-valued measures of \(\mathcal X\) in such a way that *Π*(*B*) is a measurable real-valued random variable for all Borel subsets *B* of \(\mathcal X\). The quantity *Π*(*B*) represents the random number of points observed in the set *B*. Among the plethora of books on point processes, let us mention Daley and Vere-Jones [41] and Karr [79]. Kallenberg [75] treats more general objects, *random measures*, of which point processes are a particular special case. We will assume for convenience that *Π* is a measure on a compact subset \(K\subset \mathbb {R}^d\).

The amplitude variation of *Π* can be understood in analogy with the functional case. One defines the mean measure *λ*(*A*) = \(\mathbb {E}[\varPi (A)]\) and the covariance cov[*Π*(*A*), *Π*(*B*)] for Borel subsets *A*, *B* of *K*. Just like in the functional case, these two objects encapsulate the (second-order) amplitude variation^{3} properties of the law of *Π*.

Given a sample *Π*_{1}, …, *Π*_{n} of independent point processes distributed as *Π*, the natural estimators of these quantities are their empirical counterparts. Phase variation is introduced, as in the functional case, by means of a random warp map *T* : *K* → *K* (independent of *Π*) that deforms *Π*: if we denote the points of *Π* by *x*_{1}, …, *x*_{K} (with *K* random), then instead of (*x*_{i}), one observes *T*(*x*_{1}), …, *T*(*x*_{K}). In symbols, this means that the data arise as \(\widetilde \varPi = T\#\varPi \). We refer to *Π* as the *original point processes*, and \(\widetilde \varPi \) as the *warped point processes*. An example of 30 warped and unwarped point processes is shown in Fig. 4.3. The point patterns in both panels present a qualitatively similar structure: there are two peaks of high concentration of points, while few points appear between these peaks. The difference between the two panels is in the position and concentration of those peaks. In the left panel, only amplitude variation is present, and the location/concentration of the peaks is the same across all observations. In contrast, phase variation results in shifting the peaks to different places for each of the observations, while also smearing or sharpening them. Clearly, estimation of the mean measure of a subset *A* by averaging the number of observed points in *A* would not be satisfactory as an estimator of *λ* when carried out with the warped data. As in the functional case, it will only be consistent for the measure \(\tilde \lambda \) defined by

$$\displaystyle \begin{aligned} \tilde\lambda(A)=\mathbb{E}[\widetilde\varPi(A)]=\mathbb{E}[\lambda(T^{-1}(A))], \end{aligned}$$

which typically differs from *λ* and is far more diffuse.

Since *Π* and *T* are independent, the conditional expectation of \(\widetilde \varPi \) given *T* is

$$\displaystyle \begin{aligned} \mathbb{E}[\widetilde\varPi(A)\mid T]=\lambda(T^{-1}(A))=[T\#\lambda](A), \end{aligned}$$

and we refer to *Λ* = *T#λ* as the *conditional mean measure*. The problem of separation of amplitude and phase variation can now be stated as follows. On the basis of a sample \(\widetilde \varPi _1,\dots ,\widetilde \varPi _n\), find estimators of (*T*_{i}) and (*Π*_{i}). Registering the point processes amounts to constructing estimators, *registration maps* \(\widehat {T_i^{-1}}\), such that the aligned point processes \(\widehat {T_i^{-1}}\#\widetilde \varPi _i\) are close to the original point processes *Π*_{i}.

### Remark 4.1.1 (Poisson Processes)

*A special but important case is that of a Poisson process. Gaussian processes probably yield the most elegant and rich theory in functional data analysis, and so do Poisson processes when it comes to point processes. We say that Π is a* Poisson process *when the following two conditions hold. (1) For any disjoint collection* (*A*_{1}, …, *A*_{n}) *of Borel sets, the random variables Π*(*A*_{1}), …, *Π*(*A*_{n}) *are independent; and (2) for every Borel A* ⊆ *K, Π*(*A*) *follows a Poisson distribution with mean λ*(*A*)*:*

$$\displaystyle \begin{aligned} \mathbb{P}[\varPi(A)=k]=e^{-\lambda(A)}\,\frac{[\lambda(A)]^k}{k!},\qquad k=0,1,2,\dots \end{aligned}$$

*Conditional upon T, the random variables* \(\widetilde \varPi (A_k)=\varPi (T^{-1}(A_k))\)*, k *= 1, …, *n are independent as the sets* (*T*^{−1}(*A*_{k})) *are disjoint, and* \(\widetilde \varPi (A)\) *follows a Poisson distribution with mean λ*(*T*^{−1}(*A*)) =* Λ*(*A*)*. This is precisely the definition of a* Cox process*: conditional upon the* driving measure *Λ,* \(\widetilde \varPi \) *is a Poisson process with mean measure Λ. For this reason, it is also called a* doubly stochastic process*; in our context, the phase variation is associated with the stochasticity of Λ while the amplitude one is associated with the Poisson variation conditional upon Λ.*
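The doubly stochastic structure can be observed in a simulation. In the sketch below (assuming `numpy`; the warp family `T(x) = x**g` is purely illustrative and is not constrained to have mean identity), the warped counts have the conditional mean measure as predicted, and are overdispersed relative to a Poisson law, as expected of a Cox process.

```python
import numpy as np

rng = np.random.default_rng(3)
tau, n_rep = 5.0, 20000
counts = np.empty(n_rep)
for r in range(n_rep):
    # Poisson process on [0, 1] with mean measure tau * Lebesgue
    pts = rng.uniform(0, 1, size=rng.poisson(tau))
    g = rng.uniform(0.5, 2.0)            # hypothetical warp T(x) = x**g, increasing
    warped = pts ** g                    # points of the warped (Cox) process
    counts[r] = np.sum(warped < 0.5)

# E[count in A] = E[Lambda(A)] with Lambda = T#lambda: Lambda([0, 0.5)) = tau * 0.5**(1/g)
g_mc = rng.uniform(0.5, 2.0, size=200000)
theory = tau * np.mean(0.5 ** (1.0 / g_mc))
```

For a Poisson process the variance of the count equals its mean; here the randomness of the driving measure inflates the variance.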

As in the functional case there are problems with identifiability: the model (*Π*, *T*) cannot be distinguished from the model (*S#Π*, *T* ∘ *S*^{−1}) for any invertible *S* : *K* → *K*. It is thus natural to assume that \(\mathbb {E} T\) is the identity map^{4} (otherwise set \(S=\mathbb {E} T\), i.e., replace *Π* by \([\mathbb {E} T]\#\varPi \) and *T* by \(T\circ [\mathbb {E} T]^{-1}\)).

Assuming *T* to have mean identity is nevertheless not sufficient for the model \(\widetilde \varPi =T\#\varPi \) to be identifiable. The reason is that given the two point sets \(\widetilde \varPi \) and *Π*, there are many functions that push forward the latter to the former. This ambiguity can be dealt with by assuming some sort of *regularity* or *parsimony* for *T*. For example, when *K* = [*a*, *b*] is a subset of the real line, imposing *T* to be monotonically increasing guarantees its uniqueness. In multiple dimensions, there is no obvious analogue for increasing functions. One possibility is to use the coordinatewise partial order (*y* ≥ *x* if and only if *y*_{i} ≥ *x*_{i} for all *i*) and to require that the deformations preserve it: if *z* = *T*(*x*) and *w* = *T*(*y*), then *y* ≥ *x* should imply *w* ≥ *z*. A more fruitful notion is the monotonicity described in Sect. 1.7.2,

$$\displaystyle \begin{aligned} \langle T(y)-T(x),\,y-x\rangle\ge 0,\qquad x,y\in K, \end{aligned}$$

whose cyclical version is equivalent to the property of being the subgradient of a convex function. If extra smoothness is present and *T* is a gradient of some function \(\phi :K\to \mathbb {R}\), then *ϕ* must be convex and *T* is then cyclically monotone. Consequently, we will make the following assumptions:

- the expected value of *T* is the identity;
- *T* is a gradient of a convex function.

These conditions impose no further restrictions on the law of *T*, its structural properties, or its distance from the identity. In the next section, we show how these two conditions alone lead to the Wasserstein geometry and open the door to consistent, fully nonparametric separation of the amplitude and phase variation.

## 4.2 Wasserstein Geometry and Phase Variation

### 4.2.1 Equivariance Properties of the Wasserstein Distance

A key property of the Wasserstein distance is its sensitivity to the geometry of the ground space: for all *p* ≥ 1 and all \(x,y\in \mathcal X\),

$$\displaystyle \begin{aligned} W_p(\delta_x,\delta_y)=\|x-y\|, \end{aligned}$$

where *δ*_{x} is as usual the Dirac measure at \(x\in \mathcal X\). This is in contrast to metrics such as the bounded Lipschitz distance (that metrises weak convergence) or the total variation distance on \(P(\mathcal X)\). Recall that these are defined by

$$\displaystyle \begin{aligned} \mathrm{BL}(\mu,\nu)=\sup_{\|f\|{}_\infty+\mathrm{Lip}(f)\le1}{\int_{\mathcal X} \! {f} \, \mathrm{d}{(\mu-\nu)}}, \qquad \mathrm{TV}(\mu,\nu)=\sup_{B\ \mathrm{Borel}}|\mu(B)-\nu(B)|, \end{aligned}$$

and neither reflects the distance between *x* and *y*: the total variation distance between *δ*_{x} and *δ*_{y} equals one whenever *x* ≠ *y*. The Wasserstein distance is, moreover, equivariant under translations. Let *X* ∼ *μ* and *Y* ∼ *ν* be random elements in \(\mathcal X\), let *a* be a fixed point in \(\mathcal X\), and set *X′* = *X* + *a* and *Y′* = *Y* + *a*. Joint couplings *Z′* = (*X′*, *Y′*) are precisely those that take the form (*a*, *a*) + *Z* for a joint coupling *Z* = (*X*, *Y*). Thus

$$\displaystyle \begin{aligned} W_p(\mu*\delta_a,\nu*\delta_a)=W_p(\mu,\nu), \end{aligned}$$

where *δ*_{a} is a Dirac measure at *a* and ∗ denotes convolution.

This carries over to Fréchet means in an obvious way.

### Lemma 4.2.1 (Fréchet Means and Translations)

*Let Λ be a random measure in* \(\mathcal W_2(\mathcal X)\) *with finite Fréchet functional and* \(a\in \mathcal X\)*. Then γ is a Fréchet mean of Λ if and only if γ *∗* δ*_{a} *is a Fréchet mean of Λ *∗* δ*_{a}.

The lemma is valid for any exponent *p*, in the formulation sketched in the bibliographic notes of Chap. 2. In the quadratic case, one has a simple extension to the case where only one measure is translated. Denote the first moment (mean) of \(\mu \in \mathcal W_1(\mathcal X)\) by

$$\displaystyle \begin{aligned} m(\mu)={\int_{\mathcal X} \! {x} \, \mathrm{d}{\mu(x)}}. \end{aligned}$$

Then, since the cross-term does not depend on the coupling,

$$\displaystyle \begin{aligned} W_2^2(\mu,\nu*\delta_a)=W_2^2(\mu,\nu)+\|a\|{}^2-2\langle a,\,m(\mu)-m(\nu)\rangle, \end{aligned}$$

which is minimised when *a* = *m*(*μ*) − *m*(*ν*). This leads to the following conclusion:

### Proposition 4.2.2 (First Moment of Fréchet Mean)

*Let Λ be a random measure in* \(\mathcal W_2(\mathcal X)\) *with finite Fréchet functional and Fréchet mean γ. Then*

$$\displaystyle \begin{aligned} m(\gamma)=\mathbb{E}[m(\varLambda)]. \end{aligned}$$

### 4.2.2 Canonicity of Wasserstein Distance in Measuring Phase Variation

The purpose of this subsection is to show that the standard functional data analysis assumptions on the warp function *T*, having mean identity and being increasing, are equivalent to purely geometric conditions on *T* and the conditional mean measure *Λ* = *T#λ*. Put differently, if one is willing to assume that \(\mathbb {E} T=\mathbf i\) and that *T* is increasing, then one is led *unequivocally* to the problem of estimation of Fréchet means in the Wasserstein space \(\mathcal W_2(\mathcal X)\). When \(\mathcal X\ne \mathbb {R}\), “increasing” is interpreted as being the gradient of a convex function, as explained at the end of Sect. 4.1.2.

The total mass \(\lambda (\mathcal X)\) is invariant under the push-forward operation, and when it is finite, we may assume without loss of generality that it is equal to one, because all the relevant quantities scale with the total mass. Indeed, if *λ* = *τμ* with *μ* a probability measure and *τ* > 0, then *T#λ* = *τ* × *T#μ*, and the Wasserstein distance (defined as the infimum-over-coupling integrated cost) between *τμ* and *τν* is *τW*_{p}(*μ*, *ν*) for *μ*, *ν* probabilities.

We begin with the one-dimensional case, where the explicit formulae allow for a more transparent argument, and for simplicity we will assume some regularity.

### Assumptions 2

*The domain* \(K\subset \mathbb {R}\) *is a nonempty compact convex set (an interval), and the continuous and injective random map* \(T:K\to \mathbb {R}\) *(a random element in C*_{b}(*K*)*) satisfies the following two conditions:*

- (A1) *Unbiasedness:* \(\mathbb {E}[T(x)]=x\) *for all x* ∈ *K.*
- (A2) *Regularity: T is monotone increasing.*

The relevance of the Wasserstein geometry to phase variation becomes clear in the following proposition that shows that Assumptions 2 are equivalent to geometric assumptions on the Wasserstein space \(\mathcal W_2(\mathbb {R})\).

### Proposition 4.2.3 (Mean Identity Warp Functions and Fréchet Means in \(\mathcal W_2(\mathbb {R})\))

*Let* \(\varnothing \ne K\subset \mathbb {R}\) *be compact and convex and* \(T:K\to \mathbb {R}\) *continuous and injective. Then Assumptions* 2 *hold if and only if, for any* \(\lambda \in \mathcal W_2(\mathbb {R})\) *supported on K such that* \(\mathbb {E} [W_2^2(T\#\lambda ,\lambda )]<\infty \)*, the following two conditions are satisfied:*

- (B1) *Unbiasedness: for any* \(\theta \in \mathcal W_2(\mathbb {R})\)*,*
$$\displaystyle \begin{aligned} \mathbb{E} [W_2^2(T\#\lambda,\lambda)] \le \mathbb{E} [W_2^2(T\#\lambda,\theta)]. \end{aligned}$$
- (B2) *Regularity: if* \(Q:K\to \mathbb {R}\) *is such that T#λ* = *Q#λ, then, almost surely,*
$$\displaystyle \begin{aligned} {{\int_{K} \! {\Big|T(x)-x\Big|{}^2} \, \mathrm{d}{\lambda(x)}}} \le {{\int_{K} \! {\Big|Q(x)-x\Big|{}^2} \, \mathrm{d}{\lambda(x)}}}. \end{aligned}$$

These assumptions have a clear interpretation: (B1) stipulates that *λ* is a Fréchet mean of the random measure *Λ* = *T#λ*, while (B2) states that *T* must be the optimal map from *λ* to *Λ*, that is, \(T={\mathbf t_{\lambda }^{\varLambda }} \).

### Proof

If *T* satisfies (B2) then, as an optimal map, it must be nondecreasing *λ*-almost everywhere. Since *λ* is arbitrary, *T* must be nondecreasing on the entire domain *K*. Conversely, if *T* is nondecreasing, then it is optimal for any *λ*. Hence (A2) and (B2) are equivalent.

Monotonicity and injectivity of *T* imply that *F*_{Λ}(*x*) = *F*_{λ}(*T*^{−1}(*x*)). Suppose that *F*_{λ} is invertible (i.e., continuous and strictly increasing on *K*). Then \(F_\varLambda ^{-1}(u)=T(F_\lambda ^{-1}(u))\). Since the Fréchet mean of *Λ* is the measure with quantile function \(\mathbb {E} F_\varLambda ^{-1}\), condition (B1) is equivalent to \(\mathbb {E} T(x)=x\) for all *x* in the range of \(F_\lambda ^{-1}\), which is *K*. The assertion that (A1) implies (B1), even if *F*_{λ} is not invertible, is proven in the next theorem (Theorem 4.2.4) in a more general context.
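The quantile relation \(F_\varLambda^{-1}=T\circ F_\lambda^{-1}\) used in the proof is easy to verify by simulation. A minimal sketch (assuming `numpy`; the particular increasing warp is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def T(x):
    # an increasing warp on [0, 1]: T'(x) = 1 + 0.3*cos(2*pi*x) > 0
    return x + 0.3 * np.sin(2 * np.pi * x) / (2 * np.pi)

lam = rng.uniform(0, 1, 100000)       # sample from lambda = U(0, 1)
Lam = T(lam)                           # sample from Lambda = T#lambda

u = np.linspace(0.05, 0.95, 19)
q_lam = np.quantile(lam, u)            # empirical F_lambda^{-1}(u)
q_Lam = np.quantile(Lam, u)            # empirical F_Lambda^{-1}(u)
gap = np.max(np.abs(q_Lam - T(q_lam)))
```

Since *T* is increasing, pushing the sample forward transports each quantile of *λ* to the corresponding quantile of *Λ*.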

The statement of the general result requires the introduction of spaces of functions with controlled growth. For *q* ≥ 0, let \(G_q(\mathcal X,\mathbb {R})\) denote the space of continuous functions \(f:\mathcal X\to \mathbb {R}\) with finite norm

$$\displaystyle \begin{aligned} \|f\|{}_{G_q}=\sup_{x\in\mathcal X}\frac{|f(x)|}{1+\|x\|{}^q}. \end{aligned}$$

The space \(G_q(\mathcal X,\mathcal X)\) is defined in the same way for *f* taking values in \(\mathcal X\) instead of \(\mathbb {R}\), and the norm will be denoted in the same way. These are nonseparable Banach spaces.

### Theorem 4.2.4 (Mean Identity Warp Functions and Fréchet Means)

*Fix* \(\lambda \in P(\mathcal X)\) *and let* \(\mathbf t\in G_{1}(\mathcal X,\mathcal X)\) *be a (Bochner measurable) random optimal map with (Bochner) mean identity and such that* \(\mathbb {E} \|\mathbf t\|{ }_{G_1}<\infty \)*. Then Λ* = **t***#λ has Fréchet mean λ:*

$$\displaystyle \begin{aligned} \mathbb{E} [W_2^2(\varLambda,\lambda)] \le \mathbb{E} [W_2^2(\varLambda,\theta)],\qquad \theta\in\mathcal W_2(\mathcal X). \end{aligned}$$

The generalisation with respect to the one-dimensional result is threefold. Firstly, since our main interest is the implication (A1–A2) ⇒ (B1–B2), we need not assume *T* to be injective. Secondly, the support of *λ* is not required to be compact. Lastly, the result holds in arbitrary dimension, including infinite-dimensional separable Hilbert spaces \(\mathcal X\). In particular, if **t** is a linear map, then \(\|\mathbf t\|{ }_{G_1}\) coincides with the operator norm of **t**, so the assumption is that **t** be a bounded self-adjoint nonnegative operator with mean identity and finite expected operator norm.
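For linear maps, the theorem can be illustrated in one dimension. In the sketch below (assuming `numpy`; the restriction of the candidate measures to centred Gaussians is for illustration only), **t**(*x*) = *sx* with *s* > 0 random and \(\mathbb{E}s=1\); then *Λ* = **t**#*λ* with *λ* = *N*(0, 1) is *N*(0, *s*²), and since *W*₂(*N*(0, *s*²), *N*(0, σ²)) = |*s* − σ|, the Fréchet functional over this family is minimised at σ = 1, i.e., at *λ* itself.

```python
import numpy as np

rng = np.random.default_rng(10)
# Random linear optimal maps t(x) = s*x with s > 0 and E s = 1 (mean identity)
s = rng.uniform(0.2, 1.8, 100000)

# Frechet functional over centred Gaussians N(0, sig^2): sig -> E (s - sig)^2 / 2
sigs = np.linspace(0.5, 1.5, 101)
vals = np.array([np.mean((s - t) ** 2) / 2 for t in sigs])
best = sigs[int(np.argmin(vals))]     # minimiser over the grid
```

The minimiser over the grid sits at σ ≈ 1, matching the mean-identity condition on the maps.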

### Proof

The monotonicity of **t** ensures that it has a convex potential *ϕ*, and strong and weak duality then give the asserted inequality, as required. The rigorous mathematical justification for this is given on page 88 in the supplement.

### Remark 4.2.5

*The “natural” space for* **t** *would be* \(\mathcal L_2(\lambda )\) *, but without the continuity assumption, the result may fail (Álvarez-Esteban et al. [* 9 *, Example 3.1]). A simple argument shows that the growth condition imposed by the G* _{1} *assumption is minimal; see page 89 in the supplement or Galasso et al. [* 58 *].*

### Remark 4.2.6

*The same statement holds if* \(\mathcal X\) *is replaced by a (Borel) convex subset K thereof. The integrals will then be taken on K, showing that λ minimises the Fréchet functional among measures supported on K, and, by continuity, on* \(\overline K\) *. By Proposition* *3.2.4* *, λ is a Fréchet mean.*

## 4.3 Estimation of Fréchet Means

### 4.3.1 Oracle Case

In view of the canonicity of the Wasserstein geometry in Sect. 4.2.2, separation of amplitude and phase variation of the point processes \(\widetilde \varPi _i\) essentially requires computing Fréchet means in the 2-Wasserstein space. It is both conceptually important and technically convenient to introduce the case where an oracle reveals the conditional mean measures *Λ* = *T#λ* entirely. Thus, assuming that \(\lambda \in \mathcal W_2(\mathcal X)\) is the unique Fréchet mean of a random measure *Λ*, the goal is to estimate the structural mean *λ* on the basis of independent and identically distributed realisations *Λ*_{1}, …, *Λ*_{n} of *Λ*.

The Fréchet mean *λ* is defined as the minimiser of the Fréchet functional

$$\displaystyle \begin{aligned} F(\theta)=\tfrac12\,\mathbb{E} [W_2^2(\varLambda,\theta)],\qquad \theta\in\mathcal W_2(\mathcal X), \end{aligned}$$

and it is natural to estimate *λ* by a minimiser, say *λ*_{n}, of the empirical Fréchet functional

$$\displaystyle \begin{aligned} F_n(\theta)=\frac1{2n}\sum_{i=1}^n W_2^2(\varLambda_i,\theta). \end{aligned}$$

Such a minimiser *λ*_{n} exists by Corollary 3.1.3. When \(\mathcal X=\mathbb {R}\), *λ*_{n} can be seen to be an *unbiased* estimator of *λ* in a generalised sense of Lehmann [88] (see Sect. 4.3.5).
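On the real line, the empirical Fréchet mean can be computed explicitly through quantile functions, since \(W_2^2(\mu,\nu)=\int_0^1 (F_\mu^{-1}-F_\nu^{-1})^2\). A minimal sketch (assuming `numpy` and `scipy`; the Gaussian family for the *Λ*_{i} is illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, m = 5, 400
u = (np.arange(m) + 0.5) / m                 # grid of quantile levels

# Illustrative random measures Lambda_i: Gaussians with random location/scale
mus = rng.normal(0, 1, n)
sds = rng.uniform(0.5, 2.0, n)
Q = norm.ppf(u)[None, :] * sds[:, None] + mus[:, None]   # quantile functions

def F_n(q_theta):
    # empirical Frechet functional via the quantile form of W2^2
    return np.mean((Q - q_theta[None, :]) ** 2, axis=1).sum() / (2 * n)

Qbar = Q.mean(axis=0)    # quantile function of the empirical Frechet mean
base = F_n(Qbar)
```

Averaging quantile functions pointwise minimises the functional at every level *u*, so no other candidate can do better.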

The warp maps (and their inverses) can then be estimated as the optimal maps from *λ*_{n} to each *Λ*_{i} (see Sect. 4.3.4).

### 4.3.2 Discretely Observed Measures

In practice, one rarely observes the measures *Λ*_{1}, …, *Λ*_{n} themselves. A far more realistic scenario is that one only has access to a discrete version of *Λ*_{i}, say \(\widetilde {\varLambda }_i\). The simplest situation is when \(\widetilde {\varLambda }_i\) arises as an empirical measure of the form \(\tau ^{-1}\sum _{j=1}^\tau \delta \{Y_j\}\), where the *Y*_{j} are independent with distribution *Λ*_{i}. More generally, \(\widetilde \varLambda _i\) can be a normalised point process \(\widetilde \varPi _i\) with mean measure *τΛ*_{i}, i.e.,

$$\displaystyle \begin{aligned} \widetilde\varLambda_i=\frac{\widetilde\varPi_i}{\widetilde\varPi_i(\mathcal X)},\qquad \mathbb{E}\widetilde\varPi_i=\tau\varLambda_i; \end{aligned}$$

the empirical measure above is recovered when *τ* is an integer and \(\widetilde \varPi _i\) is a *binomial point process*. The parameter *τ* is the expected number of observed points over the entire space \(\mathcal X\); the larger *τ* is, the more information \(\widetilde \varPi _i\) gives on *Λ*_{i}.
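The role of *τ* can be illustrated by simulation. In the sketch below (assuming `numpy` and `scipy`; the Beta law for *Λ*_{i} and the use of *W*₁ via `scipy.stats.wasserstein_distance` are illustrative choices), the Wasserstein distance between the observed discrete measure and *Λ*_{i} shrinks as *τ* grows.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(8)
truth = rng.beta(2, 5, size=50000)      # dense reference sample from Lambda_i

def obs_error(tau, reps=20):
    # observed measure: tau iid points from Lambda_i (a binomial point process),
    # compared with Lambda_i in Wasserstein distance (W1 here), averaged over reps
    return np.mean([wasserstein_distance(rng.beta(2, 5, size=tau), truth)
                    for _ in range(reps)])

errs = [obs_error(tau) for tau in (10, 100, 1000)]
```

The error decays at the usual \(\tau^{-1/2}\) Monte Carlo rate for this one-dimensional example.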

The normalisation \(\widetilde \varLambda _i=\widetilde \varPi _i/\widetilde \varPi _i(\mathcal X)\) is ill-defined on the event that no points of *Λ*_{i} are observed. In the asymptotic setup below, conditions will be imposed to ensure that this probability becomes negligible as *n* → *∞*. For concreteness, we then define \(\widetilde \varLambda _i=\lambda ^{(0)}\) for some fixed measure *λ*^{(0)} that will be of minor importance. This can be a Dirac measure at 0, a certain fixed Gaussian measure, or (normalised) Lebesgue measure on some bounded set in case \(\mathcal X=\mathbb {R}^d\). We can now replace the estimator *λ*_{n} by \(\widetilde \lambda _n\), defined as any minimiser of

$$\displaystyle \begin{aligned} \widetilde F_n(\theta)=\frac1{2n}\sum_{i=1}^n W_2^2(\widetilde\varLambda_i,\theta). \end{aligned}$$

As a generalisation of the discrete case discussed in Sect. 1.3, the Fréchet mean of discrete measures can be computed exactly. Suppose that \(N_i=\widetilde \varPi _i(\mathcal X)\) is nonzero for all *i*. Then each \(\widetilde \varLambda _i\) is a discrete measure supported on *N*_{i} points. One can then recast the multimarginal formulation (see Sect. 3.1.2) as a finite linear program, solve it, and “average” the solution as in Proposition 3.1.2 in order to obtain \(\widetilde \lambda _n\) (an alternative linear programming formulation for finding a Fréchet mean is given by Anderes et al. [14]). Thus, \(\widetilde \lambda _n\) can be computed in finite time, even when \(\mathcal X\) is infinite-dimensional.

Finally, a remark about measurability is in order. Point processes can be viewed as random elements in \(M_+(\mathcal X)\) endowed with the *vague topology* induced from convergence of integrals of continuous functions with compact support. If *μ*_{n} converge to *μ* vaguely, and *a*_{n} are numbers that converge to *a*, then *a*_{n}*μ*_{n} → *aμ* vaguely. Thus, \(\tilde \varLambda _i\) is a continuous function of the pair \((\widetilde \varPi _i,\widetilde \varPi _i(\mathcal X))\) and can be viewed as a random measure with respect to the vague topology. The restriction of the vague topology to probability measures is equivalent to the weak topology,^{5} and therefore vague, weak, and Wasserstein measurability are all equivalent.

### 4.3.3 Smoothing

Even when the computational complexity involved in calculating \(\widetilde \lambda _n\) is tractable, there is another reason not to use it as an estimator for *λ*. If one has a priori knowledge that *λ* is smooth, it is often desirable to estimate it by a smooth measure. One way to achieve this would be to apply some smoothing technique to \(\widetilde \lambda _n\) using, e.g., kernel density estimation. However, unless the number of observed points from each measure is the same *N*_{1} = ⋯ = *N*_{n} = *N*, \(\widetilde \lambda _n\) will usually be concentrated on many points, essentially *N*_{1} + ⋯ + *N*_{n} of them. In other words, the Fréchet mean is concentrated on many more points than each of the measures \(\widetilde \varLambda _i\), thus potentially hindering its usefulness as a mean, because it will not be representative of the sample.

To see why, suppose that \(\mathcal X=\mathbb {R}\), in which case the Fréchet mean \(\widetilde \lambda _n\) is the measure whose quantile function is the average of the quantile functions of the \(\widetilde \varLambda _i\)'s, each of which is supported on *N*_{i} points (distinct, if each *Λ*_{i} is diffuse). If we now set *G*_{i} to be the distribution function of \(\widetilde \varLambda _i\), then the quantile function \(G_i^{-1}\) is piecewise constant on each interval (*k*∕*N*_{i}, (*k* + 1)∕*N*_{i}], with jumps at *k*∕*N*_{i} for *k* ≤ *N*_{i} and *i* = 1, …, *n*. In the worst-case scenario, when no pair of the integers *N*_{1}, …, *N*_{n} has a common divisor, there will be \(N_1+\dots +N_n-n+1\) jumps of the averaged quantile function *G*^{−1}, which is the number of points on which the Fréchet mean will be supported. (All the \(G_i^{-1}\)'s have a jump at one, which thus needs to be counted once rather than *n* times.)

By counting the number of redundancies in the constraints matrix of the linear program, one can show that this is in general an upper bound on the number of support points of the Fréchet mean.
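On the line, this support inflation is easy to see numerically. The following sketch (with hypothetical sample points) builds the piecewise-constant quantile functions of two empirical measures with *N*_{1} = 3 and *N*_{2} = 4 atoms, averages them at their jump locations, and counts the distinct values of the averaged quantile function, i.e., the support size of the Fréchet mean.

```python
import numpy as np

def quantile_fn(atoms):
    """Left-continuous quantile function of the empirical measure placing
    mass 1/N on each atom: G^{-1}(u) = x_(k) for u in ((k-1)/N, k/N]."""
    atoms = np.sort(np.asarray(atoms, dtype=float))
    N = len(atoms)
    def G_inv(u):
        k = np.ceil(np.asarray(u) * N).astype(int)  # index of the order statistic
        return atoms[np.clip(k - 1, 0, N - 1)]
    return G_inv

# Two empirical measures with 3 and 4 atoms (hypothetical data).
G1 = quantile_fn([0.0, 1.0, 2.0])
G2 = quantile_fn([0.0, 0.5, 1.5, 3.0])

# The Fréchet mean in W_2(R) has quantile function (G1^{-1} + G2^{-1})/2,
# which is constant between consecutive jump points k/3 and k/4.
jumps = sorted(set(k / 3 for k in range(1, 4)) | set(k / 4 for k in range(1, 5)))
support = sorted(set((G1(u) + G2(u)) / 2 for u in jumps))

# N1 + N2 - (n - 1) = 3 + 4 - 1 = 6 support points (the jump at u = 1
# is shared by both quantile functions and counted once).
print(len(support))  # -> 6
```

The count matches the worst-case bound discussed above, since 3 and 4 share no common divisor.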

An alternative approach is to first smooth each observation \(\widetilde \varLambda _i\) and then calculate the Fréchet mean of the smoothed measures. Since it is easy to bound Wasserstein distances when dealing with convolutions, we will employ kernel density estimation, although other smoothing approaches could be used as well.

Let *ψ* be a smooth probability density on \(\mathbb {R}^d\) of the form *ψ*(*x*) = *ψ*_{1}(∥*x*∥) with *ψ*_{1} nonincreasing; a canonical example is the standard Gaussian density in \(\mathbb {R}^d\). Define the rescaled version *ψ*_{σ}(*x*) = *σ*^{−d}*ψ*(*x*∕*σ*) for all *σ* > 0. We can then replace \(\widetilde \varLambda _i\) by a smooth proxy \(\widetilde \varLambda _i \ast \psi _\sigma \). If \(\widetilde \varLambda _i\) is a sum of Dirac masses at \(x_1,\dots ,x_{N_i}\), then \(\widetilde \varLambda _i \ast \psi _\sigma \) has density

$$\displaystyle x\mapsto \frac 1{N_i}\sum_{j=1}^{N_i}\psi_\sigma(x-x_j).$$

In the degenerate case *N*_{i} = 0, one can either use a fixed reference measure *λ*^{(0)} or its smoothed version *λ*^{(0)} ∗ *ψ*_{σ}; this event will have negligible probability anyway.

Since *Λ* is supported on a convex compact \(K\subset \mathbb {R}^d\), it is desirable to construct an estimator that has the same support *K*. The first idea that comes to mind is to project \(\widetilde \varLambda _i \ast \psi _\sigma \) to *K* (see Proposition 3.2.4), as this will further decrease the Wasserstein distance, but the resulting measure will then have positive mass on the boundary of *K*, and will not be absolutely continuous. We will therefore use a different strategy: eliminate all the mass outside *K* and redistribute it on *K*. The simplest way to do this is to restrict \(\widetilde \varLambda _i\ast \psi _\sigma \) to *K* and renormalise the restriction to be a probability measure. For technical reasons, it will be more convenient to bound the Wasserstein distance when the restriction and renormalisation is done separately on each point of \(\widetilde \varLambda _i\). This yields the measure

$$\displaystyle \widehat{\varLambda_i} = \frac 1{N_i}\sum_{j=1}^{N_i} \frac{\left.\psi_\sigma(\cdot - x_j)\right|{}_K}{C_j}, \qquad C_j=\int_K \! \psi_\sigma(y-x_j) \, \mathrm{d}y, \qquad\qquad (4.2)$$

with *C*_{j} the mass that the *j*-th bump places in *K*. It is apparent that \(\widehat {\varLambda _i}\) is a continuous function of \(\widetilde \varLambda _i\) and *σ*, so \(\widehat {\varLambda _i}\) is measurable; in any case this is not particularly important, because *σ* will vanish asymptotically, so \(\widehat {\varLambda _i}=\widetilde \varLambda _i\) in the limit and the latter is measurable.

The estimator \(\widehat \lambda _n\) of *λ* is then defined as the minimiser of the empirical Fréchet functional

$$\displaystyle \widehat F_n(\mu)=\frac 1{2n}\sum_{i=1}^n W_2^2\left(\mu,\widehat{\varLambda_i}\right), \qquad \mu\in\mathcal W_2(K).$$

We call \(\widehat \lambda _n\) the *regularised Fréchet–Wasserstein estimator*, where the regularisation comes from the smoothing and the possible restriction to *K*.
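The restrict-and-renormalise construction is simple to implement. The sketch below (a hypothetical point pattern, Gaussian kernel, and *K* = [0, 1] as assumptions) replaces each atom by a Gaussian bump truncated to *K* and renormalised separately; the result is an absolutely continuous probability measure on *K*.

```python
import numpy as np
from math import erf, sqrt

def smoothed_restricted_density(atoms, sigma, grid, K=(0.0, 1.0)):
    """Density of the proxy measure: convolve the empirical measure of
    `atoms` with a Gaussian kernel psi_sigma, restrict to K, and
    renormalise the contribution of each atom separately."""
    a, b = K
    dens = np.zeros_like(grid)
    for x in atoms:
        # Mass that the Gaussian bump centred at x places inside K.
        C = 0.5 * (erf((b - x) / (sigma * sqrt(2.0)))
                   - erf((a - x) / (sigma * sqrt(2.0))))
        bump = np.exp(-0.5 * ((grid - x) / sigma) ** 2) / (sigma * sqrt(2.0 * np.pi))
        dens += bump / C
    return dens / len(atoms)

atoms = [0.1, 0.4, 0.45, 0.9]            # hypothetical observed points in K
grid = np.linspace(0.0, 1.0, 20001)
f = smoothed_restricted_density(atoms, sigma=0.05, grid=grid)

# The result integrates to one over K (up to discretisation error).
total_mass = np.sum(f) * (grid[1] - grid[0])
print(total_mass)  # ~ 1
```

Renormalising each bump separately, rather than the whole smoothed measure at once, mirrors the construction above and makes the smoothing error easy to bound per atom.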

If \(\mathcal X=\mathbb {R}^d\) and *d* ≥ 2, then there is no explicit expression for \(\widehat \lambda _n\), although it exists and is unique. In the next chapter, we present a steepest descent algorithm that approximately constructs \(\widehat \lambda _n\) by taking advantage of the differentiability properties of the Fréchet functional \(\widehat F_n\) in Sect. 3.1.6.

### 4.3.4 Estimation of Warpings and Registration Maps

Once the estimators \(\widehat{\varLambda_i}\) for *i* = 1, …, *n* and \(\widehat \lambda _n\) are constructed, it is natural to estimate the map \(T_i={\mathbf t_{\lambda }^{\varLambda _i}}\) and its inverse \(T_i^{-1}={\mathbf t_{{\varLambda _i}}^{\lambda }}\) (when *Λ*_{i} are absolutely continuous; see the discussion after Assumptions 3 below) by the plug-in estimators \(\widehat{T_i}={\mathbf t_{\widehat\lambda_n}^{\widehat{\varLambda_i}}}\) and \(\widehat{T_i^{-1}}={\mathbf t_{\widehat{\varLambda_i}}^{\widehat\lambda_n}}\). The observed point patterns can then be registered, estimating *Π*_{i} via \(\widehat{\varPi_i}=\widehat{T_i^{-1}}\#\widetilde\varPi_i\). If the estimation is successful, then \(\widehat{T_i^{-1}}\circ T_i\) should be close to the identity and \(\widehat {\varPi _i}\) should be close to *Π*_{i}.

### 4.3.5 Unbiased Estimation When \(\mathcal X=\mathbb {R}\)

Let *H* be a separable Hilbert space (or a convex subset thereof) and suppose that \(\widehat \theta \) is a random element in *H* whose distribution *μ*_{θ} depends on a parameter *θ* ∈ *H*. Then \(\widehat {\theta }\) is *unbiased* for *θ* if \(\mathbb E\,\widehat\theta=\theta\) for all *θ* ∈ *H*, the expectation taken as a Bochner integral. In view of the embedding of \(\mathcal W_2(\mathbb R)\) into the Hilbert space of quantile functions, we thus seek an estimator *δ* = *δ*(*Λ*_{1}, …, *Λ*_{n}) for which the corresponding expectation equals the Fréchet mean *λ*.

Unbiased estimators allow us to avoid the problem of over-registering (the so-called pinching effect; Kneip and Ramsay [82, Section 2.4]; Marron et al. [90, p. 476]). An extreme example of over-registration is if one “aligns” all the observed patterns into a single fixed point *x*_{0}. The registration will then seem “successful” in the sense of having no residual phase variation, but the estimation is clearly biased because the points are not registered to the correct reference measure. Thus, requiring the estimator to be unbiased is an alternative to penalising the registration maps.

Due to the Hilbert space embedding of \(\mathcal W_2(\mathbb {R})\), it is possible to characterise unbiased estimators in terms of a simple condition on their quantile functions. As a corollary, *λ*_{n}, the Fréchet mean of {*Λ*_{1}, …, *Λ*_{n}}, is unbiased. Our regularised Fréchet–Wasserstein estimator \(\hat \lambda _n\) can then be interpreted as *approximately unbiased*, since it approximates the unobservable *λ*_{n}.

### Proposition 4.3.1 (Unbiased Estimators in \(\mathcal W_2(\mathbb {R})\))

*Let Λ be a random measure in* \(\mathcal W_2(\mathbb {R})\) *with finite Fréchet functional and let λ be the unique Fréchet mean of Λ (Theorem* *3.2.11*)*. An estimator δ constructed as a function of a sample* (*Λ*_{1}, …, *Λ*_{n}) *is unbiased for λ if and only if the left-continuous representatives (in L*_{2}(0, 1)*) satisfy* \(\mathbb {E} [F_\delta ^{-1}(x)]=F_\lambda ^{-1}(x)\) *for all x *∈ (0, 1).

### Proof

Under the embedding of \(\mathcal W_2(\mathbb R)\) into *L*_{2}(0, 1) via quantile functions, *δ* is unbiased if and only if \(\mathbb {E} [F_\delta ^{-1}]=F_\lambda ^{-1}\) as elements of *L*_{2}(0, 1); for monotone left-continuous representatives, equality in *L*_{2}(0, 1) is the same as pointwise equality on (0, 1). To see that *δ* = *λ*_{n} is unbiased, we simply invoke Theorem 3.2.11 twice: \(F_{\lambda_n}^{-1}=n^{-1}\sum_{i=1}^n F_{\varLambda_i}^{-1}\), so that \(\mathbb E [F_{\lambda_n}^{-1}]=\mathbb E [F_{\varLambda_1}^{-1}]=F_\lambda^{-1}\), establishing the unbiasedness of *δ*.
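On the real line, the Fréchet mean of finitely many measures is computed by averaging quantile functions (Sect. 3.1.4), which is what makes this unbiasedness transparent. A minimal sketch with hypothetical empirical samples having equally many atoms, where quantile averaging reduces to averaging order statistics:

```python
import numpy as np

def frechet_mean_1d(samples):
    """W_2-Fréchet mean of empirical measures with equally many atoms:
    sort each sample (its quantile function) and average pointwise."""
    sorted_samples = np.sort(np.asarray(samples, dtype=float), axis=1)
    return sorted_samples.mean(axis=0)

def w2_1d(x, y):
    """W_2 distance between two empirical measures with equally many
    atoms, via the quantile (order statistics) formula."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

# Three hypothetical samples: shifted copies of one another (pure phase).
base = np.array([0.0, 1.0, 2.0, 4.0])
samples = [base - 1.0, base, base + 1.0]
mean_atoms = frechet_mean_1d(samples)

# The Fréchet mean of pure translates is the average translate.
print(mean_atoms)                         # -> [0. 1. 2. 4.]
print(round(w2_1d(mean_atoms, base), 6))  # -> 0.0
```

If the random shifts have mean zero, the averaged quantile function has the correct expectation, which is exactly the unbiasedness criterion of Proposition 4.3.1.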

## 4.4 Consistency

In functional data analysis, one often assumes that the number of curves *n* and the number of observed points per curve *m* both diverge to infinity. An analogous framework for point processes would similarly require the number of point processes *n* as well as the expected number of points *τ* per process to diverge. A technical complication arises, however, because the mean measures do not suffice to characterise the distribution of the processes. Indeed, if one is given a point process *Π* with mean measure *λ* (not necessarily a probability measure), and *τ* is an integer, there is no unique way to define a process *Π*^{(τ)} with mean measure *τλ*. One can define *Π*^{(τ)} = *τΠ*, so that every point of *Π* is counted *τ* times. Such a construction, however, can never yield a consistent estimator of *λ*, even when *τ* →*∞*.

A more natural way to construct a process with mean measure *τλ* is to take a superposition of *τ* independent copies of *Π*. In symbols, this means \(\varPi^{(\tau)}=\varPi_1+\dots+\varPi_\tau\) with (*Π*_{i}) independent, each having the same distribution as *Π*. This superposition scheme gives the possibility to use the law of large numbers. If *τ* is not an integer, then this construction is not well-defined, but it can be made so by assuming that the distribution of *Π* is *infinitely divisible*. The reader willing to assume that *τ* is always an integer can safely skip to Sect. 4.4.1; all the main ideas are developed first for integer values of *τ* and then extended to the general case.

A point process *Π* is infinitely divisible if for every integer *m* there exists a collection of *m* independent and identically distributed \(\varPi _i^{(1/m)}\) such that \(\varPi\stackrel d=\varPi_1^{(1/m)}+\dots+\varPi_m^{(1/m)}\). If *Π* is infinitely divisible and *τ* = *k*∕*m* is rational, then one can define *Π*^{(τ)} as a superposition of *k* independent copies of *Π*^{(1∕m)}.

The definition extends to irrational *τ* via duality and continuity arguments, as follows. Define the *Laplace functional* of *Π* by \(L(f)=\mathbb E\exp \left (-{\int _{\mathcal X} \! f \, \mathrm {d}\varPi }\right )\) for nonnegative measurable functions *f*. Since *Π* = *Π*^{(1)} is infinitely divisible, the Laplace functional *L*_{1} of *Π* takes the form (Kallenberg [75, Chapter 6]; Karr [79, Theorem 1.43])

$$\displaystyle L_1(f)=\exp\left(-\int_{M_+(\mathcal X)}\left(1-e^{-\mu(f)}\right) \, \mathrm{d}\rho(\mu)\right)$$

for some measure *ρ* on \(M_+(\mathcal X)\). The Laplace functional of *Π*^{(τ)} is *L*_{τ}(*f*) = [*L*_{1}(*f*)]^{τ} for any rational *τ*, which simply amounts to multiplying the measure *ρ* by the scalar *τ*. One can then do the same for an irrational *τ*, and the resulting Laplace functional determines the distribution of *Π*^{(τ)} for all *τ* > 0.
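The superposition construction is straightforward to simulate. The sketch below (hypothetical parameters, seeded RNG) superposes *τ* independent copies of a Poisson process on [0, 1] with constant intensity; the superposition is again Poisson, with mean measure multiplied by *τ*, consistent with the identity *L*_{τ}(*f*) = [*L*_{1}(*f*)]^{τ}.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_process(intensity, rng):
    """Sample a Poisson process on [0, 1] with constant intensity."""
    n = rng.poisson(intensity)
    return rng.uniform(0.0, 1.0, size=n)

def superposition(intensity, tau, rng):
    """Superpose tau independent copies of the base process."""
    return np.concatenate([poisson_process(intensity, rng) for _ in range(tau)])

tau, intensity, reps = 7, 3.0, 5000
counts = [len(superposition(intensity, tau, rng)) for _ in range(reps)]

# The superposition is again Poisson, with mean (and variance) tau * intensity.
print(round(float(np.mean(counts)), 2))  # ~ 21
```

The same recipe underlies the asymptotic setup below, where each observed process is a superposition of *τ*_{n} copies of *Π*.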

### 4.4.1 Consistent Estimation of Fréchet Means

We are now ready to define our asymptotic setup. The following assumptions will be made. Notice that the Wasserstein geometry does not appear explicitly in these assumptions, but is rather *derived* from them in view of Theorem 4.2.4. The compactness requirement can be relaxed under further moment conditions on *λ* and the point process *Π*; we focus on the compact case for simplicity and because, in practice, the point patterns will be observed on a bounded observation window.

### Assumptions 3

*Let* \(K\subset \mathbb {R}^d\) *be a compact convex nonempty set, λ an absolutely continuous probability measure on K, and τ*_{n} *a sequence of positive numbers. Let Π be a point process on K with mean measure λ. Finally, define U *= int* K.*

- 1.
*For every n, let* \(\{\varPi _1^{(n)},\dots ,\varPi _n^{(n)}\}\) *be independent point processes, each having the same distribution as a superposition of τ*_{n} *copies of Π.* - 2.
*Let T be a random injective function on K (viewed as a random element in C*_{b}(*K*, *K*) *endowed with the supremum norm) such that T*(*x*) ∈ *U for x* ∈ *U (that is, T* ∈ *C*_{b}(*U*, *U*)*), with nonsingular derivative* \(\nabla T(x)\in \mathbb {R}^{d\times d}\) *for almost all x* ∈ *U, that is the gradient of a convex function. Let* {*T*_{1}, …, *T*_{n}} *be independent and identically distributed as T.* - 3.
*For every x* ∈ *U, assume that* \(\mathbb {E} [T(x)]=x\). - 4.
*Assume that the collections* \(\{T_n\}_{n=1}^\infty \) *and* \(\{\varPi _i^{(n)}\}_{i\le n,n=1,2,\dots }\) *are independent.* - 5.
*Let* \(\widetilde {\varPi }_i^{(n)}={T_i}\#\varPi _i^{(n)}\) *be the warped point processes, having conditional mean measures* \(\varLambda _i=T_i\#\lambda =\tau _n^{-1}\mathbb {E}\Big \{\widetilde {\varPi }_i^{(n)}\Big |T_i\Big \}\). - 6.
*Define* \(\widehat \varLambda _i\) *by the smoothing procedure* (4.2)*, using bandwidth* \(\sigma _i^{(n)}\in [0,1]\) *(possibly random).*

*The dependence of the estimators on n will sometimes be tacit, but Λ*_{i} *does not depend on n.*

By virtue of Theorem 4.2.4, *λ* is a Fréchet mean of the random measure *Λ* = *T#λ*. Uniqueness of this Fréchet mean will follow from Proposition 3.2.7 if we show that *Λ* is absolutely continuous with positive probability. This is indeed the case, since *T* is injective and has a nonsingular Jacobian matrix; see Ambrosio et al. [12, Lemma 5.5.3]. The Jacobian assumption can be removed when \(\mathcal X=\mathbb {R}\), because Fréchet means are always unique by Theorem 3.2.11.

Two sampling schemes are compatible with these assumptions:

- *Full independence*: here the point processes are independent across rows, that is, \(\varPi _i^{(n)}\) and \(\varPi _i^{(n+1)}\) are also independent.
- *Nested observations*: here \(\varPi _i^{(n+1)}\) includes the same points as \(\varPi _i^{(n)}\) and additional points, that is, \(\varPi _i^{(n+1)}\) is a superposition of \(\varPi _i^{(n)}\) and another point process distributed as (*τ*_{n+1} − *τ*_{n})*Π*.

Both schemes are well-defined when the *τ*_{n} are integers, as well as for infinitely divisible processes such as Poisson processes or, more generally, Poisson cluster processes.

We now state and prove the consistency result for the estimators of the conditional mean measures *Λ*_{i} and the structural mean measure *λ*.

### Theorem 4.4.1 (Consistency)

*If Assumptions* 3 *hold,* \(\sigma _n=n^{-1}\sum _{i=1}^n\sigma _i^{(n)}\to 0\) *almost surely and τ*_{n} →*∞ as n* →*∞, then:*

- 1.
*The estimators* \(\widehat{\varLambda_{i}}\) *defined by* (4.2)*, constructed with bandwidth* \(\sigma =\sigma _i^{(n)}\)*, are Wasserstein-consistent for the conditional mean measures: for all i such that* \(\sigma _i^{(n)}\stackrel {p}\to 0\)*,*$$\displaystyle \begin{aligned} W_2\left(\widehat{\varLambda_i},\varLambda_i\right) \stackrel{p}{\longrightarrow}0 ,\qquad \mathit{\mbox{ as }}n\to\infty; \end{aligned}$$
- 2.
*The regularised Fréchet–Wasserstein estimator of the structural mean measure (as described in Sect.* 4.3) *is strongly Wasserstein-consistent,*$$\displaystyle \begin{aligned} W_2(\widehat{\lambda}_n,\lambda)\stackrel{a.s.}{\longrightarrow}0, \qquad \mathit{\mbox{ as }}n\to\infty. \end{aligned}$$

*Convergence in 1. holds almost surely under the additional conditions that* \(\sum _{n=1}^\infty \tau _n^{-2} <\infty \) *and* \(\mathbb {E}\left [\varPi (\mathbb {R}^d)\right ]^4<\infty \)*. If σ*_{n} → 0 *only in probability, then convergence in 2. still holds in probability.*

Theorem 4.4.1 still holds without smoothing (*σ*_{n} = 0). In that case, \(\widehat {\lambda }_n=\tilde \lambda _n\) is possibly not unique, and the theorem should be interpreted in a set-valued sense (as in Proposition 1.7.8): almost surely, *any* choice of minimisers \(\tilde \lambda _n\) converges to *λ* as *n* →*∞*.

The preceding paragraph notwithstanding, we will usually assume that some smoothing *is* present, in which case \(\widehat {\lambda }_n\) is unique and absolutely continuous by Proposition 3.1.8. The uniform Lipschitz bounds for the objective function show that if we restrict the relevant measures to be absolutely continuous, then \(\widehat \lambda _n\) is a continuous function of \((\widehat {\varLambda _1},\dots ,\widehat {\varLambda _n})\) and hence \(\widehat \lambda _n:(\varOmega ,\mathcal F,{\mathbb {P}})\to \mathcal W_2(K)\) is measurable; this is again a minor issue because many arguments in the proof hold for each *ω* ∈ *Ω* separately. Thus, even if \(\widehat \lambda _n\) is not measurable, the proof shows that the convergence holds outer almost surely or in outer probability.

The first step in proving consistency is to show that the Wasserstein distance between the unsmoothed and the smoothed estimators of *Λ*_{i} vanishes with the smoothing parameter. The exact rate of decay will be important to later establish the rate of convergence of \(\widehat {\lambda }_n\) to *λ*, and is determined next.

### Lemma 4.4.2 (Smoothing Error)

*There exists a finite constant C*_{ψ,K}*, depending only on ψ and on K, such that for all σ* ∈ (0, 1]*,*

$$\displaystyle W_2\left(\widehat{\varLambda_i},\widetilde{\varLambda}_i\right) \le C_{\psi,K}\,\sigma.$$

Since the smoothing parameter will anyway vanish, this restriction to small values of *σ* is not binding. The constant *C*_{ψ,K} is explicit. When \(\mathcal X=\mathbb {R}\), a more refined construction allows one to improve this constant in some situations; see Panaretos and Zemel [100, Lemma 1].

### Proof

The idea is that (4.2) is a sum of measures with mass 1∕*N*_{i} that can all be sent to the relevant point *x*_{j}, and we refer to page 98 in the supplement for the precise details.

*Proof (Proof of Theorem* 4.4.1) The proof, detailed on page 97 of the supplement, proceeds in the following steps: firstly, one shows the convergence in probability of \(\widehat {\varLambda _i}\) to *Λ*_{i}. This is essentially a corollary of Karr [79, Proposition 4.8] and the smoothing bound from Lemma 4.4.2.

Secondly, one establishes the convergence of the Fréchet functionals. Since *K* is compact, these are all locally Lipschitz, so their differences can be controlled by the distances between *Λ*_{i}, \(\widetilde {\varLambda }_i\), and \(\widehat {\varLambda _i}\). The first distance vanishes since the intensity *τ* →*∞*, and the second by the smoothing bound. Another compactness argument yields that \(\widehat {F}_n\to F\) uniformly on \(\mathcal W_2(K)\), and so the minimisers converge. The almost sure statements are established first for a countable collection of values of *a*, then for all *a* by approximation. The smoothing error is again controlled by Lemma 4.4.2.

### 4.4.2 Consistency of Warp Functions and Inverses

We next discuss the consistency of the warp and registration function estimators. These are key to aligning the observed point patterns \(\widetilde \varPi _i\). Recall that we have consistent estimators \(\widehat {\varLambda _i}\) for *Λ*_{i} and \(\widehat \lambda _n\) for *λ*. Then \(T_i={\mathbf t_{\lambda }^{{\varLambda _i}}}\) is estimated by \({\mathbf t_{\widehat \lambda _n}^{\widehat {\varLambda _i}}}\) and \(T_i^{-1}\) is estimated by \({\mathbf t_{{\widehat {\varLambda _i}}}^{{\widehat {\lambda }_n}}}\). We will make the following extra assumptions that lead to more transparent statements (otherwise one needs to replace *K* with the set of Lebesgue points of the supports of *λ* and *Λ*_{i}).
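On the real line these plug-in maps have the explicit form \(\mathbf t_\lambda^{\varLambda}=F_\varLambda^{-1}\circ F_\lambda\), so registration amounts to composing an empirical quantile function with a distribution function. A minimal sketch with hypothetical data and equal atom counts (so the monotone matching of order statistics is exact):

```python
import numpy as np

def optimal_map_1d(source_atoms, target_atoms):
    """Monotone (optimal) map t from the empirical measure of `source_atoms`
    to that of `target_atoms`: t = G^{-1} o F, assuming equal atom counts."""
    src = np.sort(np.asarray(source_atoms, dtype=float))
    tgt = np.sort(np.asarray(target_atoms, dtype=float))
    def t(x):
        # F(x): rank of x among the source atoms; G^{-1}: matching order statistic.
        k = np.searchsorted(src, np.asarray(x), side="right")
        return tgt[np.clip(k - 1, 0, len(tgt) - 1)]
    return t

lam = np.array([0.0, 1.0, 2.0, 3.0])    # template atoms (hypothetical)
warped = 2.0 * lam + 0.5                 # a warped copy: T(x) = 2x + 0.5

T_hat = optimal_map_1d(lam, warped)      # plug-in estimate of the warp map
T_inv_hat = optimal_map_1d(warped, lam)  # plug-in estimate of the registration map

print(T_hat(lam))        # template atoms pushed onto the warped atoms
print(T_inv_hat(warped)) # warped atoms registered back onto the template
```

Here the composition \(\widehat{T^{-1}}\circ T\) acts as the identity on the atoms, which is the idealised version of the closeness-to-identity property discussed above.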

### Assumptions 4 (Strictly Positive Measures)

*In addition to Assumptions* 3*, suppose that:*

- 1.
*λ has a positive density on K (in particular,* supp *λ* = *K);* - 2.
*T is almost surely surjective on U *= int* K (thus a homeomorphism of U).*

As a consequence \(\mathrm {supp}\varLambda =\mathrm {supp} (T\#\lambda )=\overline {T(\mathrm {supp}\lambda )}=K\) almost surely.

### Theorem 4.4.3 (Consistency of Optimal Maps)

*Let Assumptions* 4 *be satisfied in addition to the hypotheses of Theorem* 4.4.1*. Then for any i such that* \(\sigma _i^{(n)}\stackrel p\to 0\) *and any compact set S* ⊆ int *K,*

$$\displaystyle \sup_{x\in S}\left\|\widehat{T_i}(x)-T_i(x)\right\| \stackrel{p}{\longrightarrow}0 \qquad \mathit{\mbox{and}} \qquad \sup_{x\in S}\left\|\widehat{T_i^{-1}}(x)-T_i^{-1}(x)\right\| \stackrel{p}{\longrightarrow}0, \qquad \mathit{\mbox{as }} n\to\infty.$$

*Almost sure convergence can be obtained under the same provisions made at the end of the statement of Theorem* 4.4.1.

A few technical remarks are in order. First and foremost, it is not clear that the two suprema are measurable. Even though *T*_{i} and \(T_i^{-1}\) are random elements in \(C_b(U,\mathbb {R}^d)\), their estimators are only defined in an *L*_{2} sense. The proof of Theorem 4.4.3 is done *ω*-wise. That is, for any *ω* in the probability space such that Theorem 4.4.1 holds, the two suprema vanish as *n* →*∞*. In other words, the convergence holds in outer probability or outer almost surely.

Secondly, assuming positive smoothing, the random measures \(\widehat {\varLambda _i}\) are smooth with densities bounded below on *K*, so \(\widehat {T_i^{-1}}\) are defined on the whole of *U* (possibly as set-valued functions on a *Λ*_{i}-null set). But the only known regularity result for \(\widehat \lambda _n\) is an upper bound on its density (Proposition 3.1.8), so it is unclear what its support is, and consequently what the domain of definition of \(\widehat {T_i}\) is.

Lastly, when the smoothing parameter *σ* is zero, \(\widehat {T_i}\) and \(\widehat {T_i^{-1}}\) are not defined. Nevertheless, Theorem 4.4.3 still holds in the set-valued formulation of Proposition 1.7.11, of which it is a rather simple corollary:

*Proof (Proof of Theorem* 4.4.3) The proof amounts to setting the scene in order to apply Proposition 1.7.11 on stability of optimal maps, with \(\mu_n=\widehat{\varLambda_i}\), *μ* = *Λ*_{i}, \(\nu_n=\widehat\lambda_n\), and *ν* = *λ*. The convergence of *μ*_{n} to *μ* and of *ν*_{n} to *ν* is the conclusion of Theorem 4.4.1; the finiteness is apparent because *K* is compact, and the uniqueness follows from the assumed absolute continuity of *Λ*_{i}. Since in addition \(T_i^{-1}\) is uniquely defined on *U* = int *K*, which is an open convex set, the restrictions on *Ω* in Proposition 1.7.11 are redundant. Uniform convergence of \(\widehat {T_i}\) to *T*_{i} is proven in the same way.

### Corollary 4.4.4 (Consistency of Point Pattern Registration)

*For any i such that* \(\sigma _i^{(n)}\stackrel p\to 0\)*,*

$$\displaystyle W_2\left(\frac{\widehat{\varPi_i^{(n)}}}{\varPi_i^{(n)}(K)}\,,\ \frac{\varPi_i^{(n)}}{\varPi_i^{(n)}(K)}\right) \stackrel{p}{\longrightarrow}0, \qquad \mathit{\mbox{as }} n\to\infty.$$

The division by the number of observed points ensures that the resulting measures are probability measures; the relevant information is contained in the point patterns themselves, and is invariant under this normalisation.

### Proof

Write \(N=\varPi_i^{(n)}(K)\), which is positive when *n* is large. Since \(\widehat {\varPi _i^{(n)}}=(\widehat {T_i^{-1}}\circ T_i)\#\varPi _i^{(n)}\), we have the upper bound

$$\displaystyle W_2^2\left(\frac{\widehat{\varPi_i^{(n)}}}N,\frac{\varPi_i^{(n)}}N\right) \le \frac 1N\int_K \! \left\|\widehat{T_i^{-1}}(T_i(x))-x\right\|{}^2 \, \mathrm{d}\varPi_i^{(n)}(x).$$

Let *Ω* ⊆ int *K* be compact and split the integral to *Ω* and its complement. The integrand on the complement is bounded by \(d_K^2\), where *d*_{K} is the diameter of *K*, and the proportion of points falling outside *Ω* can be made arbitrarily small by choice of *Ω*, by writing int *K* as a countable union of compact sets (and since *λ* is absolutely continuous). On *Ω*, the integrand is bounded by the supremum of \(\|\widehat{T_i^{-1}}(y)-T_i^{-1}(y)\|{}^2\) over *y* ∈ *T*_{i}(*Ω*), and *T*_{i}(*Ω*) is a compact subset of *U* = int *K*, because *T*_{i} ∈ *C*_{b}(*U*, *U*). The right-hand side therefore vanishes as *n* →*∞* by Theorem 4.4.3, and this completes the proof.

Possible extensions pertaining to the boundary of *K* are discussed on page 33 of the supplement.

## 4.5 Illustrative Examples

In this section, we illustrate the estimation framework put forth in this chapter by considering an example of a structural mean *λ* with a bimodal density on the real line. The unwarped point patterns *Π* originate from Poisson processes with mean measure *λ* and, consequently, the warped points \(\widetilde \varPi \) are Cox processes (see Sect. 4.1.2). Another scenario involving triangular densities can be found in Panaretos and Zemel [100].

### 4.5.1 Explicit Classes of Warp Maps

As a first step, we introduce a class of random warp maps satisfying Assumptions 2, that is, increasing maps that have as mean the identity function. The construction is a mixture version of similar maps considered by Wang and Gasser [128, 129].

For any integer *k*, define *ζ*_{k}: [0, 1] → [0, 1] such that *ζ*_{k}(0) = 0, *ζ*_{k}(1) = 1, *ζ*_{k} is smooth and strictly increasing for all *k*, and *ζ*_{−k}(*x*) + *ζ*_{k}(*x*) = 2*x* (with *ζ*_{0} the identity). Figure 4.4a plots *ζ*_{k} for *k* = −3, …, 3. To make *ζ*_{k} a random function, we let *k* be an integer-valued random variable. If the latter is symmetric, then \(\mathbb E[\zeta_k(x)]=x\) for all *x*.

Now let *J* > 1 be an integer and *V* = (*V*_{1}, …, *V*_{J}) be a random vector following the flat Dirichlet distribution (uniform on the set of nonnegative vectors with *v*_{1} + ⋯ + *v*_{J} = 1). Take independently *k*_{j} following the same distribution as *k* and define

$$\displaystyle T=\sum_{j=1}^J V_j\,\zeta_{k_j}. \qquad\qquad (4.5)$$

Since each *V*_{j} is positive, *T* is increasing, and as (*V*_{j}) sums up to unity, *T* has mean identity. Realisations of these warp functions are given in Fig. 4.4b and c for *J* = 2 and *J* = 10, respectively. The parameters (*k*_{j}) were chosen as symmetrised Poisson random variables: each *k*_{j} has the law of *XY* with *X* Poisson with mean 3 and \({\mathbb {P}}(Y=1)={\mathbb {P}}(Y=-1)=1/2\) for *Y* and *X* independent. When *J* = 10 is large, the function *T* deviates only mildly from the identity, since a law of large numbers begins to take effect. In contrast, *J* = 2 yields functions that are quite different from the identity. Thus, it can be said that the parameter *J* controls the variance of the random warp function *T*.
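A concrete family satisfying these constraints can be sketched as follows. The specific choice \(\zeta_k(x)=x-\sin(\pi k x)/(\pi|k|)\) is an assumption made for illustration; any smooth increasing family fixing 0 and 1 with \(\zeta_{-k}+\zeta_k=2\,\mathrm{id}\) would serve equally well.

```python
import numpy as np

rng = np.random.default_rng(1)

def zeta(k, x):
    """One choice of warp component: smooth, increasing, fixes 0 and 1,
    and satisfies zeta_{-k}(x) + zeta_k(x) = 2x (zeta_0 is the identity)."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return x
    return x - np.sin(np.pi * k * x) / (np.pi * abs(k))

def random_warp(J, rng):
    """T = sum_j V_j * zeta_{k_j}, with V flat-Dirichlet and k_j a
    symmetrised Poisson(3) variable, mimicking the mixture construction."""
    V = rng.dirichlet(np.ones(J))
    X = rng.poisson(3.0, size=J)
    Y = rng.choice([-1, 1], size=J)
    ks = X * Y
    return lambda x: sum(v * zeta(k, x) for v, k in zip(V, ks))

T = random_warp(J=10, rng=rng)
x = np.linspace(0.0, 1.0, 101)
y = T(x)

print(round(float(y[0]), 6), round(float(y[-1]), 6))  # fixes the endpoints: 0.0 1.0
print(bool(np.all(np.diff(y) >= 0)))                  # monotone: True
```

Increasing `J` averages more components and pulls `T` toward the identity, which is the variance-控制 role of *J* described above; replace "控制" with "controlling" — the parameter *J* controls how far `T` strays from the identity.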

### 4.5.2 Bimodal Cox Processes

Let *λ* be a mixture of a bimodal Gaussian distribution (restricted to *K* = [−16, 16]) and a beta background on the interval [−12, 12], so that mass is added at the centre of *K* but not near the boundary. In symbols, this is given as follows. Let *φ* be the standard Gaussian density and let *β*_{α,β} denote the density of the beta distribution with parameters *α* and *β*, rescaled to the interval [−12, 12]. Then *λ* is chosen as the measure with density

$$\displaystyle f(x)=\frac{1-\epsilon}2\left[\varphi(x-8)+\varphi(x+8)\right] + \epsilon\,\beta_{\alpha,\beta}(x), \qquad x\in K,$$

where *𝜖* ∈ [0, 1] is the weight of the beta background. (We ignore the loss of a negligible amount of mass due to the restriction of the Gaussians to [−16, 16].) Plots of the density and distribution functions are given in Fig. 4.5.

The main criterion for the quality of our regularised Fréchet–Wasserstein estimator will be its success in discerning the two modes at ± 8; these will be smeared by the phase variation arising from the warp functions.

We simulated independent Poisson processes with mean measure *τλ*, with *𝜖* = 0.1 and total intensity (expected number of points) *τ* = 93. In addition, we generated warp functions as in (4.5) but rescaled to [−16, 16]; that is, having the same law as the functions \(x\mapsto 32\,T\left((x+16)/32\right)-16\) mapping *K* to *K*. These cause rather violent phase variation, as can be seen from the plots of the densities and distribution functions of the conditional measures *Λ* = *T#λ* presented in Fig. 4.6a and b; the warped points themselves are displayed in Fig. 4.6c.

The regularised Fréchet–Wasserstein estimator \(\widehat{\lambda}_n\) of *λ* is shown in Fig. 4.7a. It is contrasted with *λ* at the level of distribution functions, as well as with the empirical arithmetic mean; the latter, the *naive estimator*, is calculated by ignoring the warping and simply averaging linearly the (smoothed) empirical distribution functions across the observations. We notice that \(\widehat {\lambda }_n\) is rather successful at locating the two modes of *λ*, in contrast with the naive estimator, which is more diffuse. In fact, the distribution function of the latter increases approximately linearly, suggesting a nearly constant density instead of the correct bimodal one.

Figure 4.7 also displays two kernel density estimators of *λ*, constructed from a superposition of all the warped points and of all the registered ones, respectively. Notice that the estimator that uses the registered points is much more successful than the one using the warped ones in discerning the two density peaks. This is not surprising after a brief look at Fig. 4.8, where the unwarped, warped, and registered points are displayed. Indeed, there is very high concentration of registered points around the true location of the peaks, ± 8. This is not the case for the warped points because of the phase variation that translates the centres of concentration for each individual observation. It is important to remark that the fluctuations in the density estimator in Fig. 4.7c are not related to the registration procedure, and could be reduced by a better choice of bandwidth. Indeed, our procedure does not attempt to estimate the density, but rather the distribution function.

To probe the convergence of \(\widehat\lambda_n\) to *λ* established in Theorem 4.4.1, we let the number of processes *n* as well as the expected number of observed points per process *τ* vary. Figures 4.10 and 4.11 show the sampling variation of \(\widehat \lambda _n\) for different values of *n* and *τ*. We observe that as either of these increases, the realisations \(\widehat \lambda _n\) indeed approach *λ*. The figures suggest that, in this scenario, the amplitude variation plays a stronger role than the phase variation, as the effect of *τ* is more substantial.

### 4.5.3 Effect of the Smoothing Parameter

The estimation procedure requires the choice of the smoothing parameters *σ*_{i}. The asymptotics (Theorem 4.4.1) guarantee the consistency of the estimators, in particular the regularised Fréchet–Wasserstein estimator \(\widehat \lambda _n\), provided that \(\max _{i=1}^n\sigma _i\to 0\). In our simulations, we choose *σ*_{i} in a data-driven way by employing unbiased cross-validation. To gauge the effect of the smoothing, we carry out the same estimation procedure but with *σ*_{i} multiplied by a parameter *s*. Figure 4.12 presents the distribution function of \(\widehat \lambda _n\) as a function of *s*. Interestingly, the curves are nearly identical as long as *s* ≤ 1, whereas when *s* > 1, the bias becomes more substantial.

The registered point patterns exhibit the same behaviour as a function of *s*. We see that only minor differences are present as *s* varies from 0.1 to 1, for example, in the grey (8), black (17), and green (19) processes. When *s* = 3, the distortion becomes substantially more pronounced. This phenomenon repeats itself across all combinations of *n*, *τ*, and *s* tested.

## 4.6 Convergence Rates and a Central Limit Theorem on the Real Line

Since the conditional mean measures *Λ*_{i} are discretely observed, the rate of convergence of our estimators will be affected by the rate at which the number of observed points per process \(N_i^{(n)}\) increases to infinity. The latter is controlled by the next lemma, which is valid for any complete separable metric space \(\mathcal X\).

### Lemma 4.6.1 (Number of Points Grows Linearly)

*Let* \(N_i^{(n)}=\varPi _i^{(n)}(\mathcal X)\) *denote the total number of observed points. If* \(\tau _n/\log n\to \infty \)*, then there exists a constant C*_{Π} > 0*, depending only on the distribution of Π, such that almost surely*

$$\displaystyle \liminf_{n\to\infty}\;\min_{1\le i\le n}\frac{N_i^{(n)}}{\tau_n}\ \ge\ C_\varPi.$$

*In particular, there are eventually no empty point processes, so the normalisation is well-defined. If Π is a Poisson process, then we have the more precise result*

$$\displaystyle \min_{1\le i\le n}\frac{N_i^{(n)}}{\tau_n}\ \longrightarrow\ 1 \qquad \mathit{\mbox{almost surely}}.$$

### Remark 4.6.2

*One can also show that the limit superior of the same quantity is bounded by a constant* \(C_\varPi ^{\prime }\)*. If* \(\tau _n/\log n\) *is bounded below, then the same result holds but with worse constants. If only τ*_{n} →*∞, then the result holds for each i separately but in probability.*

The proof is a simple application of Chernoff bounds; see page 108 in the supplement.

We can now quantify the rate of convergence of \(\widehat\lambda_n\) in terms of *n* and *τ*_{n}. As in the consistency proof, the idea is to decompose the error into three terms. The first term reflects the phase variation and stems from replacing the population Fréchet functional *F* by a sample mean version *F*_{n}. The second term is associated with the amplitude variation resulting from observing *Λ*_{i} discretely. The third term can be viewed as the bias incurred by the smoothing procedure. Accordingly, the rate at which \(\widehat \lambda _n\) converges to *λ* is a sum of three separate terms. We recall the standard \(O_{\mathbb {P}}\) terminology: if *X*_{n} and *Y*_{n} are random variables, then \(X_n=O_{\mathbb {P}}(Y_n)\) means that the sequence (*X*_{n}∕*Y*_{n}) is *bounded in probability*, which by definition is the condition

$$\displaystyle \lim_{M\to\infty}\;\sup_{n}\;{\mathbb{P}}\left(|X_n/Y_n|>M\right)=0.$$

This notation stresses that *X*_{n} grows no faster than *Y*_{n} or, equivalently, that *Y*_{n} grows at least as fast as *X*_{n} (which is of course the same assertion). Finally, \(X_n=o_{\mathbb {P}}(Y_n)\) means that *X*_{n}∕*Y*_{n} → 0 in probability.

### Theorem 4.6.3 (Convergence Rates on \(\mathbb {R}\))

*Suppose, in addition to Assumptions* 3*, that d *= 1*,* \(\tau _n/\log n\to \infty \)*, and that Π is either a Poisson process or a binomial process. Then*

$$\displaystyle W_2\left(\widehat\lambda_n,\lambda\right) = O_{\mathbb{P}}\left(\frac1{\sqrt n}\right) + O_{\mathbb{P}}\left(\frac1{\sqrt[4]{\tau_n}}\right) + O_{\mathbb{P}}(\sigma_n),$$

*where all the constants in the* \(O_{\mathbb {P}}\) *terms are explicit.*

### Remark 4.6.4

*Unlike classical density estimation, no assumptions on the rate of decay of σ*_{n} *are required, because we only need to estimate the distribution function and not the derivative. If the smoothing parameter is chosen to be* \(\sigma _i^{(n)}=[N_i^{(n)}]^{-\alpha }\) *for some α *> 0 *and* \(\tau _n/\log n\to \infty \)*, then by Lemma* 4.6.1 \(\sigma _n\le \max _{1\le i\le n}\sigma _i^{(n)}=O_{\mathbb {P}}(\tau _n^{-\alpha })\)*. For example, if* Rosenblatt’s rule *α *= 1∕5 *is employed, then the* \(O_{\mathbb {P}}(\sigma _n)\) *term can be replaced by* \(O_{\mathbb {P}}(1/\sqrt [5]{\tau _n})\).

One can think about the parameter *τ* as separating the *sparse* and *dense* regimes as in classical functional data analysis (see also Wu et al. [132]). If *τ* is bounded, then the setting is *ultra sparse* and consistency cannot be achieved. A sparse regime can be defined as the case where *τ*_{n} →*∞* but slower than \(\log n\). In that case, consistency is guaranteed, but some point patterns will be empty. The *dense* regime can be defined as *τ*_{n} ≫ *n*^{2}, in which case the amplitude variation is negligible asymptotically when compared with the phase variation.

The exponent − 1∕4 of *τ*_{n} can be shown to be optimal without further assumptions, but it can be improved to − 1∕2 if \({\mathbb {P}}(f_\varLambda \ge \epsilon \mbox{ on }K)=1\) for some *𝜖* > 0, where *f*_{Λ} is the density of *Λ* (see Sect. 4.7). In terms of *T*, the condition is that \({\mathbb {P}}(T'\ge \epsilon )=1\) for some *𝜖* and that *λ* has a density bounded below. When this is the case, *τ*_{n} needs to be compared with *n* rather than *n*^{2} in the next paragraph and the next theorem.

Theorem 4.6.3 provides conditions for the optimal parametric rate \(\sqrt {n}\) to be achieved: this happens if we set *σ*_{n} to be of the order \(O_{\mathbb {P}}(n^{-1/2})\) or less and if *τ*_{n} is of the order *n*^{2} or more. But if the last two terms in Theorem 4.6.3 are *negligible* with respect to *n*^{−1∕2}, then a sort of *central limit theorem* holds for \(\widehat \lambda _n\):

### Theorem 4.6.5 (Asymptotic Normality)

*In addition to the conditions of Theorem* 4.6.3*, assume that τ*_{n}∕*n*^{2} →*∞,* \(\sigma _n=o_{{\mathbb {P}}}(n^{-1/2})\)*, and that λ possesses an invertible distribution function F*_{λ} *on K. Then*

$$\displaystyle \sqrt n\,\log_\lambda\left(\widehat\lambda_n\right) = \sqrt n\left(\mathbf t_\lambda^{\widehat\lambda_n}-\mathbf i\right) \ \stackrel{d}{\longrightarrow}\ Z \qquad \mathit{\mbox{in }} L_2(\lambda),$$

*for a zero-mean Gaussian process Z with the same covariance operator as that of T (the latter viewed as a random element in L*_{2}(*λ*)*), namely with covariance kernel*

$$\displaystyle \kappa(x,y)=\mathbb{E}\left[\left(T(x)-x\right)\left(T(y)-y\right)\right], \qquad x,y\in K.$$

*If the density f*_{λ} *exists and is (piecewise) continuous and bounded below on K, then the weak convergence also holds in L*_{2}(*K*).

In view of Sect. 2.3, Theorem 4.6.5 can be interpreted as asymptotic normality of \(\widehat \lambda _n\) in the *tangential* sense: \(\sqrt {n} \log _\lambda (\widehat \lambda _n)\) converges to a Gaussian random element in the tangent space Tan_{λ}, which is a subset of *L*_{2}(*λ*). The additional smoothness conditions allow one to switch to the space *L*_{2}(*K*), which is independent of the unknown template measure *λ*.

See pages 109 and 110 in the supplement for detailed proofs of these theorems. Below we sketch the main ideas only.

*Proof(Proof of Theorem* 4.6.3)

The quantile formula \(W_2(\gamma ,\theta )=\|F^{-1}_\theta - F^{-1}_\gamma \|{ }_{L^2(0,1)}\) from Sect. 1.5 and the average quantile formula for the Fréchet mean (Sect. 3.1.4) show that the oracle empirical mean \(F_{\lambda _n}^{-1}\) follows a central limit theorem in *L*_{2}(0, 1). Since we work in the Hilbert space *L*_{2}(0, 1), Fréchet means are simple averages, so the errors in the Fréchet mean have the same rate as the errors in the Fréchet functionals. The smoothing term is easily controlled by Lemma 4.4.2.

Controlling the amplitude term is more difficult. Bounds can be given using the machinery sketched in Sect. 4.7, but we give a more elementary proof by reducing to the 1-Wasserstein case (using ( 2.2)), which can be handled more easily in terms of distribution functions (Corollary 1.5.3).

*Proof (of Theorem 4.6.5)* The hypotheses guarantee that the amplitude and smoothing errors are negligible, and that

$$\sqrt{n}\,\big(F_{\widehat\lambda_n}^{-1} - F_{\lambda}^{-1}\big) \;\xrightarrow{d}\; GP \qquad \text{in } L_2(0,1),$$

where *GP* is the Gaussian process defined in the proof of Theorem 4.6.3. One then employs a composition with *F*_{λ}.

## 4.7 Convergence of the Empirical Measure and Optimality

One may find the term \(O_{\mathbb {P}}(1/\sqrt [4]{\tau _n})\) in Theorem 4.6.3 somewhat surprising, and expect that it ought to be \(O_{\mathbb {P}}(1/\sqrt {\tau _n})\). The goal of this section is to show why the rate \(1/\sqrt [4]{\tau _n}\) cannot be improved without further assumptions, and to discuss conditions under which it improves to the parametric rate \(1/\sqrt {\tau _n}\). For simplicity, we concentrate on the case *τ*_{n} = *n* and assume that the point process *Π* is binomial; the Poisson case is then easily obtained from the binomial one (using Lemma 4.6.1). We are thus led to study rates of convergence of empirical measures in the Wasserstein space. That is to say, for a fixed exponent *p* ≥ 1 and a fixed measure \(\mu \in \mathcal W_p(\mathcal X)\), we consider independent random variables *X*_{1}, *X*_{2}, … with law *μ* and the *empirical measure* \(\mu _n=n^{-1}\sum _{i=1}^n\delta _{X_i}\). The first observation is that \(\mathbb {E} W_p(\mu ,\mu _n)\to 0\):

### Lemma 4.7.1

*Let* \(\mu \in \mathcal W_p(\mathcal X)\)*. Then* \(\mathbb {E} W_p(\mu ,\mu _n)\to 0\)*.*

### Proof

The random variable \(Y_n=W_p^p(\mu ,\mu _n)\) vanishes almost surely as *n* →*∞*, and is dominated by the empirical mean *Z*_{n} of a random variable \(V={\int _{\mathcal X} \! \|x-X_1\|{ }^p \, \mathrm {d}\mu (x)}\) that has a finite expectation. A version of the dominated convergence theorem (given on page 111 in the supplement) implies that \(\mathbb {E} Y_n\to 0\). Now invoke Jensen’s inequality.

### Remark 4.7.2

*The sequence* \(\mathbb {E} W_p(\mu ,\mu _n)\) *is not monotone in n, as the simple example μ* = (*δ*_{0} + *δ*_{1})∕2 *shows (see page 111 in the supplement).*

The next question is how quickly \(\mathbb {E} W_p(\mu ,\mu _n)\) vanishes when \(\mu \in \mathcal W_p(\mathcal X)\). We shall begin with two simple general lower bounds, then discuss upper bounds in the one-dimensional case, put them in the context of Theorem 4.6.3, and finally briefly touch on the *d*-dimensional case.

### Lemma 4.7.3 (\(\sqrt {n}\) Lower Bound)

*Let* \(\mu \in P(\mathcal X)\) *be nondegenerate. Then there exists a constant c*(*μ*) > 0 *such that for all p* ≥ 1 *and all n*

$$\mathbb{E}\,W_p(\mu_n,\mu) \;\ge\; \frac{c(\mu)}{\sqrt{n}}.$$

### Proof

Let *X* ∼ *μ* and let *a* ≠ *b* be two points in the support of *μ*. Consider \(f(x)=\min (1,\|x-a\|)\), a bounded 1-Lipschitz function such that *f*(*a*) = 0 < *f*(*b*), so that Var *f*(*X*) > 0. Then

$$\mathbb{E}\,W_p(\mu_n,\mu) \;\ge\; \mathbb{E}\,W_1(\mu_n,\mu) \;\ge\; \mathbb{E}\left|\int_{\mathcal X} f\,\mathrm d\mu_n - \int_{\mathcal X} f\,\mathrm d\mu\right| \;\ge\; \frac{c(\mu)}{\sqrt{n}}$$

by the central limit theorem and the Kantorovich–Rubinstein theorem ( 1.11).

For discrete measures, the rates scale badly with *p*. More generally:

### Lemma 4.7.4 (Separated Support)

*Suppose that there exist Borel sets* \(A,B\subset \mathcal X\) *such that μ*(*A* ∪ *B*) = 1*, μ*(*A*)*μ*(*B*) > 0*, and*

$$d_{\min} \;:=\; \inf_{x\in A,\;y\in B}\|x-y\| \;>\; 0.$$

*Then for any p* ≥ 1 *there exists c*_{p}(*μ*) > 0 *such that* \(\mathbb {E} W_p(\mu _n,\mu )\ge c_p(\mu )n^{-1/(2p)}\).

Any nondegenerate finitely discrete measure *μ* satisfies this condition, and so do “non-pathological” countably discrete ones. (An example of a “pathological” measure is one assigning positive mass to any rational number.)

### Proof

Let *k* ∼ *B*(*n*, *q* = *μ*(*A*)) denote the number of points from the sample (*X*_{1}, …, *X*_{n}) that fall in *A*. Then a mass of |*k*∕*n* − *q*| must travel between *A* and *B*, a distance of at least \(d_{\min }\). Thus, \(W_p^p(\mu _n,\mu )\ge d_{\min }^p|k/n-q|\), and the result follows from the central limit theorem for *k*; see page 112 in the supplement for the full details.
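The slow *n*^{−1∕(2p)} rate of this lemma can be watched in a simulation, since for *μ* = (*δ*_{0} + *δ*_{1})∕2 the Wasserstein distance to the empirical measure is available in closed form. A Python sketch (illustrative only; the names and the Monte Carlo design are ours, not from the text):

```python
import random
from statistics import fmean

def w2_two_atoms(n, rng):
    """Exact W_2(mu_n, mu) for mu = (delta_0 + delta_1)/2.  If k of the
    n points fall at 0, a mass of |k/n - 1/2| must travel distance 1,
    so W_2^2(mu_n, mu) = |k/n - 1/2|."""
    k = sum(rng.random() < 0.5 for _ in range(n))
    return abs(k / n - 0.5) ** 0.5

rng = random.Random(0)
mean_w2 = lambda n, reps=500: fmean(w2_two_atoms(n, rng) for _ in range(reps))

# Rate n^{-1/(2p)} = n^{-1/4} for p = 2: multiplying n by 16 only
# divides the mean distance by about 2, not by 4.
assert mean_w2(1600) < mean_w2(100)
```

The square root in `w2_two_atoms` is precisely where the rate degrades: the binomial fluctuation |*k*∕*n* − 1∕2| is of order *n*^{−1∕2}, but *W*_{2} is its *p*-th root.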

In \(\mathbb R\), one can see how to expect the rate *n*^{−1∕2} to be attained for *W*_{1}: since *nF*_{n}(*t*) ∼ *B*(*n*, *F*(*t*)), the variance of *F*_{n}(*t*) is *F*(*t*)(1 − *F*(*t*))∕*n*, and we have (by Fubini’s theorem and Jensen’s inequality)

$$\mathbb{E}\,W_1(\mu_n,\mu) \;=\; \mathbb{E}\int_{\mathbb R}|F_n(t)-F(t)|\,\mathrm dt \;\le\; \frac{1}{\sqrt n}\int_{\mathbb R}\sqrt{F(t)(1-F(t))}\,\mathrm dt \;=\; \frac{J_1(\mu)}{\sqrt n}.$$

In particular, *W*_{1}(*μ*_{n}, *μ*) is of the optimal order *n*^{−1∕2} if *μ* is compactly supported. The *J*_{1} condition is essentially a moment condition, since for any *δ* > 0, we have for *X* ∼ *μ* that \(\mathbb {E}|X|{ }^{2+\delta }<\infty \Longrightarrow J_1(\mu )<\infty \Longrightarrow \mathbb {E}|X|{ }^2<\infty \). It turns out that this condition is necessary, and it has a more subtle counterpart for any *p* ≥ 1. Let *f* denote the density of the absolutely continuous part of *μ* (so *f* ≡ 0 if *μ* is discrete).
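The closed form *W*_{1}(*μ*_{n}, *μ*) = ∫|*F*_{n} − *F*| makes the *n*^{−1∕2} rate easy to verify by simulation. A minimal Python sketch (illustrative only; the uniform choice and the names are ours, not from the text):

```python
import random
from statistics import fmean

def w1_to_uniform(sample):
    """Exact W_1(mu_n, Unif[0,1]) = int_0^1 |F_n(t) - t| dt.
    F_n equals i/n between consecutive order statistics, and the
    antiderivative of |t - c| is (t - c)|t - c| / 2."""
    xs = sorted(sample)
    n = len(xs)
    knots = [0.0] + xs + [1.0]
    total = 0.0
    for i in range(n + 1):
        a, b, c = knots[i], knots[i + 1], i / n
        total += ((b - c) * abs(b - c) - (a - c) * abs(a - c)) / 2
    return total

rng = random.Random(0)
mean_w1 = lambda n, reps=300: fmean(
    w1_to_uniform([rng.random() for _ in range(n)]) for _ in range(reps)
)

# For the uniform law J_1 = pi/8 is finite, and E W_1 is of order
# n^{-1/2}: quadrupling n roughly halves the mean distance.
assert mean_w1(400) < mean_w1(100) < mean_w1(25)
```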

### Theorem 4.7.5 (Rate of Convergence of Empirical Measures)

*Let p* ≥ 1 *and* \(\mu \in \mathcal W_p(\mathbb {R})\)*. The condition*

$$J_p(\mu) \;=\; \int_{\mathbb R}\frac{\big[F(t)(1-F(t))\big]^{p/2}}{f(t)^{p-1}}\,\mathrm dt \;<\; \infty$$

*is necessary and sufficient for* \(\mathbb {E} W_p(\mu _n,\mu )=O(n^{-1/2})\).

See Bobkov and Ledoux [25, Theorem 5.10] for a proof for the *J*_{p} condition, and Theorems 5.1 and 5.3 for the values of the constants and a stronger result.

When *p* > 1, for *J*_{p}(*μ*) to be finite, the support of *μ* must be connected; this is not needed when *p* = 1. Moreover, the *J*_{p} condition is satisfied when *f* is bounded below (in which case the support of *μ* must be compact). However, smoothness alone does not suffice, even for measures with positive density on a compact support. More precisely, we have:

### Proposition 4.7.6

*For any rate 𝜖*_{n} → 0 *there exists a measure μ on* [−1, 1] *with positive C*^{∞} *density there, and such that for all n*

$$\mathbb{E}\,W_p(\mu_n,\mu) \;\ge\; \epsilon_n\, n^{-1/(2p)}.$$

The rate *n*^{−1∕(2p)} from Lemma 4.7.4 is the worst possible among compactly supported measures on \(\mathbb {R}\). Indeed, by Jensen’s inequality and ( 2.2), for any *μ* ∈ *P*([0, 1]),

$$\mathbb{E}\,W_p(\mu_n,\mu) \;\le\; \big[\mathbb{E}\,W_p^p(\mu_n,\mu)\big]^{1/p} \;\le\; \big[\mathbb{E}\,W_1(\mu_n,\mu)\big]^{1/p} \;\le\; \Big(\frac{1}{2\sqrt n}\Big)^{1/p}.$$

In the setting of Theorem 4.6.3, where the *Λ*_{i} are independent, this bound applied with *p* = 2 yields the amplitude term \(O_{\mathbb {P}}(1/\sqrt [4]{\tau _n})\). By Theorem 4.7.5, the rate improves to \(1/\sqrt {\tau _n}\) if *J*_{2}(*Λ*) is almost surely bounded; for instance, if there exists *δ* > 0 such that with probability one *Λ* has a density bounded below by *δ*. Since *Λ* = *T*#*λ*, this will happen provided that *λ* itself has a bounded below density and *T* has a bounded below derivative. Bigot et al. [23] show that the rate \(\sqrt {\tau _n}\) cannot be improved.

We conclude by proving a lower bound for absolutely continuous measures and stating, without proof, an upper bound.

### Proposition 4.7.7

*Let*\(\mu \in \mathcal W_1(\mathbb {R}^d)\)

*have an absolutely continuous part with respect to Lebesgue measure, and let ν*

_{n}

*be any discrete measure supported on n points (or less). Then there exists a constant C*(

*μ*) > 0

*such that*

### Proof

Let *f* be the density of the absolutely continuous part *μ*_{c} of *μ*, and observe that for some finite number *M* and some *δ* > 0,

$$\mu_c\big(\{x: f(x)\le M\}\big) \;\ge\; 2\delta.$$

Let *x*_{1}, …, *x*_{n} be the support points of *ν*_{n} and *𝜖* > 0, and let *μ*_{c,M} be the restriction of *μ*_{c} to the set where the density is smaller than *M*. The union of the balls *B*_{𝜖}(*x*_{i}) has *μ*_{c,M}-measure of at most *nMC*_{d}*𝜖*^{d}, where *C*_{d} denotes the volume of the unit ball of \(\mathbb R^d\); this equals *δ* upon choosing *𝜖*^{d} = *δ*(*nMC*_{d})^{−1}. Thus, a mass of 2*δ* − *δ* = *δ* must travel more than *𝜖* from *ν*_{n} to *μ* in order to cover *μ*_{c,M}. Hence

$$W_1(\mu,\nu_n) \;\ge\; \delta\epsilon \;=\; \delta^{1+1/d}(nMC_d)^{-1/d} \;=\; C(\mu)\,n^{-1/d}.$$

This computation suggests that what matters is how many balls of radius *𝜖* (roughly *𝜖*^{−d} of them) are needed in order to cover a sufficiently large fraction of the mass of *μ*. The determining quantity for *upper* bounds on the empirical Wasserstein distance is the *covering numbers*

$$N(\mu,\epsilon,\tau) \;=\; \min\Big\{k\ge1 \,:\, \exists\, x_1,\dots,x_k\in\mathcal X \text{ such that } \mu\Big(\bigcup_{i=1}^k B_\epsilon(x_i)\Big)\ge 1-\tau\Big\}.$$

Since *μ* is tight, these are finite for all *𝜖*, *τ* > 0, and they increase as *𝜖* and *τ* approach zero. To put the following bound in context, notice that if *μ* is compactly supported on \(\mathbb {R}^d\), then *N*(*μ*, *𝜖*, 0) ≤ *K𝜖*^{−d}.
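The bound *N*(*μ*, *𝜖*, 0) ≤ *K𝜖*^{−d} for compactly supported measures follows from a grid construction, which can be made explicit. A minimal Python sketch (illustrative only; the function name and the crude constant are ours, not from the text):

```python
import math

def covering_bound_cube(d, eps):
    """A crude upper bound on N(mu, eps, 0) for any mu supported on
    [0,1]^d: partition the cube into subcubes of side 2*eps/sqrt(d)
    (diameter at most 2*eps), each of which is contained in a ball of
    radius eps centred at the subcube's midpoint."""
    side = 2 * eps / math.sqrt(d)
    return math.ceil(1 / side) ** d

# The bound scales like K * eps^{-d}: halving eps in dimension d = 2
# multiplies it by roughly 2^d = 4.
n1, n2 = covering_bound_cube(2, 0.1), covering_bound_cube(2, 0.05)
assert 3 <= n2 / n1 <= 6
```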

### Theorem 4.7.8

*If for some d* > 2*p we have N*(*μ*, *𝜖*, *𝜖*^{dp∕(d−2p)}) ≤ *K𝜖*^{−d}*, then* \(\mathbb {E}\, W_p(\mu_n,\mu) \le C_p\, n^{-1/d}\).

Comparing this with the lower bound in Lemma 4.7.4, we see that in the high-dimensional regime *d* > 2*p*, absolutely continuous measures have a worse rate than discrete ones; in the low-dimensional regime *d* < 2*p*, the situation is reversed. We also obtain that for *d* > 2 and a compactly supported absolutely continuous \(\mu \in \mathcal W_1(\mathbb {R}^d)\), \(\mathbb {E} W_1(\mu _n,\mu )\sim n^{-1/d}\).

## 4.8 Bibliographical Notes

Our exposition in this chapter closely follows the papers Panaretos and Zemel [100] and Zemel and Panaretos [134].

Books on functional data analysis include Ramsay and Silverman [109, 110], Ferraty and Vieu [51], Horváth and Kokoszka [70], and Hsing and Eubank [71], and a recent review is also available (Wang et al. [127]). The specific topic of amplitude and phase variation is discussed in [110, Chapter 7] and [127, Section 5.2]. The next paragraph gives some selective references.

One of the first functional registration techniques employed dynamic programming (Wang and Gasser [128]) and dates back to Sakoe and Chiba [118]. Landmark registration consists of identifying salient features for each curve, called *landmarks*, and aligning them (Gasser and Kneip [61]; Gervini and Gasser [63]). In pairwise synchronisation (Tang and Müller [122]) one aligns each pair of curves and then derives an estimator of the warp functions by linear averaging of the pairwise registration maps. Another class of methods involves a template curve, to which each observation is registered, minimising a discrepancy criterion; the template is then iteratively updated (Wang and Gasser [129]; Ramsay and Li [108]). James [72] defines a “feature function” for each curve and uses the moments of the feature function to guarantee identifiability. Elastic registration employs the Fisher–Rao metric that is invariant to warpings and calculates averages in the resulting quotient space (Tucker et al. [123]). Other techniques include semiparametric modelling (Rønn [115]; Gervini and Gasser [64]) and principal components registration (Kneip and Ramsay [82]). More details can be found in the review article by Marron et al. [90]. Wrobel et al. [131] have recently developed a registration method for functional data with a discrete flavour. It is also noteworthy that a version of the Wasserstein metric can be used in the functional case (Chakraborty and Panaretos [34]).

The literature on the point process case is scarcer; see the review by Wu and Srivastava [133].

A parametric version of Theorem 4.2.4 was first established by Bigot and Klein [22, Theorem 5.1] in \(\mathbb {R}^d\), and extended to a compact nonparametric formulation in Zemel and Panaretos [134]. There is an infinite-dimensional linear version in Masarotto et al. [91]. The current level of generality appears to be new.

Theorem 4.4.1 is a stronger version of Panaretos and Zemel [100, Theorem 1], where it was assumed that *τ*_{n} diverges to infinity faster than \(\log n\). An analogous construction under the Bayesian paradigm can be found in Galasso et al. [58]. Optimality of the rates of convergence in Theorem 4.6.3 is discussed in detail by Bigot et al. [23], where finiteness of the functional *J*_{2} (see Sect. 4.7) is assumed and consequently \(O_{\mathbb {P}}(\tau _n^{-1/4})\) is improved to \(O_{\mathbb {P}}(\tau _n^{-1/2})\).

As far as we know, Theorem 4.6.5 (taken from [100]) is the first central limit theorem for Fréchet means in Wasserstein space. When the measures *Λ*_{i} are observed exactly (no amplitude variation: *τ*_{n} = *∞* and *σ* = 0) Kroshnin et al. [84] have recently proven a central limit theorem for random Gaussian measures in arbitrary dimension, extending a previous result of Agueh and Carlier [3]. It seems likely that in a fully nonparametric setting, the rates of convergence (compare Theorem 4.6.3) might be slower than \(\sqrt {n}\); see Ahidar-Coutrix et al. [4].

The magnitude of the amplitude variation in Theorem 4.6.3 pertains to the rates of convergence of \(\mathbb {E} W_p(\mu _n,\mu )\) to zero (Sect. 4.7). This is a topic of intense research, dating back to the seminal paper by Dudley [46], where a version of Theorem 4.7.8 with *p* = 1 is shown for the bounded Lipschitz metric. The lower bounds proven in this section were adapted from [46], Fournier and Guillin [54], and Weed and Bach [130].

The version of Theorem 4.7.8 given here can be found in [130] and extends Boissard and Le Gouic [27]. Both papers [27, 130] work in a general setting of complete separable metric spaces. An additional \(\log n\) term appears in the limiting case *d* = 2*p*, as already noted (for *p* = 1) by [46] and in the classical work of Ajtai et al. [5] for *μ* uniform on [0, 1]^{2}. More general results are available in [54]. A longer (but far from complete) bibliography is given in the recent review by Panaretos and Zemel [101, Subsection 3.3.1], including works by Barthe, Dobrić, Talagrand, and coauthors on almost sure results and deviation bounds for the empirical Wasserstein distance.

The *J*_{1} condition is due to del Barrio et al. [43], who showed it to be necessary and sufficient for the empirical process \(\sqrt {n}(F_n - F)\) to converge in distribution to \(\mathbb {B}\circ F\), with \(\mathbb B\) Brownian bridge. The extension to 1 ≤ *p* ≤*∞* (and a lot more) can be found in Bobkov and Ledoux [25], employing order statistics and beta distributions to reduce to the uniform case. Alternatively, one may consult Mason [92], who uses weighted approximations to Brownian bridges.

An important aspect that was not covered here is that of statistical inference of the Wasserstein distance on the basis of the empirical measure. This is a challenging question and results by del Barrio, Munk, and coauthors are available for one-dimensional, elliptical, or discrete measures, as explained in [101, Section 3].

## Footnotes

- 1. For instance, the arithmetic average of two scalar Gaussians *N*(*μ*_{1}, 1) and *N*(*μ*_{2}, 1) will be their mixture with equal weights, but their Fréchet–Wasserstein average will be the Gaussian \(N(\frac {1}{2}\mu _1+\frac {1}{2}\mu _2,1)\) (see Lemma 4.2.1), which is arguably more representative from an intuitive point of view. In much the same way, the Fréchet–Wasserstein average of probability measures representing some type of object (e.g., normalised greyscale images of faces) will also be an object of the same type. This sort of phenomenon is well known in manifold statistics more generally, and is arguably one of the key motivations to account for the non-linear geometry of the sample space, rather than embed it into a larger linear space and use the addition operation.
- 2. As the functional case will only serve as a motivation, our treatment of this case will mostly be heuristic and superficial. Rigorous proofs and more precise details can be found in the books by Ferraty and Vieu [51], Horváth and Kokoszka [70], or Hsing and Eubank [71]. The notion of amplitude and phase variation is discussed in the books by Ramsay and Silverman [109, 110], which are of a more applied flavour. One can also consult the review by Wang et al. [127], where amplitude and phase variation are discussed in Sect. 5.2.

- 3. If the cumulative count process *Γ*(*t*) = *Π*[0, *t*) is mean-square continuous, then the use of the term “amplitude variation” can be seen to remain natural, as *Γ*(*t*) will admit a Karhunen–Loève expansion, with all stochasticity being attributable to the random amplitudes in the expansion.
- 4. This can be defined as a Bochner integral in the space of measurable bounded maps *T* : *K* → *K*.
- 5. In finite-dimensional (or, more generally, locally compact metric) spaces. If \(\mathcal X\) is an infinite-dimensional Hilbert space, the vague topology is trivial. This is stated and proved as Lemma 5 on page 27 in the supplement.

## References

- 2. M. Agueh, G. Carlier, Barycenters in the Wasserstein space. SIAM J. Math. Anal. **43**(2), 904–924 (2011)
- 3. M. Agueh, G. Carlier, Vers un théorème de la limite centrale dans l’espace de Wasserstein? C.R. Math. **355**(7), 812–818 (2017)
- 4. A. Ahidar-Coutrix, T. Le Gouic, Q. Paris, Convergence rates for empirical barycenters in metric spaces: curvature, convexity and extendable geodesics. Probab. Theory Relat. Fields (2019). https://link.springer.com/article/10.1007%2Fs00440-019-00950-0
- 5. M. Ajtai, J. Komlós, G. Tusnády, On optimal matchings. Combinatorica **4**(4), 259–264 (1984)
- 9. P.C. Álvarez-Esteban, E. del Barrio, J.A. Cuesta-Albertos, C. Matrán, A fixed-point approach to barycenters in Wasserstein space. J. Math. Anal. Appl. **441**(2), 744–762 (2016)
- 12. L. Ambrosio, N. Gigli, G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich, 2nd edn. (Springer, Berlin, 2008)
- 14. E. Anderes, S. Borgwardt, J. Miller, Discrete Wasserstein barycenters: optimal transport for discrete data. Math. Meth. Oper. Res. **84**(2), 1–21 (2016)
- 16. A. Bandeira, P. Rigollet, J. Weed, Optimal rates of estimation for multi-reference alignment (2017). arXiv:1702.08546
- 22. J. Bigot, T. Klein, Characterization of barycenters in the Wasserstein space by averaging optimal transport maps. ESAIM: Probab. Stat. **22**, 35–57 (2018)
- 23. J. Bigot, R. Gouet, T. Klein, A. López, Upper and lower risk bounds for estimating the Wasserstein barycenter of random measures on the real line. Electron. J. Stat. **12**(2), 2253–2289 (2018)
- 25. S. Bobkov, M. Ledoux, One-Dimensional Empirical Measures, Order Statistics and Kantorovich Transport Distances, vol. 261, no. 1259 (Memoirs of the American Mathematical Society, Providence, 2019). https://doi.org/10.1090/memo/1259
- 27. E. Boissard, T. Le Gouic, On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Ann. Inst. H. Poincaré Probab. Stat. **50**(2), 539–563 (2014)
- 34. A. Chakraborty, V.M. Panaretos, Functional registration and local variations: identifiability, rank, and tuning (2017). arXiv:1702.03556
- 41. D.J. Daley, D. Vere-Jones, An Introduction to the Theory of Point Processes, Volume II: General Theory and Structure (Springer, Berlin, 2007)
- 43. E. del Barrio, E. Giné, C. Matrán, Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab. **27**(2), 1009–1071 (1999)
- 46. R.M. Dudley, The speed of mean Glivenko–Cantelli convergence. Ann. Math. Stat. **40**(1), 40–50 (1969)
- 51. F. Ferraty, P. Vieu, Nonparametric Functional Data Analysis: Theory and Practice (Springer, Berlin, 2006)
- 54. N. Fournier, A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields **162**(3–4), 707–738 (2015)
- 58. B. Galasso, Y. Zemel, M. de Carvalho, Bayesian semiparametric modelling of phase-varying point processes (2018). arXiv:1812.09607
- 61. T. Gasser, A. Kneip, Searching for structure in curve samples. J. Amer. Statist. Assoc. **90**(432), 1179–1188 (1995)
- 63. D. Gervini, T. Gasser, Self-modelling warping functions. J. Roy. Stat. Soc.: Ser. B **66**(4), 959–971 (2004)
- 64. D. Gervini, T. Gasser, Nonparametric maximum likelihood estimation of the structural mean of a sample of curves. Biometrika **92**(4), 801–820 (2005)
- 70. L. Horváth, P. Kokoszka, Inference for Functional Data with Applications, vol. 200 (Springer, Berlin, 2012)
- 71. T. Hsing, R. Eubank, Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators (Wiley, Hoboken, 2015)
- 72. G.M. James, Curve alignment by moments. Ann. Appl. Stat. **1**(2), 480–501 (2007)
- 73. H.E. Jones, N. Bayley, The Berkeley growth study. Child Dev. **12**(2), 167–173 (1941)
- 75. O. Kallenberg, Random Measures, 3rd edn. (Academic, Cambridge, 1983)
- 79. A. Karr, Point Processes and Their Statistical Inference, vol. 7 (CRC Press, Boca Raton, 1991)
- 82. A. Kneip, J.O. Ramsay, Combining registration and fitting for functional models. J. Amer. Stat. Assoc. **103**(483), 1155–1165 (2008)
- 84. A. Kroshnin, V. Spokoiny, A. Suvorikova, Statistical inference for Bures–Wasserstein barycenters (2019). arXiv:1901.00226
- 88. E.L. Lehmann, A general concept of unbiasedness. Ann. Math. Stat. **22**(4), 587–592 (1951)
- 90. J.S. Marron, J.O. Ramsay, L.M. Sangalli, A. Srivastava, Functional data analysis of amplitude and phase variation. Stat. Sci. **30**(4), 468–484 (2015)
- 91. V. Masarotto, V.M. Panaretos, Y. Zemel, Procrustes metrics on covariance operators and optimal transportation of Gaussian processes. Sankhyā A **81**, 172–213 (2019) (Invited Paper, Special Issue on Statistics on non-Euclidean Spaces and Manifolds)
- 92. D.M. Mason, A weighted approximation approach to the study of the empirical Wasserstein distance, in High Dimensional Probability VII, ed. by C. Houdré, D.M. Mason, P. Reynaud-Bouret, J. Rosiński (Birkhäuser, Basel, 2016), pp. 137–154
- 100. V.M. Panaretos, Y. Zemel, Amplitude and phase variation of point processes. Ann. Stat. **44**(2), 771–812 (2016)
- 101. V.M. Panaretos, Y. Zemel, Statistical aspects of Wasserstein distances. Annu. Rev. Stat. Appl. **6**, 405–431 (2019)
- 108. J. Ramsay, X. Li, Curve registration. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) **60**(2), 351–363 (1998)
- 109. J.O. Ramsay, B.W. Silverman, Applied Functional Data Analysis: Methods and Case Studies (Springer, New York, 2002)
- 110. J.O. Ramsay, B.W. Silverman, Functional Data Analysis, 2nd edn. (Springer, Berlin, 2005)
- 111. J.O. Ramsay, H. Wickham, S. Graves, G. Hooker, FDA: Functional Data Analysis, R package version 2.4.8 (2018)
- 115. B.B. Rønn, Nonparametric maximum likelihood estimation for shifted curves. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) **63**(2), 243–259 (2001)
- 118. H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. **26**(1), 43–49 (1978)
- 122. R. Tang, H.-G. Müller, Pairwise curve synchronization for functional data. Biometrika **95**(4), 875–889 (2008)
- 123. J.D. Tucker, W. Wu, A. Srivastava, Generative models for functional data using phase and amplitude separation. Comput. Stat. Data Anal. **61**, 50–66 (2013)
- 127. J.-L. Wang, J.-M. Chiou, H.-G. Müller, Functional data analysis. Annu. Rev. Stat. Appl. **3**, 257–295 (2016)
- 128. K. Wang, T. Gasser, Alignment of curves by dynamic time warping. Ann. Stat. **25**(3), 1251–1276 (1997)
- 129. K. Wang, T. Gasser, Synchronizing sample curves nonparametrically. Ann. Stat. **27**, 439–460 (1999)
- 130. J. Weed, F. Bach, Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli **25**(4A), 2620–2648 (2019). https://projecteuclid.org/euclid.bj/1568362038
- 131. J. Wrobel, V. Zipunnikov, J. Schrack, J. Goldsmith, Registration for exponential family functional data. Biometrics **75**, 48–57 (2019)
- 132. S. Wu, H.-G. Müller, Z. Zhang, Functional data analysis for point processes with rare events. Stat. Sinica **23**(1), 1–23 (2013)
- 133. W. Wu, A. Srivastava, Analysis of spike train data: alignment and comparisons using the extended Fisher–Rao metric. Electron. J. Stat. **8**, 1776–1785 (2014)
- 134. Y. Zemel, V.M. Panaretos, Fréchet means and Procrustes analysis in Wasserstein space. Bernoulli **25**(2), 932–976 (2019)

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.