Why is it relevant to construct the Fréchet mean of a collection of measures with respect to the Wasserstein metric? A simple answer is that this kind of average will often express a more natural notion of “typical” realisation of a random probability distribution than an arithmetic average. Much more can be said, however, in that the Wasserstein–Fréchet mean and the closely related notion of an optimal multicoupling arise canonically as the appropriate framework for the formulation and solution to the problem of separation of amplitude and phase variation of a point process. It would almost seem that Wasserstein–Fréchet means were “made” for precisely this problem.

When analysing the (co)variation of a real-valued stochastic process {Y (x) : x ∈ K} over a convex compact domain K, it can be broadly said that one may distinguish two layers of variation:

  • Amplitude variation. This is the “classical” variation that one would also encounter in multivariate analysis, and refers to the stochastic fluctuations around a mean level, usually encoded in the covariance kernel, at least up to second order.

    In short, this is variation “in the y-axis” (ordinate).

  • Phase variation. This is a second layer of non-linear variation peculiar to continuous domain stochastic processes, and is rarely—if ever—encountered in multivariate analysis. It arises as the result of random changes (or deformations) in the time scale (or the spatial domain) of definition of the process. It can be conceptualised as a composition of the stochastic process with a random transformation (warp map) acting on its domain.

    This is variation “in the x-axis” (abscissa).

The terminology of amplitude/phase variation is adapted from random trigonometric functions, which may vary in amplitude (oscillations in the range of the function) or phase (oscillations in the domain of the function). Failing to properly account for the superposition of these two forms of variation may entirely distort the findings of a statistical analysis of the random function (see Sect. 4.1.1). Consequently, it is an important problem to be able to separate the two, thus correctly accounting for the distinct contribution of each. The problem of separation is also known as that of registration (Ramsay and Li [108]), synchronisation (Wang and Gasser [129]), or multireference alignment (Bandeira et al. [16]), though in some cases these terms refer to a simpler problem where there is no amplitude variation at all.

Phase variation naturally arises in the study of random phenomena where there is no absolute notion of time or space, but every realisation of the phenomenon evolves according to a time scale that is intrinsic to the phenomenon itself, and (unfortunately) unobservable. Processes related to physiological measurements, such as growth curves and neuronal signals, are usual suspects. Growth curves can be modelled as continuous random functions (functional data), whereas neuronal signals are better modelled as discrete random measures (point processes). We first describe amplitude/phase variation in the former case, as that is easier to appreciate, before moving on to the latter case, which is the main subject of this chapter.

4.1 Amplitude and Phase Variation

4.1.1 The Functional Case

Let K denote the unit cube \([0,1]^d\subset \mathbb {R}^d\). A real random function Y = (Y (x) : x ∈ K) can, broadly speaking, have two types of variation. The first, amplitude variation, results from Y (x) being a random variable for every x and describes its fluctuations around the mean level \(m(x)=\mathbb {E} Y(x)\), usually encoded by the variance varY (x). For this reason, it can be referred to as “variation in the y-axis”. More generally, for any finite set x 1, …, x n, the n × n covariance matrix with entries κ(x i, x j) = cov[Y (x i), Y (x j)] encapsulates (up to second order) the stochastic deviations of the random vector (Y (x 1), …, Y (x n)) from its mean, in analogy with the multivariate case. Heuristically, one then views amplitude variation as the collection κ(x, y) for x, y ∈ K in a sense we discuss next.

One typically views Y as a random element in the separable Hilbert space L 2(K), assumed to have \(\mathbb {E}\|Y\|{ }^2<\infty \) and continuous sample paths, so that in particular Y (x) is a random variable for all x ∈ K. Then the mean function

$$\displaystyle \begin{aligned} m(x) =\mathbb{E} Y(x), \quad x\in K \end{aligned}$$

and the covariance kernel

$$\displaystyle \begin{aligned} \kappa(x,y)={\mathrm{cov}}[Y(x),Y(y)], \quad x,y\in K \end{aligned}$$

are well-defined and finite; we shall assume that they are continuous, which is equivalent to Y being mean-square continuous:

$$\displaystyle \begin{aligned} \mathbb{E} [Y(y) - Y(x)]^2 \to0, \qquad y\to x. \end{aligned}$$

The covariance kernel κ gives rise to the covariance operator \(\mathcal R:L_2(K)\to L_2(K)\), defined by

$$\displaystyle \begin{aligned} (\mathcal R f)(y) ={{\int_{K} \! {\kappa(x,y)f(x)} \, \mathrm{d}x}} , \end{aligned}$$

a self-adjoint positive semidefinite Hilbert–Schmidt operator on L 2(K). The justification for this terminology is the observation that when m = 0, for all bounded f, g ∈ L 2(K),

$$\displaystyle \begin{aligned} \mathbb{E}{\left\langle {Y},{f}\right\rangle} {\left\langle {Y},{g}\right\rangle} =\mathbb{E}\left[{\int_{K^2} \! Y(x)f(x)Y(y)g(y) \, \mathrm{d}(x,y)}\right] ={{\int_{K} \! {g(y)(\mathcal Rf)(y)} \, \mathrm{d}y}} , \end{aligned}$$

and so, without the restriction to m = 0,

$$\displaystyle \begin{aligned} {\mathrm{cov}}\left[{\left\langle {Y},{f}\right\rangle} , {\left\langle {Y},{g}\right\rangle} \right] ={{\int_{K} \! {g(y)(\mathcal Rf)(y)} \, \mathrm{d}y}} ={\left\langle {g},{\mathcal Rf}\right\rangle} . \end{aligned}$$

The covariance operator admits an eigendecomposition \((r_k,\phi _k)_{k=1}^\infty \) such that \(r_1\ge r_2\ge \dots \ge 0\), \(\mathcal R\phi _k=r_k\phi _k\), and (ϕ k) is an orthonormal basis of L 2(K). One then has the celebrated Karhunen–Loève expansion

$$\displaystyle \begin{aligned} Y(x) =m(x) +\sum_{k=1}^\infty \left\langle {Y-m},{\phi_k}\right\rangle\phi_k(x) =m(x) + \sum_{k=1}^\infty \xi_k\phi_k(x). \end{aligned}$$

A major feature in this expansion is the separation of the functional part from the stochastic part: the functions ϕ k(x) are deterministic; the random variables ξ k are scalars. This separation actually holds for any orthonormal basis; the role of choosing the eigenbasis of \(\mathcal R\) is to make the ξ k uncorrelated:

$$\displaystyle \begin{aligned} {\mathrm{cov}}(\xi_k,\xi_l) ={\mathrm{cov}}\left[{\left\langle {Y},{\phi_k}\right\rangle} , {\left\langle {Y},{\phi_l}\right\rangle} \right] =\left\langle {\phi_l},{\mathcal R\phi_k}\right\rangle \end{aligned}$$

vanishes when k ≠ l and equals r k when k = l. For this reason, it is not surprising that choosing the eigenfunctions as the ϕ k yields the optimal representation of Y . Here, optimality is with respect to truncations: for any other orthonormal basis (ψ k) and any M,

$$\displaystyle \begin{aligned} \mathbb{E} \left\| Y -m -\sum_{k=1}^M \left\langle {Y-m},{\psi_k}\right\rangle\psi_k \right\|{}^2 \ge \mathbb{E} \left\| Y -m -\sum_{k=1}^M \left\langle {Y-m},{\phi_k}\right\rangle\phi_k \right\|{}^2 \end{aligned}$$

so that (ϕ k) provides the best finite-dimensional approximation to Y . The approximation error on the right-hand side equals

$$\displaystyle \begin{aligned} \mathbb{E} \left\| \sum_{k=M+1}^\infty \xi_k\phi_k \right\|{}^2 =\sum_{k=M+1}^\infty r_k \end{aligned}$$

and depends on how quickly the eigenvalues of \(\mathcal R\) decay.

One carries out inference for m and κ on the basis of a sample Y 1, …, Y n by

$$\displaystyle \begin{aligned} \widehat m(x) =\frac 1n \sum_{i=1}^n Y_i(x) ,\qquad x\in K \end{aligned}$$

and

$$\displaystyle \begin{aligned} \widehat \kappa(x,y)= \frac 1n \sum_{i=1}^n Y_i(x)Y_i(y) - \widehat m(x) \widehat m(y), \end{aligned}$$

from which one proceeds to estimate \(\mathcal R\) and its eigendecomposition.
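To fix ideas, the following minimal numerical sketch carries out these estimation steps on synthetic curves with a known three-term Karhunen–Loève structure (the grid, sample size, and eigenstructure are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 100, 200                       # grid size, number of curves
x = np.linspace(0.0, 1.0, J)

# Curves with a known Karhunen-Loeve structure: Y_i = sum_k xi_k phi_k,
# where the phi_k are orthonormal in L_2(0,1) and r_1 >= r_2 >= r_3.
phi = np.sqrt(2.0) * np.array([np.sin((k + 1) * np.pi * x) for k in range(3)])
r = np.array([1.0, 0.25, 0.05])
xi = rng.normal(size=(n, 3)) * np.sqrt(r)
Y = xi @ phi                          # n x J matrix of discretised curves

m_hat = Y.mean(axis=0)                # estimator of the mean function m
kappa_hat = np.cov(Y, rowvar=False)   # J x J empirical covariance kernel

# Discretising the covariance operator: (Rf)(y) = int kappa(x,y)f(x)dx is
# approximated by the matrix kappa_hat / J acting on function values.
evals = np.linalg.eigvalsh(kappa_hat / J)
print(np.round(evals[::-1][:3], 2))   # roughly (1.0, 0.25, 0.05)
```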

We have seen that amplitude variation in the sense described above is linear and dealt with using linear operations. There is another, qualitatively different type of variation, phase variation, that is non-linear and does not have an obvious finite-dimensional analogue. It arises when, in addition to the randomness in the values Y (x) themselves, an extra layer of stochasticity is present in the domain of definition. In mathematical terms, there is a random invertible warp function (sometimes called a deformation or warping) T : K → K and, instead of Y (x), one observes realisations from

$$\displaystyle \begin{aligned} \tilde Y(x) = Y(T^{-1}(x)), \qquad x\in K.\end{aligned} $$

For this reason, phase variation can be viewed as “variation in the x-axis”. When d = 1, the set K is usually interpreted as a time interval, and then the model stipulates that each individual has its own time scale. Typically, the warp function is assumed to be a homeomorphism of K independent of Y and often some additional smoothness is imposed, say T ∈ C 2. One of the classical examples is growth curves of children, of which a dataset from the Berkeley growth study (Jones and Bayley [73]) is shown in Fig. 4.1. The curves are the derivatives of the height of a sample of ten girls as a function of time, from birth until age 18. One clearly notices the presence of the two types of variation in the figure. The initial velocity for all children is the highest immediately or shortly after birth, and in most cases decreases sharply during the first 2 years. Then follows a period of acceleration for another year or so, and so on. Despite presenting qualitatively similar behaviour, the curves differ substantially not only in the magnitude of the peaks but also in their location. For instance, one red curve has a local minimum at the age of three, while a green one has a local maximum at almost that same time point. It is apparent that if one tries to estimate the mean function by averaging the curves at each time x, the shape of the resulting estimate would look very different from each of the curves. Thus, this pointwise averaging (known as the cross-sectional mean) fails to represent the typical behaviour. This phenomenon is seen more explicitly in the next example.

Fig. 4.1 Derivatives of growth curves of ten girls from the Berkeley dataset. The data and the code for the figure are from the R package fda (Ramsay et al. [111])

The terminology of amplitude and phase comes from trigonometric functions, from which we derive an artificial example that illustrates the difficulties of estimation in the presence of phase variation. Let A and B be symmetric random variables and consider the random function

$$\displaystyle \begin{aligned} \tilde Y(x) =A\sin[8\pi(x + B)]. \end{aligned} $$
(4.1)

(Strictly speaking, x ↦ x + B is not from [0, 1] to itself; for illustration purposes, we assume in this example that \(K=\mathbb {R}\).) The random variable A generates the amplitude variation, while B represents the phase variation. In Fig. 4.2, we plot four realisations and the resulting empirical means for the two extreme scenarios where B = 0 (no phase variation) or A = 1 (no amplitude variation). In the left panel of the figure, we see that the sample mean (in thick blue) lies between the observations and has a similar form, so can be viewed as the curve representing the typical realisation of the random curve. This is in contrast to the right panel, where the mean is qualitatively different from all curves in the sample: though periodicity is still present, the peaks and troughs have been flattened, and the sample mean is much more diffuse than any of the observations.

Fig. 4.2 Four realisations of (4.1) with means in thick blue. Left: amplitude variation (B = 0); right: phase variation (A = 1)

The phenomenon illustrated in Fig. 4.2 is hardly surprising, since as mentioned earlier amplitude variation is linear while phase variation is not, and taking sample means is a linear operation. Let us see in formulae how this phenomenon occurs. When A = 1 we have

$$\displaystyle \begin{aligned} \mathbb{E} \tilde Y(x) =\sin (8\pi x) \mathbb{E} [\cos (8\pi B)] +\cos (8\pi x) \mathbb{E} [\sin (8\pi B)]. \end{aligned}$$

Since B is symmetric the second term vanishes, and unless B is trivial the expectation of the cosine is smaller than one in absolute value. Consequently, the expectation of \(\tilde Y(x)\) is the original function \(\sin 8\pi x\) multiplied by a constant of magnitude strictly less than one, resulting in peaks of smaller magnitude.
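This damping can be checked by simulation; the following sketch (with an illustrative symmetric distribution for B) reproduces the identity above by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
B = rng.uniform(-0.05, 0.05, size=10_000)        # symmetric phase shifts

curves = np.sin(8 * np.pi * (x[None, :] + B[:, None]))
cross_sectional_mean = curves.mean(axis=0)

damping = np.cos(8 * np.pi * B).mean()           # E[cos(8 pi B)] < 1
print(round(damping, 2))                         # about 0.76 here
print(np.allclose(cross_sectional_mean,
                  damping * np.sin(8 * np.pi * x), atol=0.02))   # True
```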

In the general case, where \(\tilde Y(x)=Y(T^{-1}(x))\) and Y and T are independent, we have

$$\displaystyle \begin{aligned} \mathbb{E}\tilde Y(x) =\mathbb{E} [m(T^{-1}(x))] \end{aligned}$$

and

$$\displaystyle \begin{aligned} {\mathrm{cov}}[\tilde Y(x),\tilde Y(y)] =\mathbb{E} [\kappa(T^{-1}(x),T^{-1}(y))] + {\mathrm{cov}}[m(T^{-1}(x)),m(T^{-1}(y))]. \end{aligned}$$

From this, several conclusions can be drawn. Let \(\tilde m(x) =m(T^{-1}(x))\) be the conditional mean function given T. Then the value of the mean function \(\mathbb {E} [\tilde m]\) at a point x 0 is determined not by m(x 0) alone, but rather by all the values of m at the possible outcomes of T −1(x 0). In particular, if x 0 was a local maximum for m, then \(\mathbb {E} [\tilde m(x_0)]\) will typically be strictly smaller than m(x 0); the phase variation results in a smearing of m.

At this point an important remark should be made. Whether or not phase variation is problematic depends on the specific application. If one is indeed interested in the mean and covariance functions of \(\tilde Y\), then the standard empirical estimators will be consistent, since \(\tilde Y\) is itself a random function. But if it is rather m, the mean of Y , that is of interest, then the confounding of the amplitude and phase variation will lead to inconsistency. This can also be seen from the formula

$$\displaystyle \begin{aligned} \tilde Y(x) =m(T^{-1}(x)) + \sum_{k=1}^\infty \xi_k \phi_k(T^{-1}(x)). \end{aligned}$$

The above series is not the Karhunen–Loève expansion of \(\tilde Y\); the simplest way to see this is to observe that ϕ k(T −1(x)) includes both the functional component ϕ k and the random component T −1(x). The true Karhunen–Loève expansion of \(\tilde Y\) will in general be qualitatively very different from that of Y , not only in terms of the mean function but also in terms of the covariance operator and, consequently, its eigenfunctions and eigenvalues. As illustrated in the trigonometric example, the typical situation is that the mean \(\mathbb {E} \tilde Y\) is more diffuse than m, and the decay of the eigenvalues \(\tilde r_k\) of the covariance operator is slower than that of r k; as a result, one needs to truncate the sum at a high threshold in order to capture a substantial enough part of the variability. In the toy example (4.1), the Karhunen–Loève expansion has a single term besides the mean if B = 0, while having two terms if A = 1.

When one is indeed interested in the mean m and the covariance κ, the random function T pertaining to the phase variation is a nuisance parameter. Given a sample \(\tilde Y_i=Y_i\circ T_i^{-1}\), i = 1, …, n, there is no point in taking pointwise means of \(\tilde Y_i\), because the curves are misaligned; \(\tilde Y_1(x)=Y_1(T_1^{-1}(x))\) should not be compared with \(\tilde Y_2(x)\), but rather with \(Y_2(T_1^{-1}(x))=\tilde Y_2(T_2(T_1^{-1}(x)))\). To overcome this difficulty, one seeks estimators \(\widehat {T_i}\) such that

$$\displaystyle \begin{aligned} \widehat Y_i(x) =\tilde Y_i(\widehat{T_i}(x)) =Y_i(T_i^{-1}(\widehat{T_i}(x))) \end{aligned}$$

is approximately Y i(x). In other words, one tries to align the curves in the sample to a common time scale. Such a procedure is called curve registration. Once registration has been carried out, one proceeds with the analysis of \(\widehat {Y_i}(x)\), assuming that only amplitude variation is now present: estimate the mean m by

$$\displaystyle \begin{aligned} \widehat m(x) =\frac 1n \sum_{i=1}^n \widehat{Y_i}(x) \end{aligned}$$

and the covariance κ by its analogous counterpart. Put differently, registering the curves amounts to separating the two types of variation. This step is crucial regardless of whether the warp functions are considered a nuisance or are themselves of interest in the particular application.

There is an obvious identifiability problem in the model \(\tilde Y=Y\circ T^{-1}\). If S is any (deterministic) invertible function, then the model with (Y, T) is statistically indistinguishable from the model with (Y ∘ S, T ∘ S). It is therefore often assumed that \(\mathbb {E} T=\mathbf i\), the identity, and in addition, in nearly all applications, that T is monotonically increasing (if d = 1).

Discretely observed data. One cannot measure the height of a person at every single instant of her life. In other words, it is rare in practice that one has access to the entire curve. A far more common situation is that one observes the curves discretely, i.e., at a finite number of points. The conceptually simplest setting is that one has access to a grid x 1, …, x J ∈ K, and the data come in the form

$$\displaystyle \begin{aligned} \tilde y_{ij} =\tilde Y_i(x_j), \end{aligned}$$

possibly with measurement error. The problem is to find, given \(\tilde y_{ij}\), consistent estimators of T i and of the original, aligned functions Y i.

In the bibliographic notes, we review some methods for carrying out this separation of amplitude and phase variation. It is fair to say that no single registration method arises as the canonical solution to the functional registration problem. Indeed, most need to make additional structural and/or smoothness assumptions on the warp maps, further to the basic identifiability conditions requiring that T be increasing and that \(\mathbb {E} T\) equal the identity. We will eventually see that, in contrast, the case of point processes (viewed as discretely observed random measures) admits a canonical framework, without needing additional assumptions.

4.1.2 The Point Process Case

A point process is the mathematical object that represents the intuitive notion of a random collection of points in a space \(\mathcal X\). It is formally defined as a measurable map Π from a generic probability space into the space of (possibly infinite) integer-valued Borel measures on \(\mathcal X\), in such a way that Π(B) is a measurable real-valued random variable for all Borel subsets B of \(\mathcal X\). The quantity Π(B) represents the random number of points observed in the set B. Among the plethora of books on point processes, let us mention Daley and Vere-Jones [41] and Karr [79]. Kallenberg [75] treats more general objects, random measures, of which point processes are a special case. We will assume for convenience that Π is a measure on a compact subset \(K\subset \mathbb {R}^d\).

Amplitude variation of Π can be understood in analogy with the functional case. One defines the mean measure

$$\displaystyle \begin{aligned} \lambda(A) =\mathbb{E} [\varPi(A)], \qquad A\subseteq K\ \text{Borel} \end{aligned}$$

and, provided that \(\mathbb {E} [\varPi (K)^2]<\infty \), the covariance measure

$$\displaystyle \begin{aligned} \kappa(A,B) ={\mathrm{cov}} [\varPi(A),\varPi(B)] =\mathbb{E} [\varPi(A)\varPi(B)] - \lambda(A) \lambda(B), \end{aligned}$$

the latter being a finite signed Borel measure on K. Just like in the functional case, these two objects encapsulate the (second-order) amplitude variation properties of the law of Π.

Given a sample Π 1, …, Π n of independent point processes distributed as Π, the natural estimators

$$\displaystyle \begin{aligned} \widehat\lambda(A) =\frac 1n \sum_{i=1}^n \varPi_i(A); \qquad \widehat \kappa(A,B) =\frac 1n\sum_{i=1}^n \varPi_i(A)\varPi_i(B) - \widehat\lambda(A) \widehat\lambda(B), \end{aligned}$$

are consistent and the former asymptotically normal [79, Proposition 4.8].
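As a small illustration, consider a hypothetical homogeneous Poisson process on K = [0, 1] (the intensity and test set below are arbitrary choices); the estimator \(\widehat \lambda (A)\) is simply an average of counts:

```python
import numpy as np

rng = np.random.default_rng(2)
n, intensity = 500, 20.0               # sample size, E[Pi([0,1])]

def sample_poisson_process():
    """Homogeneous Poisson process on [0,1] with mean measure 20 * Lebesgue."""
    return rng.uniform(0.0, 1.0, size=rng.poisson(intensity))

processes = [sample_poisson_process() for _ in range(n)]

A = (0.2, 0.5)                                    # a test Borel set
counts = [np.sum((p >= A[0]) & (p < A[1])) for p in processes]
print(np.mean(counts))                 # lambda_hat(A), approx 20 * 0.3 = 6
```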

Phase variation then pertains to a random warp function T : K → K (independent of Π) that deforms Π: if we denote the points of Π by x 1, …, x N (with N random), then instead of (x i), one observes T(x 1), …, T(x N). In symbols, this means that the data arise as \(\widetilde \varPi = T\#\varPi \). We refer to Π as the original point process, and \(\widetilde \varPi \) as the warped point process. An example of 30 warped and unwarped point processes is shown in Fig. 4.3. The point patterns in both panels present a qualitatively similar structure: there are two peaks of high concentration of points, while few points appear between these peaks. The difference between the two panels is in the position and concentration of those peaks. In the left panel, only amplitude variation is present, and the location/concentration of the peaks is the same across all observations. In contrast, phase variation shifts the peaks to different places for each of the observations, while also smearing or sharpening them. Clearly, estimating the mean measure of a subset A by averaging the number of observed points in A across the warped data would not yield a satisfactory estimator of λ. As in the functional case, it will only be consistent for the measure \(\tilde \lambda \) defined by

$$\displaystyle \begin{aligned} \tilde\lambda(A) =\mathbb{E} [\lambda(T^{-1}(A))] ,\qquad A\subseteq \mathcal X, \end{aligned}$$

and \(\tilde \lambda =\mathbb {E} [T\#\lambda ]\) misses most (or at least a significant part) of the bimodal structure of λ and is far more diffuse.

Fig. 4.3 Unwarped (left) and warped (right) Poisson point processes

Since Π and T are independent, the conditional expectation of \(\widetilde \varPi \) given T is

$$\displaystyle \begin{aligned} \mathbb{E} [\widetilde\varPi(A)|T] =\mathbb{E} [\varPi(T^{-1}(A))|T] =\lambda(T^{-1}(A)) =[T\#\lambda](A). \end{aligned}$$

Consequently, we refer to Λ = T#λ as the conditional mean measure. The problem of separation of amplitude and phase variation can now be stated as follows. On the basis of a sample \(\widetilde \varPi _1,\dots ,\widetilde \varPi _n\), find estimators of (T i) and (Π i). Registering the point processes amounts to constructing estimators, registration maps \(\widehat {T_i^{-1}}\), such that the aligned points

$$\displaystyle \begin{aligned} \widehat{\varPi_i} =\widehat{T_i^{-1}}\# \widetilde \varPi_i =[\widehat{T_i^{-1}}\circ T_i]\# \varPi_i \end{aligned}$$

are close to the original points Π i.

Remark 4.1.1 (Poisson Processes)

A special but important case is that of a Poisson process. Gaussian processes probably yield the most elegant and rich theory in functional data analysis, and so do Poisson processes when it comes to point processes. We say that Π is a Poisson process when the following two conditions hold: (1) for any disjoint collection (A 1, …, A n) of Borel sets, the random variables Π(A 1), …, Π(A n) are independent; and (2) for every Borel A ⊆ K, Π(A) follows a Poisson distribution with mean λ(A):

$$\displaystyle \begin{aligned} {\mathbb{P}}(\varPi(A)=k) =e^{-\lambda(A)}\frac{[\lambda(A)]^k}{k!}.\end{aligned} $$

Conditional upon T, the random variables \(\widetilde \varPi (A_k)=\varPi (T^{-1}(A_k))\), k = 1, …, n are independent, as the sets (T −1(A k)) are disjoint, and \(\widetilde \varPi (A)\) follows a Poisson distribution with mean λ(T −1(A)) = Λ(A). This is precisely the definition of a Cox process: conditional upon the driving measure Λ, \(\widetilde \varPi \) is a Poisson process with mean measure Λ. For this reason, it is also called a doubly stochastic process; in our context, the phase variation is associated with the stochasticity of Λ, while the amplitude variation is associated with the Poisson variation conditional upon Λ.
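The following simulation sketch makes the Cox mechanism concrete. The warp family T(x) = x + V sin(2πx)∕(2π), with V uniform on [−1, 1], is a hypothetical choice used only for illustration: it fixes 0 and 1, is increasing whenever |V| ≤ 1, and has mean identity since \(\mathbb {E} V=0\).

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_warped_process(intensity=50.0):
    N = rng.poisson(intensity)                  # amplitude variation
    points = rng.uniform(0.0, 1.0, size=N)      # Pi: Poisson, mean 50 * Leb
    V = rng.uniform(-1.0, 1.0)                  # phase variation
    T = lambda t: t + V * np.sin(2 * np.pi * t) / (2 * np.pi)
    return np.sort(T(points))                   # observed pattern: T # Pi

print(sample_warped_process()[:5])              # a few warped points
```

Conditional on V (that is, on T), each draw is a Poisson process with mean measure Λ = T#λ, which is precisely the doubly stochastic structure described above.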

As in the functional case there are problems with identifiability: the model (Π, T) cannot be distinguished from the model (S#Π, T ∘ S −1) for any invertible S : K → K. It is thus natural to assume that \(\mathbb {E} T\) is the identity map (otherwise set \(S=\mathbb {E} T\), i.e., replace Π by \([\mathbb {E} T]\#\varPi \) and T by \(T\circ [\mathbb {E} T]^{-1}\)).

Constraining T to have mean identity is nevertheless not sufficient for the model \(\widetilde \varPi =T\#\varPi \) to be identifiable. The reason is that given the two point sets \(\widetilde \varPi \) and Π, there are many functions that push forward the latter to the former. This ambiguity can be dealt with by assuming some sort of regularity or parsimony for T. For example, when K = [a, b] is a subset of the real line, imposing T to be monotonically increasing guarantees its uniqueness. In multiple dimensions, there is no obvious analogue for increasing functions. One possible definition is the monotonicity described in Sect. 1.7.2:

$$\displaystyle \begin{aligned} \left\langle {T(y) - T(x)},{y-x}\right\rangle \ge 0 ,\qquad x,y\in K. \end{aligned}$$

This property is rather weak in a sense we describe now. Let \(K\subseteq \mathbb {R}^2\) and write y ≥ x if and only if y i ≥ x i for i = 1, 2. It is natural to expect the deformations to maintain this componentwise order in \(\mathbb {R}^2\):

$$\displaystyle \begin{aligned} y\ge x \quad \Longrightarrow \quad T(y) \ge T(x). \end{aligned}$$

If we require in addition that the ordering be preserved for all quadrants: for z = T(x) and w = T(y),

$$\displaystyle \begin{aligned} \{y_1\ge x_1,y_2\le x_2\} \qquad \Longrightarrow \{w_1\ge z_1,w_2\le z_2\}, \end{aligned}$$

then monotonicity is automatically satisfied. In that sense, it is arguably not very restrictive.

Monotonicity is weaker than cyclical monotonicity (see (1.10) with y i = T(x i)), which is itself equivalent to the property of being the subgradient of a convex function. But if extra smoothness is present and T is a gradient of some function \(\phi :K\to \mathbb {R}\), then ϕ must be convex and T is then cyclically monotone. Consequently, we will make the following assumptions:

  • the expected value of T is the identity;

  • T is a gradient of a convex function.

In the functional case, at least on the real line, these two conditions are imposed on the warp functions in virtually all applications, often accompanied by additional assumptions about the smoothness of T, its structural properties, or its distance from the identity. In the next section, we show how these two conditions alone lead to the Wasserstein geometry and open the door to consistent, fully nonparametric separation of the amplitude and phase variation.

4.2 Wasserstein Geometry and Phase Variation

4.2.1 Equivariance Properties of the Wasserstein Distance

A first hint at the relevance of Wasserstein metrics in \(\mathcal W_p(\mathcal X)\) for deformations of the space \(\mathcal X\) is that for all p ≥ 1 and all \(x,y\in \mathcal X\),

$$\displaystyle \begin{aligned} W_p(\delta_x,\delta_y) =\|x-y\|, \end{aligned}$$

where δ x is as usual the Dirac measure at \(x\in \mathcal X\). This is in contrast to metrics such as the bounded Lipschitz distance (that metrises weak convergence) or the total variation distance on \(P(\mathcal X)\). Recall that these are defined by

$$\displaystyle \begin{aligned} \|\mu - \nu\|{}_{\mathrm{BL}} =\sup_{\|\varphi\|{}_{\mathrm{BL}}\le 1} \left| {{\int_{\mathcal X} \! \varphi \, \mathrm{d}\mu}} - {{\int_{\mathcal X} \! \varphi \, \mathrm{d}\nu}} \right|; \qquad \|\mu-\nu\|{}_{\mathrm{TV}} =\sup_A |\mu(A) - \nu(A)|, \end{aligned}$$

so that

$$\displaystyle \begin{aligned} \|\delta_x - \delta_y\|{}_{\mathrm{BL}} =\min(1,\|x-y\|) ;\qquad \|\delta_x - \delta_y\|{}_{\mathrm{TV}} =\begin{cases} 1 & x\ne y\\ 0 & x=y. \end{cases} \end{aligned}$$

In words, the total variation metric “does not see the geometry” of the space \(\mathcal X\). This is less so for the bounded Lipschitz distance, which does take small distances into account but not large ones.

Another property (shared by BL and TV) is equivariance with respect to translations. It is more convenient to state it using the probabilistic formalism of Sect. 1.2. Let X ∼ μ and Y ∼ ν be random elements in \(\mathcal X\), let a be a fixed point in \(\mathcal X\), and set X′ = X + a and Y′ = Y + a. Joint couplings Z′ = (X′, Y′) are precisely those that take the form (a, a) + Z for a joint coupling Z = (X, Y ). Thus

$$\displaystyle \begin{aligned} W_p(\mu \ast \delta_a,\nu \ast \delta_a) =W_p(X',Y') =W_p(X,Y) =W_p(\mu,\nu), \end{aligned}$$

where δ a is a Dirac measure at a and ∗ denotes convolution.
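This equivariance is easy to verify numerically. The sketch below (illustrative distributions and sample sizes) computes W 2 in one dimension from sorted samples, i.e., through the quantile representation:

```python
import numpy as np

rng = np.random.default_rng(4)

def w2(xs, ys):
    """W_2 between two empirical measures with equally many atoms (1-D)."""
    return np.sqrt(np.mean((np.sort(xs) - np.sort(ys)) ** 2))

x, y, a = rng.normal(0, 1, 1000), rng.normal(2, 3, 1000), 5.0
print(np.isclose(w2(x, y), w2(x + a, y + a)))    # True
```

The equality here is exact rather than approximate, since sorting commutes with the addition of a constant.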

This carries over to Fréchet means in an obvious way.

Lemma 4.2.1 (Fréchet Means and Translations)

Let Λ be a random measure in \(\mathcal W_2(\mathcal X)\) with finite Fréchet functional and \(a\in \mathcal X\). Then γ is a Fréchet mean of Λ if and only if γ ∗ δ a is a Fréchet mean of Λ ∗ δ a.

The result holds for other values of p, in the formulation sketched in the bibliographic notes of Chap. 2. In the quadratic case, one has a simple extension to the case where only one measure is translated. Denote the first moment (mean) of \(\mu \in \mathcal W_1(\mathcal X)\) by

$$\displaystyle \begin{aligned} m:\mathcal W_1(\mathcal X)\to\mathcal X \qquad m(\mu) ={{\int_{\mathcal X} \! x \, \mathrm{d}{\mu(x)}}} . \end{aligned}$$

(When \(\mathcal X\) is infinite-dimensional, this can be defined as the unique element \(m\in \mathcal X\) satisfying

$$\displaystyle \begin{aligned} {\left\langle {m},{y}\right\rangle} ={{\int_{\mathcal X} \! {{\left\langle {x},{y}\right\rangle} } \, \mathrm{d}{\mu(x)}}} , \qquad y\in \mathcal X.) \end{aligned}$$

By an equivalence of couplings similar to above, we obtain

$$\displaystyle \begin{aligned} W_2^2(\mu\ast\delta_a,\nu) =W_2^2(\mu,\nu) +\|a - [m(\nu) - m(\mu)]\|{}^2 -\|m(\nu) - m(\mu)\|{}^2, \end{aligned}$$

which is minimised at a = m(ν) − m(μ). This leads to the following conclusion:

Proposition 4.2.2 (First Moment of Fréchet Mean)

Let Λ be a random measure in \(\mathcal W_2(\mathcal X)\) with finite Fréchet functional and Fréchet mean γ. Then

$$\displaystyle \begin{aligned} {{\int_{\mathcal X} \! x \, \mathrm{d}{\gamma(x)}}} =\mathbb{E} \left[ {{\int_{\mathcal X} \! x \, \mathrm{d}{\varLambda(x)}}} \right]. \end{aligned}$$
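As a numerical sanity check of the shift identity above, the minimising shift is indeed the mean difference m(ν) − m(μ); the distributions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
x, y = rng.normal(0, 1, 2000), rng.normal(2, 3, 2000)

def w2sq(xs, ys):
    """Squared W_2 between empirical measures with equally many atoms (1-D)."""
    return np.mean((np.sort(xs) - np.sort(ys)) ** 2)

shifts = np.linspace(0.0, 4.0, 401)
best = shifts[np.argmin([w2sq(x + a, y) for a in shifts])]
print(round(best, 2), round(y.mean() - x.mean(), 2))   # both approx 2.0
```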

4.2.2 Canonicity of Wasserstein Distance in Measuring Phase Variation

The purpose of this subsection is to show that the standard functional data analysis assumptions on the warp function T, having mean identity and being increasing, are equivalent to purely geometric conditions on T and the conditional mean measure Λ = T#λ. Put differently, if one is willing to assume that \(\mathbb {E} T=\mathbf i\) and that T is increasing, then one is led unequivocally to the problem of estimation of Fréchet means in the Wasserstein space \(\mathcal W_2(\mathcal X)\). When \(\mathcal X\ne \mathbb {R}\), “increasing” is interpreted as being the gradient of a convex function, as explained at the end of Sect. 4.1.2.

The total mass \(\lambda (\mathcal X)\) is invariant under the push-forward operation, and when it is finite, we may assume without loss of generality that it is equal to one, because all the relevant quantities scale with the total mass. Indeed, if λ = τμ with μ a probability measure and τ > 0, then T#λ = τ(T#μ), and the Wasserstein distance (defined as the infimum over couplings of the integrated cost) between τμ and τν is \(\tau W_p^p(\mu ,\nu )\) for probability measures μ, ν.

We begin with the one-dimensional case, where the explicit formulae allow for a more transparent argument, and for simplicity we will assume some regularity.

Assumptions 2

The domain \(K\subset \mathbb {R}\) is a nonempty compact convex set (an interval), and the continuous and injective random map \(T:K\to \mathbb {R}\) (a random element in C b(K)) satisfies the following two conditions:

  1. (A1)

    Unbiasedness: \(\mathbb {E}[T(x)]=x\) for all x ∈ K.

  2. (A2)

    Regularity: T is monotone increasing.

The relevance of the Wasserstein geometry to phase variation becomes clear in the following proposition, which shows that Assumptions 2 are equivalent to geometric assumptions on the Wasserstein space \(\mathcal W_2(\mathbb {R})\).

Proposition 4.2.3 (Mean Identity Warp Functions and Fréchet Means in \(\mathcal W_2(\mathbb {R})\))

Let \(\emptyset \ne K\subset \mathbb {R}\) be compact and convex and let \(T:K\to \mathbb {R}\) be a continuous and injective random map. Then Assumptions 2 hold if and only if, for any \(\lambda \in \mathcal W_2(K)\) supported on K such that \(\mathbb {E} [W_2^2(T\#\lambda ,\lambda )]<\infty \), the following two conditions are satisfied:

  1. (B1)

    Unbiasedness: for any \(\theta \in \mathcal W_2(\mathbb {R})\)

    $$\displaystyle \begin{aligned} \mathbb{E} [W_2^2(T\#\lambda,\lambda)] \le \mathbb{E} [W_2^2(T\#\lambda,\theta)]. \end{aligned}$$
  2. (B2)

    Regularity: if \(Q:K\to \mathbb {R}\) is such that T#λ = Q#λ, then

    $$\displaystyle \begin{aligned} {{\int_{K} \! {\Big|T(x)-x\Big|{}^2} \, \mathrm{d}{\lambda(x)}}} \le {{\int_{K} \! {\Big|Q(x)-x\Big|{}^2} \, \mathrm{d}{\lambda(x)}}} ,\qquad \mathit{\mbox{almost surely}}. \end{aligned}$$

These assumptions have a clear interpretation: (B1) stipulates that λ is a Fréchet mean of the random measure Λ = T#λ, while (B2) states that T must be the optimal map from λ to Λ, that is, \(T={\mathbf t_{\lambda }^{\varLambda }} \).

Proof

If T satisfies (B2) then, as an optimal map, it must be nondecreasing λ-almost surely. Since λ is arbitrary, T must be nondecreasing on the entire domain K. Conversely, if T is nondecreasing, then it is optimal for any λ. Hence (A2) and (B2) are equivalent.

Assuming (A2), we now show that (A1) and (B1) are equivalent. Condition (B1) is equivalent to the assertion that for all \(\theta \in \mathcal W_2(\mathbb {R})\),

$$\displaystyle \begin{aligned} \mathbb{E} \|F_{T\#\lambda}^{-1} - F_\lambda^{-1} \|{}^2_{L_2(0,1)} =\mathbb{E} [W_2^2(T\#\lambda,\lambda)] \le \mathbb{E} [W_2^2(T\#\lambda,\theta)] =\mathbb{E} \|F_{T\#\lambda}^{-1} - F_\theta^{-1} \|{}^2_{L_2(0,1)}, \end{aligned}$$

which is in turn equivalent to \(\mathbb {E} [F_{T\#\lambda }^{-1}]=\mathbb {E} [F_\varLambda ^{-1}]=F_\lambda ^{-1}\) (see Sect. 3.1.4). Condition (A2) and the assumptions on T imply that F Λ(x) = F λ(T −1(x)). Suppose that F λ is invertible (i.e., continuous and strictly increasing on K). Then \(F_\varLambda ^{-1}(u)=T(F_\lambda ^{-1}(u))\). Thus (B1) is equivalent to \(\mathbb {E} T(x)=x\) for all x in the range of \(F_\lambda ^{-1}\), which is K. The assertion that (A1) implies (B1), even if F λ is not invertible, is proven in the next theorem (Theorem 4.2.4) in a more general context.
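In one dimension, the proposition lends itself to a direct numerical check through quantile functions: for λ uniform on [0, 1] and the hypothetical mean-identity warp family used in the earlier sketch, the averaged quantile functions of Λ i = T i#λ recover \(F_\lambda ^{-1}\):

```python
import numpy as np

rng = np.random.default_rng(6)
u = np.linspace(0.01, 0.99, 99)       # quantile levels; F_lambda^{-1}(u) = u

n = 2000
V = rng.uniform(-1.0, 1.0, size=n)
# For increasing T, F_{T#lambda}^{-1} = T o F_lambda^{-1}, so here
# F_{Lambda_i}^{-1}(u) = u + V_i sin(2 pi u) / (2 pi).
quantiles = u + V[:, None] * np.sin(2 * np.pi * u) / (2 * np.pi)
print(np.abs(quantiles.mean(axis=0) - u).max())   # near 0: lambda is the mean
```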

The situation in more than one dimension is similar, but the proof is less transparent. To avoid compactness assumptions, we introduce the following space (taken from Agueh and Carlier [2]) of continuous functions that grow at most like ∥⋅∥q (q ≥ 0):

$$\displaystyle \begin{aligned} G_q(\mathcal X)= (1+\|\cdot\|{}^q)C_b(\mathcal X) =\left\{ f:\mathcal X\to\mathbb R\ \text{continuous}:\sup_{x\in\mathcal X}\frac{|f(x)|}{1+\|x\|{}^q}<\infty\right\} \end{aligned}$$

with the norm \(\|f\|{ }_{G_q}=\sup |f(x)|/(1+\|x\|{ }^q)=\|f/(1+\|\cdot \|{ }^q)\|{ }_\infty \). The space \(G_q(\mathcal X,\mathcal X)\) is defined similarly, with f taking values in \(\mathcal X\) instead of \(\mathbb {R}\), and the norm will be denoted in the same way. These are nonseparable Banach spaces.

Theorem 4.2.4 (Mean Identity Warp Functions and Fréchet Means)

Fix \(\lambda \in P(\mathcal X)\) and let \(\mathbf t\in G_{1}(\mathcal X,\mathcal X)\) be a (Bochner measurable) random optimal map with (Bochner) mean identity and such that \(\mathbb {E} \|\mathbf t\|{ }_{G_1}<\infty \). Then Λ = t#λ has Fréchet mean λ:

$$\displaystyle \begin{aligned} \mathbb{E} [W_2^2(\lambda,\varLambda)] \le \mathbb{E} [W_2^2(\theta,\varLambda)] \qquad \forall \theta\in \mathcal W_2(\mathcal X). \end{aligned}$$

The generalisation with respect to the one-dimensional result is threefold. Firstly, since our main interest is the implication (A1–A2) ⇒ (B1–B2), we need not assume T to be injective. Secondly, the support of λ is not required to be compact. Lastly, the result holds in arbitrary dimension, including infinite-dimensional separable Hilbert spaces \(\mathcal X\). In particular, if t is a linear map, then \(\|\mathbf t\|{ }_{G_1}\) coincides with the operator norm of t, so the assumption is that t be a bounded self-adjoint nonnegative operator with mean identity and finite expected operator norm.

Proof

Optimality of t ensures that it has a convex potential ϕ, and strong and weak duality give

$$\displaystyle \begin{aligned} W_2^2(\lambda,\varLambda) &= {{\int_{\mathcal X} \! {\left(\frac 12\|x\|{}^2 - \phi(x)\right)} \, \mathrm{d}{\lambda(x)}}} + {{\int_{\mathcal X} \! {\left(\frac 12\|y\|{}^2 - \phi^*(y)\right)} \, \mathrm{d}{\varLambda(y)}}} ;\\ W_2^2(\theta,\varLambda) &\ge {{\int_{\mathcal X} \! {\left(\frac 12\|x\|{}^2 - \phi(x)\right)} \, \mathrm{d}{\theta(x)}}} + {{\int_{\mathcal X} \! {\left(\frac 12\|y\|{}^2 - \phi^*(y)\right)} \, \mathrm{d}{\varLambda(y)}}} . \end{aligned} $$

Formally taking expectations, using Fubini’s theorem and that \(\mathbb {E}\phi =\|\cdot \|{ }^2/2\) (since \(\mathbb {E} \mathbf t\) is the identity) yields

$$\displaystyle \begin{aligned} \mathbb{E} [W_2^2(\theta,\varLambda)] \ge {{\int_{\mathcal X} \! {\left(\!\frac 12\|x\|{}^2 - \mathbb{E}\phi(x)\!\right)\!} \, \mathrm{d}{\theta(x)}}} +\mathbb{E} \left[{{\int_{\mathcal X} \! {\!\left(\!\frac 12\|y\|{}^2 - \phi^*(y)\!\right)\!} \, \mathrm{d}{\varLambda(y)}}} \right]\! =\mathbb{E} [W_2^2(\lambda,\varLambda)]\end{aligned} $$

as required. The rigorous mathematical justification for this is given on page 88 in the supplement.

Remark 4.2.5

The “natural” space for t would be \(\mathcal L_2(\lambda )\), but without the continuity assumption, the result may fail (Álvarez-Esteban et al. [9, Example 3.1]). A simple argument shows that the growth condition imposed by the G 1 assumption is minimal; see page 89 in the supplement or Galasso et al. [58].

Remark 4.2.6

The same statement holds if \(\mathcal X\) is replaced by a (Borel) convex subset K thereof. The integrals will then be taken on K, showing that λ minimises the Fréchet functional among measures supported on K, and, by continuity, on \(\overline K\). By Proposition 3.2.4, λ is a Fréchet mean.

4.3 Estimation of Fréchet Means

4.3.1 Oracle Case

In view of the canonicity of the Wasserstein geometry in Sect. 4.2.2, separation of amplitude and phase variation of the point processes \(\widetilde \varPi _i\) essentially requires computing Fréchet means in the 2-Wasserstein space. It is both conceptually important and technically convenient to introduce the case where an oracle reveals the conditional mean measures Λ = T#λ entirely. Thus, assuming that \(\lambda \in \mathcal W_2(\mathcal X)\) is the unique Fréchet mean of a random measure Λ, the goal is to estimate the structural mean λ on the basis of independent and identically distributed realisations Λ 1, …, Λ n of Λ.

Given that λ is defined as the minimiser of the Fréchet functional

$$\displaystyle \begin{aligned} F(\gamma) =\frac 12 \mathbb{E} W_2^2(\varLambda,\gamma), \qquad \gamma\in \mathcal W_2(\mathcal X), \end{aligned}$$

it is natural to estimate λ by a minimiser, say λ n, of the empirical Fréchet functional

$$\displaystyle \begin{aligned} F_n(\gamma) =\frac 1{2n} \sum_{i=1}^n W_2^2(\varLambda_i,\gamma), \qquad \gamma\in \mathcal W_2(\mathcal X). \end{aligned}$$

A minimiser λ n exists by Corollary 3.1.3. When \(\mathcal X=\mathbb {R}\), λ n can be seen to be an unbiased estimator of λ in the generalised sense of Lehmann [88] (see Sect. 4.3.5).

The warp maps (and their inverses) can then be estimated as the optimal maps from λ n to each Λ i (see Sect. 4.3.4).

4.3.2 Discretely Observed Measures

In practice, one does not have the fortune of fully observing the inherently infinite-dimensional objects Λ 1, …, Λ n. A far more realistic scenario is that one only has access to a discrete version of Λ i, say \(\widetilde {\varLambda }_i\). The simplest situation is when \(\widetilde {\varLambda }_i\) arises as an empirical measure of the form \(\tau ^{-1}\sum _{j=1}^\tau \delta \{Y_j\}\), where Y j are independent with distribution Λ i. More generally, \(\widetilde \varLambda _i\) can be a normalised point process \(\widetilde \varPi _i\) with mean measure τΛ i, i.e.

$$\displaystyle \begin{aligned} \widetilde\varLambda_i =\frac 1 {\widetilde\varPi_i(\mathcal X)} \widetilde \varPi_i \quad \mathrm{with} \quad \mathbb{E} [\widetilde \varPi_i(A) | \varLambda_i] =\tau \varLambda_i(A), \qquad A\subseteq\mathcal X\ \text{Borel}. \end{aligned}$$

This encapsulates the case of empirical measure when τ is an integer and \(\widetilde \varPi _i\) is a binomial point process. The parameter τ is the expected number of observed points over the entire space \(\mathcal X\); the larger τ is, the more information \(\widetilde \varPi _i\) gives on Λ i.

Except when \(\widetilde \varLambda _i\) is an empirical measure, there is one difficulty in the above setting that needs to be addressed. Unless \(\widetilde \varPi _i\) is binomial, there is a positive probability that \(\widetilde \varPi _i(\mathcal X)=0\), in which case no points pertaining to Λ i are observed and the normalisation is undefined. In the asymptotic setup below, conditions will be imposed to ensure that this probability becomes negligible as n → ∞. For concreteness, we then define \(\widetilde \varLambda _i=\lambda ^{(0)}\) for some fixed measure λ (0) that will be of minor importance. This can be a Dirac measure at 0, a certain fixed Gaussian measure, or (normalised) Lebesgue measure on some bounded set in case \(\mathcal X=\mathbb {R}^d\). We can now replace the estimator λ n by \(\widetilde \lambda _n\), defined as any minimiser of

$$\displaystyle \begin{aligned} \widetilde F_n(\gamma) =\frac 1{2n} \sum_{i=1}^n W_2^2(\widetilde\varLambda_i,\gamma), \qquad \gamma\in \mathcal W_2(\mathcal X), \end{aligned}$$

which exists by Corollary 3.1.3.

As a generalisation of the discrete case discussed in Sect. 1.3, the Fréchet mean of discrete measures can be computed exactly. Suppose that \(N_i=\widetilde \varPi _i(\mathcal X)\) is nonzero for all i. Then each \(\widetilde \varLambda _i\) is a discrete measure supported on N i points. One can then recast the multimarginal formulation (see Sect. 3.1.2) as a finite linear program, solve it, and “average” the solution as in Proposition 3.1.2 in order to obtain \(\widetilde \lambda _n\) (an alternative linear programming formulation for finding a Fréchet mean is given by Anderes et al. [14]). Thus, \(\widetilde \lambda _n\) can be computed in finite time, even when \(\mathcal X\) is infinite-dimensional.

Finally, a remark about measurability is in order. Point processes can be viewed as random elements in \(M_+(\mathcal X)\) endowed with the vague topology induced from convergence of integrals of continuous functions with compact support. If μ n converge to μ vaguely and a n are numbers that converge to a, then a nμ n → aμ vaguely. Thus, \(\tilde \varLambda _i\) is a continuous function of the pair \((\widetilde \varPi _i,\widetilde \varPi _i(\mathcal X))\) and can be viewed as a random measure with respect to the vague topology. The restriction of the vague topology to probability measures is equivalent to the weak topology, and therefore vague, weak, and Wasserstein measurability are all equivalent.

4.3.3 Smoothing

Even when the computational complexity involved in calculating \(\widetilde \lambda _n\) is tractable, there is another reason not to use it as an estimator for λ. If one has a priori knowledge that λ is smooth, it is often desirable to estimate it by a smooth measure. One way to achieve this would be to apply some smoothing technique to \(\widetilde \lambda _n\) using, e.g., kernel density estimation. However, unless the number of observed points from each measure is the same, N 1 = ⋯ = N n = N, \(\widetilde \lambda _n\) will usually be concentrated on many points, essentially N 1 + ⋯ + N n of them. In other words, the Fréchet mean is concentrated on many more points than each of the measures \(\widetilde \varLambda _i\), thus potentially hindering its usefulness as a mean, because it will not be representative of the sample.

This is most easily seen when \(\mathcal X=\mathbb {R}\), in which case each \(\widetilde \varLambda _i\) is a discrete uniform measure on points \(x^i_1< x^i_2< \dots < x^i_{N_i}\), where we assume for simplicity that the points are not repeated (this will happen with probability one if Λ i is diffuse). If we now set G i to be the distribution function of \(\widetilde \varLambda _i\), then the quantile function \(G_i^{-1}\) is piecewise constant on each interval (k∕N i, (k + 1)∕N i] with jumps at

$$\displaystyle \begin{aligned} G_i^{-1}(k/N_i)=x^i_k ,\qquad k=1,2,\dots,N_i. \end{aligned}$$

The Fréchet mean has quantile function \(G^{-1}(u)=n^{-1}\sum G_i^{-1}(u)\) and will have jumps at every point of the form k∕N i for k ≤ N i and i = 1, …, n. In the worst-case scenario, when no two of the N i have a common divisor (other than 1), there will be

$$\displaystyle \begin{aligned} \left( \sum_{i=1}^n (N_i - 1)\right) + 1 =\left(\sum_{i=1}^n N_i\right) - n + 1 \end{aligned}$$

jumps for G −1, which is the number of points on which the Fréchet mean will be supported. (All the \(G_i^{-1}\)’s have a jump at u = 1, which thus needs to be counted once rather than n times.)
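This count can be checked directly by quantile averaging (a small simulation with illustrative sample sizes):

```python
import numpy as np

rng = np.random.default_rng(7)
Ns = [2, 3, 5]                                      # pairwise coprime sizes
samples = [np.sort(rng.uniform(0, 1, N)) for N in Ns]

u = np.linspace(1e-6, 1.0, 10_000)                  # quantile levels
# G_i^{-1}(u) = x_k^i for u in ((k-1)/N_i, k/N_i]
Ginv = [s[np.minimum(np.ceil(u * N).astype(int) - 1, N - 1)]
        for s, N in zip(samples, Ns)]
mean_quantile = np.mean(Ginv, axis=0)               # quantile of the mean
print(len(np.unique(mean_quantile)), sum(Ns) - len(Ns) + 1)   # both 8
```

The observed support size matches the bound \(\left(\sum N_i\right) - n + 1\) exactly in this coprime worst case.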

By counting the number of redundancies in the constraints matrix of the linear program, one can show that this is in general an upper bound on the number of support points of the Fréchet mean.

An alternative approach is to first smooth each observation \(\widetilde \varLambda _i\) and then calculate the Fréchet mean. Since it is easy to bound the Wasserstein distances when dealing with convolutions, we will employ kernel density estimation, although other smoothing approaches could be used as well.

To simplify the exposition, we provide the technical details only when \(\mathcal X=\mathbb {R}^d\), but a similar construction will work when the dimension of \(\mathcal X\) is infinite. Let \(\psi :\mathbb {R}^d\to (0,\infty )\) be a continuous, bounded, strictly positive isotropic density function with unit variance: ψ(x) = ψ 1(∥x∥) with ψ 1 nonincreasing and

$$\displaystyle \begin{aligned} {{\int_{\mathbb{R}^d} \! {\|x\|{}^2\psi(x)} \, \mathrm{d}x}} =1 = {{\int_{{\mathbb{R}^d}} \! {\psi(x)} \, \mathrm{d}x}} . \end{aligned}$$

(Besides the boundedness all these properties can be relaxed, and if \(\mathcal X=\mathbb {R}\) even boundedness is not necessary.) A classical example for ψ is the standard Gaussian density in \(\mathbb {R}^d\). Define the rescaled version ψ σ(x) = σ dψ(xσ) for all σ > 0. We can then replace \(\widetilde \varLambda _i\) by a smooth proxy \(\widetilde \varLambda _i \ast \psi _\sigma \). If \(\widetilde \varLambda _i\) is a sum of Dirac masses at \(x_1,\dots ,x_{N_i}\), then

$$\displaystyle \begin{aligned} \widetilde\varLambda_i \ast \psi_\sigma \quad \text{has density}\quad g(x) =\frac 1{N_i} \sum_{j=1}^{N_i}\psi_\sigma(x - x_j). \end{aligned}$$

If N i = 0, one can either use λ (0) or λ (0) ∗ ψ σ; this event will have negligible probability anyway.

For the purpose of approximating \(\widetilde \varLambda _i\), this convolution is an acceptable estimator, because as was seen in the proof of Theorem 2.2.7,

$$\displaystyle \begin{aligned} W_2^2(\widetilde\varLambda_i,\widetilde\varLambda_i \ast \psi_\sigma) \le \sigma^2. \end{aligned}$$

But the measure \(\widetilde \varLambda _i\ast \psi _\sigma \) has a strictly positive density throughout \(\mathbb {R}^d\). If we know that Λ is supported on a convex compact \(K\subset \mathbb {R}^d\), it is desirable to construct an estimator that has the same support K. The first idea that comes to mind is to project \(\widetilde \varLambda _i \ast \psi _\sigma \) to K (see Proposition 3.2.4), as this will further decrease the Wasserstein distance, but the resulting measure will then have positive mass on the boundary of K, and will not be absolutely continuous. We will therefore use a different strategy: eliminate all the mass outside K and redistribute it on K. The simplest way to do this is to restrict \(\widetilde \varLambda _i\ast \psi _\sigma \) to K and renormalise the restriction to be a probability measure. For technical reasons, it will be more convenient to bound the Wasserstein distance when the restriction and renormalisation are done separately on each point of \(\widetilde \varLambda _i\). This yields the measure

$$\displaystyle \begin{aligned} \widehat{\varLambda_i} =\frac 1{N_i} \sum_{j=1}^{N_i}\frac{\delta\{x_j\}\ast \psi_\sigma} {[\delta\{x_j\}\ast \psi_\sigma](K)}\bigg|{}_K. \end{aligned} $$
(4.2)

Lemma 4.4.2 below shows that \(W_2^2(\widetilde \varLambda _i,\widehat {\varLambda _i})\le C\sigma ^2\) for some finite constant C. It is apparent that \(\widehat {\varLambda _i}\) is a continuous function of \(\widetilde \varLambda _i\) and σ, so \(\widehat {\varLambda _i}\) is measurable; in any case this is not particularly important because σ will vanish, so \(\widehat {\varLambda _i}=\widetilde \varLambda _i\) asymptotically and the latter is measurable.
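For concreteness, the following sketch evaluates the density of (4.2) on a grid when K = [0, 1] and ψ is the standard Gaussian density (the points, bandwidth, and grid are illustrative):

```python
import numpy as np
from scipy.stats import norm

def density_42(points, sigma, grid):
    """Density on K = [0,1] of (4.2): per-point restricted and renormalised
    Gaussian smoothing of a discrete uniform measure."""
    dens = np.zeros_like(grid)
    for xj in points:
        mass_in_K = norm.cdf(1.0, xj, sigma) - norm.cdf(0.0, xj, sigma)
        dens += norm.pdf(grid, xj, sigma) / mass_in_K
    return dens / len(points)

grid = np.linspace(0.0, 1.0, 501)
dens = density_42(np.array([0.1, 0.4, 0.45, 0.9]), sigma=0.05, grid=grid)
print(dens.mean())                     # approx 1: a probability density on K
```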

Our final estimator \(\widehat \lambda _n\) for λ is defined as the minimiser of

$$\displaystyle \begin{aligned} \widehat F_n(\gamma) =\frac 1{2n} \sum_{i=1}^n W_2^2(\widehat\varLambda_i,\gamma), \qquad \gamma\in \mathcal W_2(\mathcal X). \end{aligned}$$

Since the measures \(\widehat {\varLambda _i}\) are absolutely continuous, \(\widehat \lambda _n\) is unique. We refer to \(\widehat {\lambda }_n\) as the regularised Fréchet–Wasserstein estimator, where the regularisation comes from the smoothing and the possible restriction to K.

In the case \(\mathcal X=\mathbb {R}\), \(\widehat \lambda _n\) can be constructed via averaging of quantile functions. Let \(\widehat G_i\) be the distribution function of \(\widehat \varLambda _i\). Then \(\widehat \lambda _n\) is the measure with quantile function

$$\displaystyle \begin{aligned} F_{\widehat\lambda_n}^{-1}(u) =\frac 1n \sum_{i=1}^n \widehat G_i^{-1}(u), \qquad u\in (0,1),\end{aligned} $$

and distribution function

$$\displaystyle \begin{aligned} F_{\widehat\lambda_n}(x) =[F_{\widehat\lambda_n}^{-1}]^{-1}(x).\end{aligned} $$

By construction, the \(\widehat G_i\) are continuous and strictly increasing, so the inverses are proper inverses and one does not need to use the right-continuous inverse as in Sect. 3.1.4.

If \(\mathcal X=\mathbb {R}^d\) and d ≥ 2, then there is no explicit expression for \(\widehat \lambda _n\), although it exists and is unique. In the next chapter, we present a steepest descent algorithm that approximately constructs \(\widehat \lambda _n\) by taking advantage of the differentiability properties of the Fréchet functional \(\widehat F_n\) in Sect. 3.1.6.

4.3.4 Estimation of Warpings and Registration Maps

Once estimators \(\widehat {\varLambda _i}\), i = 1, …, n and \(\widehat \lambda _n\) are constructed, it is natural to estimate the map \(T_i={\mathbf t_{\lambda }^{\varLambda _i}}\) and its inverse \(T_i^{-1}={\mathbf t_{{\varLambda _i}}^{\lambda }}\) (when Λ i are absolutely continuous; see the discussion after Assumptions 3 below) by the plug-in estimators

$$\displaystyle \begin{aligned} \widehat{T_i} =\mathbf t_{\widehat\lambda_n}^{\widehat\varLambda_i} ,\qquad \widehat{T_i^{-1}} =(\widehat {T_i})^{-1} =\mathbf t_{\widehat\varLambda_i}^{\widehat\lambda_n}. \end{aligned}$$

The latter, the registration maps, can then be used in order to register the points Π i via

$$\displaystyle \begin{aligned} \widehat{\varPi_i^{(n)}} =\widehat{T_i^{-1}} \# \widetilde\varPi_i^{(n)} =\left[ \widehat{T_i^{-1}} \circ T_i \right] \# \varPi_i^{(n)}. \end{aligned}$$

It is thus reasonable to expect that if \(\widehat {T_i^{-1}}\) is a good estimator, then its composition with T i should be close to the identity and \(\widehat {\varPi _i}\) should be close to Π i.
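When \(\mathcal X=\mathbb {R}\), these plug-in estimators have an explicit form, since the optimal map from an atomless μ to ν is \(F_\nu ^{-1}\circ F_\mu \). The following is a minimal sketch, with hypothetical samples standing in for \(\widehat \lambda _n\) and \(\widehat {\varLambda _i}\):

```python
import numpy as np

def optimal_map_1d(src, dst, x):
    """Estimate t_mu^nu(x) = F_nu^{-1}(F_mu(x)) from samples of mu and nu."""
    src = np.sort(src)
    F = np.searchsorted(src, x, side='right') / len(src)   # empirical cdf
    return np.quantile(dst, np.clip(F, 0.0, 1.0))

rng = np.random.default_rng(8)
lam = rng.uniform(0, 1, 5000)          # stand-in sample for lambda_hat_n
Lam_i = lam ** 2                       # sample warped by T_i(x) = x^2
x = np.array([0.25, 0.5, 0.75])
print(np.round(optimal_map_1d(lam, Lam_i, x), 3))   # approx x ** 2
```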

4.3.5 Unbiased Estimation When \(\mathcal X=\mathbb {R}\)

In the same way that Fréchet means extend the notion of mean to non-Hilbertian spaces, they also extend the definition of unbiased estimators. Let H be a separable Hilbert space (or a convex subset thereof) and suppose that \(\widehat \theta \) is a random element in H whose distribution μ θ depends on a parameter θ ∈ H. Then \(\widehat {\theta }\) is unbiased for θ if for all θ ∈ H

$$\displaystyle \begin{aligned} \mathbb{E}_\theta \widehat\theta ={{\int_{H} \! {x} \, \mathrm{d}{\mu_\theta(x)}}} =\theta. \end{aligned}$$

(We use the standard notation \(\mathbb {E}_\theta g(\widehat {\theta })={\int \! g(x) \, \mathrm {d}\mu _\theta (x)}\) in the sequel.) This is equivalent to

$$\displaystyle \begin{aligned} \mathbb{E}_\theta \|\theta - \widehat{\theta}\|{}^2 \le \mathbb{E}_\theta \|\gamma - \widehat{\theta}\|{}^2 ,\qquad \forall\theta,\gamma \in H. \end{aligned}$$

In view of that, one can define unbiased estimators of \(\lambda \in \mathcal W_2\) as measurable functions δ = δ(Λ 1, …, Λ n) for which

$$\displaystyle \begin{aligned} \mathbb{E}_{\lambda}W_2^2(\lambda,\delta) \le\mathbb{E}_{\lambda}W_2^2(\gamma,\delta), \qquad \forall\lambda,\gamma\in \mathcal W_2. \end{aligned}$$

This definition was introduced by Lehmann [88].

Unbiased estimators allow us to avoid the problem of over-registering (the so-called pinching effect; Kneip and Ramsay [82, Section 2.4]; Marron et al. [90, p. 476]). An extreme example of over-registration is if one “aligns” all the observed patterns into a single fixed point x 0. The registration will then seem “successful” in the sense of having no residual phase variation, but the estimation is clearly biased because the points are not registered to the correct reference measure. Thus, requiring the estimator to be unbiased is an alternative to penalising the registration maps.

Due to the Hilbert space embedding of \(\mathcal W_2(\mathbb {R})\), it is possible to characterise unbiased estimators in terms of a simple condition on their quantile functions. As a corollary, λ n, the Fréchet mean of {Λ 1, …, Λ n}, is unbiased. Our regularised Fréchet–Wasserstein estimator \(\hat \lambda _n\) can then be interpreted as approximately unbiased, since it approximates the unobservable λ n.

Proposition 4.3.1 (Unbiased Estimators in \(\mathcal W_2(\mathbb {R})\))

Let Λ be a random measure in \(\mathcal W_2(\mathbb {R})\) with finite Fréchet functional and let λ be the unique Fréchet mean of Λ (Theorem 3.2.11 ). An estimator δ constructed as a function of a sample (Λ 1, …, Λ n) is unbiased for λ if and only if the left-continuous representatives (in L 2(0, 1)) satisfy \(\mathbb {E} [F_\delta ^{-1}(x)]=F_\lambda ^{-1}(x)\) for all x ∈ (0, 1).

Proof

The proof is straightforward from the definition: δ is unbiased if and only if for all λ and all γ,

$$\displaystyle \begin{aligned} \mathbb{E}_{\lambda}\|F^{-1}_{\lambda}-F^{-1}_{\delta}\|{}^2_{L_2} \le \mathbb{E}_{\lambda}\|F^{-1}_{\gamma}-F^{-1}_{\delta}\|{}^2_{L_2}, \end{aligned}$$

which is equivalent to \(\mathbb {E}_\lambda [F_\delta ^{-1}]=F_\lambda ^{-1}\). In other words, these two functions must be equal almost everywhere on (0, 1), and their left-continuous representatives must be equal everywhere (the fact that \(\mathbb {E}_\lambda [F_\delta ^{-1}]\) has such a representative was established in Sect. 3.1.4).

To show that δ = λ n is unbiased, we simply invoke Theorem 3.2.11 twice to see that

$$\displaystyle \begin{aligned} \mathbb{E} [F_\delta^{-1}] =\mathbb{E} \left[\frac 1n \sum_{i=1}^n F_{\varLambda_i}^{-1}\right] =\mathbb{E} [F_\varLambda^{-1}] =F_\lambda^{-1}, \end{aligned}$$

which proves unbiasedness of δ.

4.4 Consistency

In functional data analysis, one often assumes that the number of curves n and the number of observed points per curve m both diverge to infinity. An analogous framework for point processes would similarly require the number of point processes n as well as the expected number of points τ per process to diverge. A technical complication arises, however, because the mean measures do not suffice to characterise the distribution of the processes. Indeed, if one is given a point process Π with mean measure λ (not necessarily a probability measure), and τ is an integer, there is no unique way to define a process Π (τ) with mean measure τλ. One can define Π (τ) = τΠ, so that every point of Π is counted τ times. Such a construction, however, can never yield a consistent estimator of λ, even when τ → ∞.

Another way to generate a point process with mean measure τλ is to take a superposition of τ independent copies of Π. In symbols, this means

$$\displaystyle \begin{aligned} \varPi^{(\tau)} =\varPi_1 + \dots + \varPi_\tau, \end{aligned}$$

with (Π i) independent, each having the same distribution as Π. This superposition scheme makes it possible to invoke the law of large numbers. If τ is not an integer, then this construction is not well-defined, but it can be made so by assuming that the distribution of Π is infinitely divisible. The reader willing to assume that τ is always an integer can safely skip to Sect. 4.4.1; all the main ideas are developed first for integer values of τ and then extended to the general case.
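For concreteness, here is a minimal Python sketch of the superposition scheme in the simplest case: Π is taken to be a Poisson process on [0, 1] (sampled by thinning, with an illustrative intensity chosen here), so that Π (τ) is obtained by concatenating τ independent realisations.

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_process(intensity_max, intensity, rng):
    """Sample a Poisson process on [0, 1] by thinning a homogeneous one."""
    n = rng.poisson(intensity_max)
    x = rng.uniform(0.0, 1.0, size=n)
    keep = rng.uniform(0.0, intensity_max, size=n) < intensity(x)
    return x[keep]

def superposition(tau, intensity_max, intensity, rng):
    """Pi^(tau): superposition of tau independent copies of Pi."""
    return np.concatenate(
        [poisson_process(intensity_max, intensity, rng) for _ in range(tau)]
    )

f = lambda x: 2 * x   # mean measure lambda with density 2x on [0, 1]
pts = superposition(tau=50, intensity_max=2.0, intensity=f, rng=rng)
print(len(pts))       # roughly tau * 1 = 50 points on average
```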

A point process Π is infinitely divisible if for every integer m there exists a collection of m independent and identically distributed \(\varPi _i^{(1/m)}\) such that

$$\displaystyle \begin{aligned} \varPi =\varPi_1^{(1/m)} + \dots + \varPi_m^{(1/m)} \qquad \text{in distribution}. \end{aligned}$$

If Π is infinitely divisible and τ = k∕m is rational, then one can define Π (τ) using k independent copies of Π (1∕m):

$$\displaystyle \begin{aligned} \varPi^{(\tau)} =\sum_{i=1}^{k} \varPi_i^{(1/m)}. \end{aligned}$$

One then deals with irrational τ via duality and continuity arguments, as follows. Define the Laplace functional of Π by

$$\displaystyle \begin{aligned} L_\varPi(f) =\mathbb{E} \left[ e^{-\varPi f}\right] =\mathbb{E} \left[\exp \left(-{{\int_{\mathcal X} \! f \, \mathrm{d}\varPi}} \right)\right]\in[0,1], \qquad f:\mathcal X\to\mathbb{R}_+ \quad \mathrm{Borel}. \end{aligned}$$

The Laplace functional characterises the distribution of the point process, generalising the notion of Laplace transform of a random variable or vector (Karr [79, Theorem 1.12]). By definition, it translates convolutions into products. When Π = Π (1) is infinitely divisible, the Laplace functional L 1 of Π takes the form (Kallenberg [75, Chapter 6]; Karr [79, Theorem 1.43])

$$\displaystyle \begin{aligned} L_1(f)=\mathbb{E}\left[e^{-\varPi^{(1)} f}\right] =\exp\left[-{\int_{M_+(\mathcal X)} \! (1-e^{-\mu f}) \, \mathrm{d}\boldsymbol\rho(\mu)}\right] \quad \text{for some } \boldsymbol{\rho} \in M_+(M_+(\mathcal X)). \end{aligned}$$

The Laplace functional of Π (τ) is L τ(f) = [L 1(f)]τ for any rational τ, which simply amounts to multiplying the measure ρ by the scalar τ. One can then do the same for an irrational τ, and the resulting Laplace functional determines the distribution of Π (τ) for all τ > 0.
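Concretely, raising L 1 to the power τ amounts to multiplying ρ by τ inside the exponent:

$$\displaystyle \begin{aligned} L_\tau(f) =[L_1(f)]^\tau =\exp\left[-\tau{\int_{M_+(\mathcal X)} \! (1-e^{-\mu f}) \, \mathrm{d}\boldsymbol\rho(\mu)}\right] =\exp\left[-{\int_{M_+(\mathcal X)} \! (1-e^{-\mu f}) \, \mathrm{d}(\tau\boldsymbol\rho)(\mu)}\right], \end{aligned}$$

and since \(\tau \boldsymbol \rho \in M_+(M_+(\mathcal X))\) for any real τ > 0, the right-hand side is again a valid Laplace functional of an infinitely divisible point process.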

4.4.1 Consistent Estimation of Fréchet Means

We are now ready to define our asymptotic setup. The following assumptions will be made. Notice that the Wasserstein geometry does not appear explicitly in these assumptions, but is rather derived from them in view of Theorem 4.2.4. The compactness requirement can be relaxed under further moment conditions on λ and the point process Π; we focus on the compact case for simplicity and because in practice the point patterns will be observed on a bounded observation window.

Assumptions 3

Let \(K\subset \mathbb {R}^d\) be a compact convex nonempty set, λ an absolutely continuous probability measure on K, and τ n a sequence of positive numbers. Let Π be a point process on K with mean measure λ. Finally, define U = intK.

  • For every n, let \(\{\varPi _1^{(n)},\dots ,\varPi _n^{(n)}\}\) be independent point processes, each having the same distribution as a superposition of τ n copies of Π.

  • Let T be a random injective function on K (viewed as a random element in C b(K, K) endowed with the supremum norm) such that T(x) ∈ U for x ∈ U (that is, T ∈ C b(U, U)), with nonsingular derivative \(\nabla T(x)\in \mathbb {R}^{d\times d}\) for almost all x ∈ U, and that is the gradient of a convex function. Let {T 1, …, T n} be independent and identically distributed as T.

  • For every x ∈ U, assume that \(\mathbb {E} [T(x)]=x\).

  • Assume that the collections \(\{T_n\}_{n=1}^\infty \) and \(\{\varPi _i^{(n)}\}_{i\le n,n=1,2,\dots }\) are independent.

  • Let \(\widetilde {\varPi }_i^{(n)}={T_i}\#\varPi _i^{(n)}\) be the warped point processes, having conditional mean measures \(\varLambda _i=T_i\#\lambda =\tau _n^{-1}\mathbb {E}\Big \{\widetilde {\varPi }_i^{(n)}\Big |T_i\Big \}\).

  • Define \(\widehat \varLambda _i\) by the smoothing procedure (4.2), using bandwidth \(\sigma _i^{(n)}\in [0,1]\) (possibly random).

The dependence of the estimators on n will sometimes be tacit; note, however, that Λ i does not depend on n.

By virtue of Theorem 4.2.4, λ is a Fréchet mean of the random measure Λ = T#λ. Uniqueness of this Fréchet mean will follow from Proposition 3.2.7 if we show that Λ is absolutely continuous with positive probability. This is indeed the case, since T is injective and has a nonsingular Jacobian matrix; see Ambrosio et al. [12, Lemma 5.5.3]. The Jacobian assumption can be removed when \(\mathcal X=\mathbb {R}\), because Fréchet means are always unique by Theorem 3.2.11.

Notice that there is no assumption about the dependence between rows. Assumptions 3 thus cover, in particular, two different scenarios:

  • Full independence: here the point processes are independent across rows, that is, \(\varPi _i^{(n)}\) and \(\varPi _i^{(n+1)}\) are also independent.

  • Nested observations: here \(\varPi _i^{(n+1)}\) includes the same points as \(\varPi _i^{(n)}\) and additional points, that is, \(\varPi _i^{(n+1)}\) is a superposition of \(\varPi _i^{(n)}\) and an independent point process distributed as Π (τ n+1 − τ n).

Needless to say, Assumptions 3 also encompass binomial processes when τ n are integers, as well as Poisson processes or, more generally, Poisson cluster processes.

We now state and prove the consistency result for the estimators of the conditional mean measures Λ i and the structural mean measure λ.

Theorem 4.4.1 (Consistency)

If Assumptions 3 hold, \(\sigma _n=n^{-1}\sum _{i=1}^n\sigma _i^{(n)}\to 0\) almost surely and τ n →∞ as n →∞, then:

  1.

    The estimators \(\widehat{\varLambda _{i}}\) defined by (4.2), constructed with bandwidth \(\sigma =\sigma _i^{(n)}\), are Wasserstein-consistent for the conditional mean measures: for all i such that \(\sigma _i^{(n)}\stackrel {p}\to 0\)

    $$\displaystyle \begin{aligned} W_2\left(\widehat{\varLambda_i},\varLambda_i\right) \stackrel{p}{\longrightarrow}0 ,\qquad \mathit{\mbox{ as }}n\to\infty; \end{aligned}$$
  2.

    The regularised Fréchet–Wasserstein estimator of the structural mean measure (as described in Sect. 4.3) is strongly Wasserstein-consistent,

    $$\displaystyle \begin{aligned} W_2(\widehat{\lambda}_n,\lambda)\stackrel{a.s.}{\longrightarrow}0, \qquad \mathit{\mbox{ as }}n\to\infty. \end{aligned}$$

Convergence in 1. holds almost surely under the additional conditions that \(\sum _{n=1}^\infty \tau _n^{-2} <\infty \) and \(\mathbb {E}\left [\varPi (\mathbb {R}^d)^4\right ]<\infty \). If σ n → 0 only in probability, then convergence in 2. still holds in probability.

Theorem 4.4.1 still holds without smoothing (σ n = 0). In that case, \(\widehat {\lambda }_n=\tilde \lambda _n\) is possibly not unique, and the theorem should be interpreted in a set-valued sense (as in Proposition 1.7.8): almost surely, any choice of minimisers \(\tilde \lambda _n\) converges to λ as n →∞.

The preceding paragraph notwithstanding, we will usually assume that some smoothing is present, in which case \(\widehat {\lambda }_n\) is unique and absolutely continuous by Proposition 3.1.8. The uniform Lipschitz bounds for the objective function show that if we restrict the relevant measures to be absolutely continuous, then \(\widehat \lambda _n\) is a continuous function of \((\widehat {\varLambda _1},\dots ,\widehat {\varLambda _n})\), and hence \(\widehat \lambda _n:(\varOmega ,\mathcal F,{\mathbb {P}})\to \mathcal W_2(K)\) is measurable. In any case, measurability is a minor issue, because many arguments in the proof hold for each ω ∈ Ω separately: even if \(\widehat \lambda _n\) fails to be measurable, the proof shows that the convergence holds outer almost surely or in outer probability.

The first step in proving consistency is to show that the Wasserstein distance between the unsmoothed and the smoothed estimators of Λ i vanishes with the smoothing parameter. The exact rate of decay will be important later for establishing the rate of convergence of \(\widehat {\lambda }_n\) to λ, and is determined next.

Lemma 4.4.2 (Smoothing Error)

There exists a finite constant C ψ,K , depending only on ψ and on K, such that

$$\displaystyle \begin{aligned} W_2^2\left(\widehat{\varLambda_i}, \widetilde\varLambda_i \right) \le C_{\psi,K}\sigma^2 \quad \text{if } \sigma\le 1. \end{aligned} $$
(4.3)

Since the smoothing parameter will anyway vanish, this restriction to small values of σ is not binding. The constant C ψ,K is explicit. When \(\mathcal X=\mathbb {R}\), a more refined construction allows one to improve this constant in some situations; see Panaretos and Zemel [100, Lemma 1].

Proof

The idea is that (4.2) is a sum of measures with mass 1∕N i that can all be sent to the relevant point x j; we refer to page 98 in the supplement for the precise details.

Proof(Proof of Theorem 4.4.1) The proof, detailed on page 97 of the supplement, proceeds in the following steps: first, one shows the convergence in probability of \(\widehat {\varLambda _i}\) to Λ i. This is essentially a corollary of Karr [79, Proposition 4.8] and the smoothing bound from Lemma 4.4.2.

To prove claim (2), one considers the following functionals defined on \(\mathcal W_2(K)\):

$$\displaystyle \begin{aligned} F(\gamma) &=\frac 12 \mathbb{E} W_2^2(\varLambda,\gamma);\\ F_n(\gamma) &=\frac 1{2n} \sum_{i=1}^n W_2^2(\varLambda_i,\gamma);\\ \tilde F_n(\gamma) &=\frac 1{2n} \sum_{i=1}^n W_2^2(\widetilde\varLambda_i,\gamma) ,\qquad \widetilde \varLambda_i=\frac{\widetilde\varPi_i^{(n)}} {N_i^{(n)}} \quad \text{or } \lambda^{(0)} \text{ if } N_i^{(n)}=0;\\ \widehat F_n(\gamma) &=\frac 1{2n} \sum_{i=1}^n W_2^2(\widehat{\varLambda_i},\gamma) ,\qquad \widehat\varLambda_i = \lambda^{(0)} \text{ if } N_i^{(n)}=0. \end{aligned} $$

Since K is compact, they are all locally Lipschitz, so their differences can be controlled by the distances between Λ i, \(\widetilde {\varLambda }_i\), and \(\widehat {\varLambda _i}\). The first distance vanishes since the intensity τ →∞, and the second by the smoothing bound. Another compactness argument yields that \(\widehat {F}_n\to F\) uniformly on \(\mathcal W_2(K)\), and so the minimisers converge.

The almost sure convergence in (1) is proven as follows. Under the stronger conditions at the end of the theorem’s statement, for any fixed \(a=(a_1,\dots ,a_d)\in \mathbb {R}^d\),

$$\displaystyle \begin{aligned} {\mathbb{P}}\left( \frac{\widetilde \varPi_i^{(n)}((-\infty,a])} {\tau_n} - \varLambda_i((-\infty,a]) \to 0 \right) = 1 \end{aligned}$$

by the law of large numbers. This extends to all rational a’s, then to all a by approximation. The smoothing error is again controlled by Lemma 4.4.2.

4.4.2 Consistency of Warp Functions and Inverses

We next discuss the consistency of the warp and registration function estimators. These are the key elements for aligning the observed point patterns \(\widetilde \varPi _i\). Recall that we have consistent estimators \(\widehat {\varLambda _i}\) for Λ i and \(\widehat \lambda _n\) for λ. Then \(T_i={\mathbf t_{\lambda }^{{\varLambda _i}}}\) is estimated by \({\mathbf t_{\widehat \lambda _n}^{\widehat {\varLambda _i}}}\) and \(T_i^{-1}\) is estimated by \({\mathbf t_{{\widehat {\varLambda _i}}}^{{\widehat {\lambda }_n}}}\). We will make the following extra assumptions that lead to more transparent statements (otherwise one needs to replace K with the set of Lebesgue points of the supports of λ and Λ i).
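On the real line, these estimators are explicit: the optimal map from an atomless μ to ν is the composition \(F_\nu^{-1}\circ F_\mu\) (by the quantile formula of Sect. 1.5). The following minimal Python sketch builds such a map from two samples; the Gaussian samples are illustrative stand-ins for \(\widehat \lambda _n\) and \(\widehat {\varLambda _i}\) only.

```python
import numpy as np

def optimal_map_1d(source_sample, target_sample):
    """Return x -> F_target^{-1}(F_source(x)) from empirical samples."""
    src = np.sort(source_sample)
    def T(x):
        u = np.searchsorted(src, x, side="right") / len(src)  # empirical CDF
        return np.quantile(target_sample, u)                  # empirical quantile
    return T

rng = np.random.default_rng(2)
lam_hat = rng.normal(0.0, 1.0, 5000)              # stand-in for hat_lambda_n
Lam_hat = 1.0 + 2.0 * rng.normal(0.0, 1.0, 5000)  # stand-in for hat_Lambda_i
T_hat = optimal_map_1d(lam_hat, Lam_hat)
print(T_hat(np.array([-1.0, 0.0, 1.0])))          # approximately 2x + 1
```

The registration map \(T_i^{-1}\) is estimated in exactly the same way, with the roles of the two samples interchanged.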

Assumptions 4 (Strictly Positive Measures)

In addition to Assumptions 3 suppose that:

  1.

    λ has a positive density on K (in particular, suppλ = K);

  2.

    T is almost surely surjective on U = intK (thus a homeomorphism of U).

As a consequence \(\mathrm {supp}\varLambda =\mathrm {supp} (T\#\lambda )=\overline {T(\mathrm {supp}\lambda )}=K\) almost surely.

Theorem 4.4.3 (Consistency of Optimal Maps)

Let Assumptions 4 be satisfied in addition to the hypotheses of Theorem 4.4.1. Then for any i such that \(\sigma _i^{(n)}\stackrel p\to 0\) and any compact set S ⊆intK,

$$\displaystyle \begin{aligned} \sup_{x\in S}\|\widehat {T_i^{-1}}(x) - T_i^{-1}(x)\|\stackrel{\mathrm{{p}}}\to0, \qquad \sup_{x\in S}\|\widehat {T_i}(x) - T_i(x)\|\stackrel{\mathrm{{p}}}\to0. \end{aligned}$$

Almost sure convergence can be obtained under the same provisions made at the end of the statement of Theorem 4.4.1.

A few technical remarks are in order. First and foremost, it is not clear that the two suprema are measurable. Even though T i and \(T_i^{-1}\) are random elements in \(C_b(U,\mathbb {R}^d)\), their estimators are only defined in an L 2 sense. The proof of Theorem 4.4.3 is done ω-wise. That is, for any ω in the probability space such that Theorem 4.4.1 holds, the two suprema vanish as n →∞. In other words, the convergence holds in outer probability or outer almost surely.

Secondly, assuming positive smoothing, the random measures \(\widehat {\varLambda _i}\) are smooth with densities bounded below on K, so \(\widehat {T_i^{-1}}\) are defined on the whole of U (possibly as set-valued functions on a Λ i-null set). But the only known regularity result for \(\widehat \lambda _n\) is an upper bound on its density (Proposition 3.1.8), so it is unclear what its support is, and consequently what the domain of definition of \(\widehat {T_i}\) is.

Lastly, when the smoothing parameter σ is zero, \(\widehat {T_i}\) and \(\widehat {T_i^{-1}}\) are not defined. Nevertheless, Theorem 4.4.3 still holds in the set-valued formulation of Proposition 1.7.11, of which it is a rather simple corollary:

Proof(Proof of Theorem 4.4.3) The proof amounts to setting the scene in order to apply Proposition 1.7.11 on stability of optimal maps. We define

$$\displaystyle \begin{aligned} \mu_n = \widehat\varLambda_i; \qquad \nu_n = \widehat\lambda_n; \qquad \mu = \varLambda_i; \qquad \nu = \lambda; \qquad u_n = \widehat T_i^{-1}; \qquad u = T_i^{-1}, \end{aligned}$$

and verify the conditions of the proposition. The weak convergence of μ n to μ and ν n to ν is the conclusion of Theorem 4.4.1; the finiteness is apparent because K is compact and the uniqueness follows from the assumed absolute continuity of Λ i. Since in addition \(T_i^{-1}\) is uniquely defined on U = intK which is an open convex set, the restrictions on Ω in Proposition 1.7.11 are redundant. Uniform convergence of \(\widehat {T_i}\) to T i is proven in the same way.

Corollary 4.4.4 (Consistency of Point Pattern Registration)

For any i such that \(\sigma _i^{(n)}\stackrel p\to 0\) ,

$$\displaystyle \begin{aligned} W_2\left( \frac{\widehat {\varPi_i^{(n)}}} {N_i^{(n)}}, \frac{\varPi_i^{(n)}} {N_i^{(n)}} \right)\stackrel{\mathrm{{p}}}\to0. \end{aligned}$$

The division by the number of observed points ensures that the resulting measures are probability measures; the relevant information is contained in the point patterns themselves, and is invariant under this normalisation.

Proof

The law of large numbers entails that \(N_i^{(n)}/\tau _n\to 1\), so in particular \(N_i^{(n)}\) is almost surely not zero when n is large. Since \(\widehat {\varPi _i^{(n)}}=(\widehat {T_i^{-1}}\circ T_i)\#\varPi _i^{(n)}\), we have the upper bound

$$\displaystyle \begin{aligned} W_2^2\left( \frac{\widehat {\varPi_i^{(n)}}} {N_i^{(n)}}, \frac{\varPi_i^{(n)}} {N_i^{(n)}} \right) \le {{\int_{K} \! {\|\widehat {T_i^{-1}}(T_i(x)) - x\|{}^2} \, \mathrm{d}{\frac{\varPi_i^{(n)}}{N_i^{(n)}}}}} . \end{aligned}$$

Fix a compact Ω ⊆intK and split the integral into the contributions from Ω and its complement. Then

$$\displaystyle \begin{aligned} {\int_{K\setminus \varOmega} \! \|\widehat {T_i^{-1}}(T_i(x)) - x\|{}^2 \, \mathrm{d}\frac{\varPi_i^{(n)}}{N_i^{(n)}}} \le d_K^2\frac{\varPi_i^{(n)}(K\setminus\varOmega)} {\tau_n} \frac{\tau_n} {N_i^{(n)}} \stackrel{\mathrm{a.s.}}\to d_K^2\lambda(K\setminus\varOmega), \end{aligned}$$

by the law of large numbers, where d K is the diameter of K. By writing intK as a countable union of compact sets (and since λ is absolutely continuous), this can be made arbitrarily small by choice of Ω.

We can easily bound the integral on Ω by

$$\displaystyle \begin{aligned} {{\int_{\varOmega} \! {\|\widehat {T_i^{-1}}(T_i(x)) - x\|{}^2} \, \mathrm{d}{\frac{\varPi_i^{(n)}}{N_i^{(n)}}}}} \le \sup_{x\in\varOmega} \|\widehat {T_i^{-1}}(T_i(x)) - x\|{}^2 = \sup_{y\in T_i(\varOmega)} \|\widehat {T_i^{-1}}(y) - T_i^{-1}(y)\|{}^2. \end{aligned}$$

But T i(Ω) is a compact subset of U = intK, because T i ∈ C b(U, U). The right-hand side therefore vanishes as n →∞ by Theorem 4.4.3, and this completes the proof.

Possible extensions pertaining to the boundary of K are discussed on page 33 of the supplement.

4.5 Illustrative Examples

In this section, we illustrate the estimation framework put forth in this chapter by considering an example of a structural mean λ with a bimodal density on the real line. The unwarped point patterns Π originate from Poisson processes with mean measure λ and, consequently, the warped points \(\widetilde \varPi \) are Cox processes (see Sect. 4.1.2). Another scenario involving triangular densities can be found in Panaretos and Zemel [100].

4.5.1 Explicit Classes of Warp Maps

As a first step, we introduce a class of random warp maps satisfying Assumptions 2, that is, increasing maps whose mean is the identity function. The construction is a mixture version of similar maps considered by Wang and Gasser [128, 129].

For any integer k define ζ k : [0, 1] → [0, 1] by

$$\displaystyle \begin{aligned} \zeta_0(x) = x,\qquad \zeta_k(x) = x - \frac{\sin{}(\pi kx)}{|k|\pi},\quad k\in\mathbb{Z}\setminus\{0\}. \end{aligned} $$
(4.4)

Clearly ζ k(0) = 0, ζ k(1) = 1 and ζ k is smooth and strictly increasing for all k. Figure 4.4a plots ζ k for k = −3, …, 3. To make ζ k a random function, we let k be an integer-valued random variable. If the latter is symmetric, then we have

$$\displaystyle \begin{aligned} \mathbb{E} \left[\zeta_k(x)\right] = x ,\qquad x\in[0,1]. \end{aligned}$$

By means of mixtures, we replace this discrete family by a continuous one: let J > 1 be an integer and V = (V 1, …, V J) be a random vector following the flat Dirichlet distribution (uniform on the set of nonnegative vectors with v 1 + ⋯ + v J = 1). Independently, take k 1, …, k J distributed as k and define

$$\displaystyle \begin{aligned} T(x) =\sum_{j=1}^J V_j \zeta_{k_j}(x). \end{aligned} $$
(4.5)

Since each V j is positive, T is increasing, and since (V j) sums up to unity, T has the identity as its mean. Realisations of these warp functions are given in Fig. 4.4b and c for J = 2 and J = 10, respectively. The parameters (k j) were chosen as symmetrised Poisson random variables: each k j has the law of XY with X Poisson with mean 3, \({\mathbb {P}}(Y=1)={\mathbb {P}}(Y=-1)=1/2\), and X and Y independent. When J = 10 is large, the function T deviates only mildly from the identity, since a law of large numbers begins to take effect. In contrast, J = 2 yields functions that are quite different from the identity. Thus, it can be said that the parameter J controls the variance of the random warp function T.
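The construction (4.4)–(4.5) is straightforward to simulate; a minimal Python sketch (with the symmetrised Poisson law for k j just described) generating realisations of T:

```python
import numpy as np

rng = np.random.default_rng(3)

def zeta(k, x):
    """zeta_k of (4.4); zeta_0 is the identity."""
    if k == 0:
        return x
    return x - np.sin(np.pi * k * x) / (abs(k) * np.pi)

def random_warp(J, rng):
    """T of (4.5): Dirichlet mixture of zeta_{k_j}, symmetrised Poisson k_j."""
    V = rng.dirichlet(np.ones(J))
    k = rng.poisson(3, size=J) * rng.choice([-1, 1], size=J)
    return lambda x: sum(v * zeta(kj, x) for v, kj in zip(V, k))

T = random_warp(J=2, rng=rng)
x = np.linspace(0.0, 1.0, 5)
print(T(x))   # increasing, with T(0) = 0 and T(1) = 1
```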

Fig. 4.4

(a) The functions {ζ −3, …, ζ 3}; (b) realisations of T defined by (4.5) with J = 2 and k j symmetrisations of Poisson random variables with mean 3; (c) realisations of T defined by (4.5) with J = 10 and k j as in (b)

4.5.2 Bimodal Cox Processes

Let the structural mean measure λ be a mixture of a bimodal Gaussian distribution (restricted to K = [−16, 16]) and a beta background on the interval [−12, 12], so that mass is added at the centre of K but not near the boundary. In symbols this is given as follows. Let φ be the standard Gaussian density and let β α,β denote the density of the beta distribution with parameters α and β. Then λ is chosen as the measure with density

$$\displaystyle \begin{aligned} f(x) =\frac{1-\epsilon}2[\varphi(x - 8) + \varphi(x + 8)] + \frac{\epsilon}{24} \beta_{1.5,1.5}\left(\frac{x+12}{24}\right) ,\qquad x\in[-16,16],\end{aligned} $$
(4.6)

where 𝜖 ∈ [0, 1] is the weight of the beta background. (We ignore the loss of a negligible amount of mass due to the restriction of the Gaussians to [−16, 16].) Plots of the density and distribution functions are given in Fig. 4.5.
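A minimal Python sketch of (4.6), assuming SciPy for the Gaussian and beta densities; sampling from λ exploits the mixture structure (the clipping to K is a crude stand-in for the negligible restriction):

```python
import numpy as np
from scipy.stats import norm, beta

def f(x, eps):
    """Density (4.6) on K = [-16, 16]."""
    modes = 0.5 * (1 - eps) * (norm.pdf(x - 8) + norm.pdf(x + 8))
    background = (eps / 24) * beta.pdf((x + 12) / 24, 1.5, 1.5)
    return modes + background

def sample_lambda(n, eps, rng):
    """Draw n points from lambda, choosing a mixture component per point."""
    u = rng.uniform(size=n)
    background = 24 * rng.beta(1.5, 1.5, size=n) - 12
    modes = rng.normal(np.where(u < (1 + eps) / 2, -8.0, 8.0))
    return np.clip(np.where(u < eps, background, modes), -16, 16)

rng = np.random.default_rng(4)
# one unwarped pattern with exactly 93 points; a Poisson process would
# instead use a random number of points, n = rng.poisson(93)
pts = sample_lambda(93, 0.1, rng)
```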

Fig. 4.5

Density and distribution functions corresponding to (4.6) with 𝜖 = 0 and 𝜖 = 0.15

The main criterion for the quality of our regularised Fréchet–Wasserstein estimator will be its success in discerning the two modes at ± 8; these will be smeared by the phase variation arising from the warp functions.

We next simulated 30 independent Poisson processes with mean measure λ, 𝜖 = 0.1, and total intensity (expected number of points) τ = 93. In addition, we generated warp functions as in (4.5) but rescaled to [−16, 16]; that is, having the same law as the functions

$$\displaystyle \begin{aligned} x \mapsto 32T\left(\frac{x+16}{32} \right) - 16 \end{aligned}$$

from K to K. These cause rather violent phase variation, as can be seen by the plots of the densities and distribution functions of the conditional measures Λ = T#λ presented in Fig. 4.6a and b; the warped points themselves are displayed in Fig. 4.6c.

Fig. 4.6

(a) 30 warped bimodal densities, with density of λ given by (4.6) in solid black; (b) their corresponding distribution functions, with that of λ in solid black; (c) 30 Cox processes, constructed as warped versions of Poisson processes with mean intensity 93f using as warp functions the rescaling to [−16, 16] of (4.5)

Using these warped point patterns, we construct the regularised Fréchet–Wasserstein estimator employing the procedure described in Sect. 4.3. Each \(\widetilde \varPi _i\) was smoothed with a Gaussian kernel, with bandwidth chosen by unbiased cross validation. We deviate slightly from the recipe presented in Sect. 4.3 by not restricting the resulting estimates to the interval [−16, 16], but this has no essential effect on the finite sample performance. The regularised Fréchet–Wasserstein estimator \(\widehat \lambda _n\) serves as the estimator of the structural mean λ and is shown in Fig. 4.7a. It is contrasted with λ at the level of distribution functions, as well as with the empirical arithmetic mean; the latter, the naive estimator, is calculated by ignoring the warping and simply averaging linearly the (smoothed) empirical distribution functions across the observations. We notice that \(\widehat {\lambda }_n\) is rather successful at locating the two modes of λ, in contrast with the naive estimator, which is more diffuse. In fact, the distribution function of the naive estimator increases approximately linearly, suggesting a nearly constant density instead of the correct bimodal one.

Fig. 4.7

(a) Comparison between the regularised Fréchet–Wasserstein estimator, the empirical arithmetic mean, and the true distribution function, including residual curves centred at y = 3∕4; (b) The estimated warp functions; (c) Kernel estimates of the density function f of the structural mean, based on the warped and registered point patterns

Estimators of the warp maps \(\widehat {T_i}\), depicted in Fig. 4.7b, and their inverses, are defined as the optimal maps between \(\widehat {\lambda }_n\) and the estimated conditional mean measures, as explained in Sect. 4.3.4. Then we register the point patterns by applying to them the inverse estimators \(\widehat {T_i^{-1}}\) (Fig. 4.8). Figure 4.7c gives two kernel estimators of the density of λ constructed from a superposition of all the warped points and all the registered ones. Notice that the estimator that uses the registered points is much more successful than the one using the warped ones in discerning the two density peaks. This is not surprising after a brief look at Fig. 4.8, where the unwarped, warped, and registered points are displayed. Indeed, there is very high concentration of registered points around the true location of the peaks, ± 8. This is not the case for the warped points because of the phase variation that translates the centres of concentration for each individual observation. It is important to remark that the fluctuations in the density estimator in Fig. 4.7c are not related to the registration procedure, and could be reduced by a better choice of bandwidth. Indeed, our procedure does not attempt to estimate the density, but rather the distribution function.

Fig. 4.8

Bimodal Cox processes: (a) the observed warped point processes; (b) the unobserved original point processes; (c) the registered point processes

Figure 4.9 presents a superposition of the regularised Fréchet–Wasserstein estimators for 20 independent replications of the experiment, contrasted with a similar superposition for the naive estimator. The latter is clearly seen to be biased around the two peaks, while the regularised Fréchet–Wasserstein estimator seems approximately unbiased, despite presenting fluctuations. It always captures the bimodal nature of the density, as is seen from the two clear elbows in each realisation.

Fig. 4.9

(a) Sampling variation of the regularised Fréchet–Wasserstein mean \(\widehat \lambda _n\) and the true mean measure λ for 20 independent replications of the experiment; (b) sampling variation of the arithmetic mean, and the true mean measure λ for the same 20 replications; (c) superposition of (a) and (b). For ease of comparison, all three panels include residual curves centred at y = 3∕4

To illustrate the consistency of the regularised Fréchet–Wasserstein estimator \(\widehat \lambda _n\) for λ as shown in Theorem 4.4.1, we let the number of processes n, as well as the expected number of observed points per process τ, vary. Figures 4.10 and 4.11 show the sampling variation of \(\widehat \lambda _n\) for different values of n and τ. We observe that as either of these increases, the realisations \(\widehat \lambda _n\) indeed approach λ. The figures suggest that, in this scenario, the amplitude variation plays a stronger role than the phase variation, as the effect of τ is more substantial.

Fig. 4.10

Sampling variation of the regularised Fréchet–Wasserstein mean \(\widehat \lambda _n\) and the true mean measure λ for 20 independent replications of the experiment, with 𝜖 = 0 and n = 30. Left: τ = 43; middle: τ = 93; right: τ = 143. For ease of comparison, all three panels include residual curves centred at y = 3∕4

Fig. 4.11

Sampling variation of the regularised Fréchet–Wasserstein mean \(\widehat \lambda _n\) and the true mean measure λ for 20 independent replications of the experiment, with 𝜖 = 0 and τ = 93. Left: n = 30; middle: n = 50; right: n = 70. For ease of comparison, all three panels include residual curves centred at y = 3∕4.

4.5.3 Effect of the Smoothing Parameter

In order to work with measures of strictly positive density, the observed point patterns have been smoothed using a kernel function. This necessarily incurs an additional bias that depends on the bandwidth σ i. The asymptotics (Theorem 4.4.1) guarantee the consistency of the estimators, in particular the regularised Fréchet–Wasserstein estimator \(\widehat \lambda _n\), provided that \(\max _{i=1}^n\sigma _i\to 0\). In our simulations, we choose σ i in a data-driven way by employing unbiased cross validation. To gauge the effect of the smoothing, we carry out the same estimation procedure but with σ i multiplied by a parameter s. Figure 4.12 presents the distribution function of \(\widehat \lambda _n\) as a function of s. Interestingly, the curves are nearly identical as long as s ≤ 1, whereas when s > 1, the bias becomes more substantial.

Fig. 4.12

Regularised Fréchet–Wasserstein mean as a function of the smoothing parameter multiplier s, including residual curves. Here, n = 30 and τ = 143

These findings are reaffirmed in Fig. 4.13, which shows the registered point processes, again as a function of s. We see that only minor differences are present as s varies from 0.1 to 1, for example, in the grey (8), black (17), and green (19) processes. When s = 3, the distortion becomes considerably more substantial. This phenomenon repeats itself across all combinations of n, τ, and s tested.

Fig. 4.13

Registered point processes as a function of the smoothing parameter multiplier s. Left: s = 0.1; middle: s = 1; right: s = 3. Here, n = 30 and τ = 43

4.6 Convergence Rates and a Central Limit Theorem on the Real Line

Since the conditional mean measures Λ i are discretely observed, the rate of convergence of our estimators will be affected by the rate at which the number of observed points per process \(N_i^{(n)}\) increases to infinity. The latter is controlled by the next lemma, which is valid for any complete separable metric space \(\mathcal X\).

Lemma 4.6.1 (Number of Points Grows Linearly)

Let \(N_i^{(n)}=\varPi _i^{(n)}(\mathcal X)\) denote the total number of observed points. If \(\tau _n/\log n\to \infty \), then there exists a constant C Π > 0, depending only on the distribution of Π, such that almost surely

$$\displaystyle \begin{aligned} \liminf_{n\to\infty} \frac{\min_{1\le i\le n}N_i^{(n)}}{\tau_n} \ge C_\varPi. \end{aligned}$$

In particular, for large n there are no empty point processes, so the normalisation is well-defined. If Π is a Poisson process, then we have the more precise result

$$\displaystyle \begin{aligned} \lim_{n\to\infty} \frac{\min_{1\le i\le n}N_i^{(n)}}{\tau_n} =1 \qquad \text{almost surely}. \end{aligned}$$

Remark 4.6.2

One can also show that the limit superior of the same quantity is bounded by a constant \(C_\varPi ^{\prime }\). If \(\tau _n/\log n\) is bounded below, then the same result holds but with worse constants. If only τ n →∞, then the result holds for each i separately but in probability.

The proof is a simple application of Chernoff bounds; see page 108 in the supplement.
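The Poisson statement is easy to probe by simulation; a minimal Python sketch (assuming \(N_i^{(n)}\sim \mathrm{Poisson}(\tau _n)\), independent over i, and taking τ n = n for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
for n in [10, 100, 1000, 10000]:
    tau_n = n                  # any tau_n with tau_n / log(n) -> infinity works
    N = rng.poisson(tau_n, size=n)
    print(n, N.min() / tau_n)  # approaches 1 as n grows
```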

With Lemma 4.6.1 under our belt, we can replace terms of the order \(\min _i N_i^{(n)}\) by the more transparent order τ n. As in the consistency proof, the idea is to write

$$\displaystyle \begin{aligned} F - \widehat F_n =(F - F_n) +(F_n - \tilde F_n) +( \tilde F_n - \widehat F_n ) \end{aligned}$$

and control each term separately. The first term corresponds to the phase variation, and comes from the approximation of the theoretical expectation F by a sample mean F n. The second term is associated with the amplitude variation resulting from observing Λ i discretely. The third term can be viewed as the bias incurred by the smoothing procedure. Accordingly, the rate at which \(\widehat \lambda _n\) converges to λ is a sum of three separate terms. We recall the standard \(O_{\mathbb {P}}\) terminology: if X n and Y n are random variables, then \(X_n=O_{\mathbb {P}}(Y_n)\) means that the sequence (X n∕Y n) is bounded in probability, which by definition is the condition

$$\displaystyle \begin{aligned} \forall\epsilon>0\ \exists M:\quad \sup_n {\mathbb{P}}\left(\left| \frac{X_n}{Y_n}\right| > M\right) < \epsilon. \end{aligned}$$

Instead of \(X_n=O_{\mathbb {P}}(Y_n)\), we will sometimes write \(Y_n\ge O_{\mathbb {P}}(X_n)\). The former notation emphasises the condition that X n grows no faster than Y n, while the latter stresses that Y n grows at least as fast as X n (which is of course the same assertion). Finally, \(X_n=o_{\mathbb {P}}(Y_n)\) means that X n∕Y n → 0 in probability.

Theorem 4.6.3 (Convergence Rates on \(\mathbb {R}\))

Suppose in addition to Assumptions 3 that d = 1, \(\tau _n/\log n\to \infty \) and that Π is either a Poisson process or a binomial process. Then

$$\displaystyle \begin{aligned} W_2(\widehat\lambda_n,\lambda) \le O_{{\mathbb{P}}}\left(\frac 1{\sqrt{n}}\right) + O_{{\mathbb{P}}}\left(\frac 1{\sqrt[4]{\tau_n}}\right) + O_{{\mathbb{P}}}\left(\sigma_n\right), \qquad \sigma_n= \frac 1n\sum_{i=1}^n\sigma_i^{(n)}, \end{aligned}$$

where all the constants in the \(O_{\mathbb {P}}\) terms are explicit.

Remark 4.6.4

Unlike classical density estimation, no assumptions on the rate of decay of σ n are required, because we only need to estimate the distribution function and not the derivative. If the smoothing parameter is chosen to be \(\sigma _i^{(n)}=[N_i^{(n)}]^{-\alpha }\) for some α > 0 and \(\tau _n/\log n\to \infty \), then by Lemma 4.6.1 \(\sigma _n\le \max _{1\le i\le n}\sigma _i^{(n)}=O_{\mathbb {P}}(\tau _n^{-\alpha })\). For example, if Rosenblatt’s rule α = 1∕5 is employed, then the \(O_{\mathbb {P}}(\sigma _n)\) term can be replaced by \(O_{\mathbb {P}}(1/\sqrt [5]{\tau _n})\).

One can think of the parameter τ as separating the sparse and dense regimes as in classical functional data analysis (see also Wu et al. [132]). If τ is bounded, then the setting is ultra sparse and consistency cannot be achieved. A sparse regime can be defined as the case where τ n →∞ but slower than \(\log n\). In that case, consistency is guaranteed, but some point patterns will be empty. The dense regime can be defined as τ n ≫ n 2, in which case the amplitude variation is negligible asymptotically when compared with the phase variation.

The exponent − 1∕4 of τ n can be shown to be optimal without further assumptions, but it can be improved to − 1∕2 if \({\mathbb {P}}(f_\varLambda \ge \epsilon \mbox{ on }K)=1\) for some 𝜖 > 0, where f Λ is the density of Λ (see Sect. 4.7). In terms of T, the condition is that \({\mathbb {P}}(T'\ge \epsilon )=1\) for some 𝜖 and λ has a density bounded below. When this is the case, τ n needs to be compared with n rather than n 2 in the next paragraph and the next theorem.

Theorem 4.6.3 provides conditions for the optimal parametric rate \(\sqrt {n}\) to be achieved: this happens if we set σ n to be of the order \(O_{\mathbb {P}}(n^{-1/2})\) or less and if τ n is of the order n 2 or more. Moreover, if the last two terms in Theorem 4.6.3 are negligible with respect to n −1∕2, then a sort of central limit theorem holds for \(\widehat \lambda _n\):

Theorem 4.6.5 (Asymptotic Normality)

In addition to the conditions of Theorem 4.6.3, assume that τ n∕n 2 →∞, \(\sigma _n=o_{{\mathbb {P}}}(n^{-1/2})\) and λ possesses an invertible distribution function F λ on K. Then

$$\displaystyle \begin{aligned} \sqrt{n}\left({\mathbf t_{\lambda}^{{\widehat\lambda_n}}} -\mathbf i\right) {\longrightarrow} Z \quad \text{weakly in } L_2(\lambda), \end{aligned}$$

for a zero-mean Gaussian process Z with the same covariance operator as T (the latter viewed as a random element in L 2(λ)), namely with covariance kernel

$$\displaystyle \begin{aligned} \kappa(x,y) ={\mathrm{cov}}\Big\{T(x),T(y)\Big\}. \end{aligned}$$

If the density f λ exists and is (piecewise) continuous and bounded below on K, then the weak convergence also holds in L 2(K).

In view of Sect. 2.3, Theorem 4.6.5 can be interpreted as asymptotic normality of \(\widehat \lambda _n\) in the tangential sense: \(\sqrt {n} \log _\lambda (\widehat \lambda _n)\) converges to a Gaussian random element in the tangent space Tanλ, which is a subset of L 2(λ). The additional smoothness conditions allow one to switch to the space L 2(K), which is independent of the unknown template measure λ.

See pages 109 and 110 in the supplement for detailed proofs of these theorems. Below we sketch the main ideas only.

Proof(Proof of Theorem 4.6.3)

The quantile formula \(W_2(\gamma ,\theta )=\|F^{-1}_\theta - F^{-1}_\gamma \|{ }_{L^2(0,1)}\) from Sect. 1.5 and the average quantile formula for the Fréchet mean (Sect. 3.1.4) show that the oracle empirical mean \(F_{\lambda _n}^{-1}\) follows a central limit theorem in L 2(0, 1). Since we work in the Hilbert space L 2(0, 1), Fréchet means are simple averages, so the errors in the Fréchet mean have the same rate as the errors in the Fréchet functionals. The smoothing term is easily controlled by Lemma 4.4.2.

Controlling the amplitude term is more difficult. Bounds can be given using the machinery sketched in Sect. 4.7, but we give a more elementary proof by reducing to the 1-Wasserstein case (using (2.2)), which can be more easily handled in terms of distribution functions (Corollary 1.5.3).
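The quantile formula invoked above is also how W 2 is computed in practice on the line; a minimal Python sketch (grid approximation of the L 2(0, 1) norm, with Gaussian test samples chosen for illustration):

```python
import numpy as np

def w2_1d(sample_a, sample_b, grid_size=1000):
    """W_2 between two empirical measures on R via quantile functions."""
    u = (np.arange(grid_size) + 0.5) / grid_size   # grid on (0, 1)
    qa = np.quantile(sample_a, u)
    qb = np.quantile(sample_b, u)
    return np.sqrt(np.mean((qa - qb) ** 2))        # L^2(0,1) norm of the gap

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, 10000)
y = rng.normal(3.0, 1.0, 10000)
print(w2_1d(x, y))   # approximately 3: W_2(N(0,1), N(3,1)) = 3
```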

Proof(Proof of Theorem 4.6.5) The hypotheses guarantee that the amplitude and smoothing errors are negligible and

$$\displaystyle \begin{aligned} \sqrt{n}\left(F^{-1}_{\widehat\lambda_n}-F^{-1}_\lambda\right) \to GP \quad \text{weakly in } L_2(0,1), \end{aligned}$$

where GP is the Gaussian process defined in the proof of Theorem 4.6.3. One then employs a composition with F λ.

4.7 Convergence of the Empirical Measure and Optimality

One may find the term \(O_{\mathbb {P}}(1/\sqrt [4]{\tau _n})\) in Theorem 4.6.3 to be somewhat surprising, and expect that it ought to be \(O_{\mathbb {P}}(1/\sqrt {\tau _n})\). The goal of this section is to show why the rate \(1/\sqrt [4]{\tau _n}\) cannot be improved without further assumptions, and to discuss conditions under which the rate \(1/\sqrt {\tau _n}\) is attained. For simplicity, we concentrate on the case τ n = n and assume that the point process Π is binomial, the Poisson case being easily obtained from the simplified one (using Lemma 4.6.1). We are thus led to study rates of convergence of empirical measures in the Wasserstein space. That is to say, for a fixed exponent p ≥ 1 and a fixed measure \(\mu \in \mathcal W_p(\mathcal X)\), we consider independent random variables X 1, … with law μ and the empirical measure \(\mu _n=n^{-1}\sum _{i=1}^n\delta \{X_i\}\). The first observation is that \(\mathbb {E} W_p(\mu ,\mu _n)\to 0\):

Lemma 4.7.1

Let \(\mu \in P(\mathcal X)\) be any measure. Then

$$\displaystyle \begin{aligned} \mathbb{E} W_p(\mu,\mu_n) \begin{cases} =\infty & \mu\notin \mathcal W_p(\mathcal X)\\ \to0 & \mu\in \mathcal W_p(\mathcal X). \end{cases} \end{aligned}$$

Proof

This result has been established in an almost sure sense in Proposition 2.2.6. To extend it to convergence in expectation, observe that

$$\displaystyle \begin{aligned} W_p^p(\mu,\mu_n) \le {\int_{\mathcal X^2} \! \|x-y\|{}^p \, \mathrm{d}\mu\otimes\mu_n(x,y)} =\frac 1n \sum_{i=1}^n {\int_{\mathcal X} \! \|x-X_i\|{}^p \, \mathrm{d}\mu(x)}. \end{aligned}$$

Thus, the random variable \(0\le Y_n=W_p^p(\mu ,\mu _n)\) is bounded by the sample average Z n of a random variable \(V={\int _{\mathcal X} \! \|x-X_1\|{ }^p \, \mathrm {d}\mu (x)}\) that has a finite expectation. A version of the dominated convergence theorem (given on page 111 in the supplement) implies that \(\mathbb {E} Y_n\to 0\). Now invoke Jensen’s inequality.

Remark 4.7.2

The sequence \(\mathbb {E} W_p(\mu ,\mu _n)\) is not monotone, as the simple example μ = (δ 0 + δ 1)∕2 shows (see page 111 in the supplement).

The next question is how quickly \(\mathbb {E} W_p(\mu ,\mu _n)\) vanishes when \(\mu \in \mathcal W_p(\mathcal X)\). We shall begin with two simple general lower bounds, then discuss upper bounds in the one-dimensional case, put them in the context of Theorem 4.6.3, and finally briefly touch on the d-dimensional case.

Lemma 4.7.3 (\(\sqrt {n}\) Lower Bound)

Let \(\mu \in P(\mathcal X)\) be nondegenerate. Then there exists a constant c(μ) > 0 such that for all p ≥ 1 and all n

$$\displaystyle \begin{aligned} \mathbb{E} W_p(\mu_n,\mu) \ge \frac {c(\mu)}{\sqrt{n}}. \end{aligned}$$

Proof

Let X ∼ μ and let a ≠ b be two points in the support of μ. Consider \(f(x)=\min (1,\|x-a\|)\), a bounded 1-Lipschitz function such that f(a) = 0 < f(b). Then

$$\displaystyle \begin{aligned} \sqrt{n} \mathbb{E} W_p(\mu_n,\mu) \ge\sqrt{n} \mathbb{E} W_1(\mu_n,\mu) \ge \mathbb{E} \left|n^{-1/2}\sum_{i=1}^n \big(f(X_i) - \mathbb{E} f(X)\big)\right| \to \sqrt{\frac{2{\mathrm{var}} f(X)}{\pi}} >0 \end{aligned}$$

by the central limit theorem and the Kantorovich–Rubinstein theorem (1.11).

For discrete measures, the rates scale badly with p. More generally:

Lemma 4.7.4 (Separated Support)

Suppose that there exist Borel sets \(A,B\subset \mathcal X\) such that μ(A ∪ B) = 1,

$$\displaystyle \begin{aligned} \mu(A)\mu(B)>0 \qquad \text{and} \qquad d_{\min}=\inf_{x\in A,y\in B}\|x-y\| >0. \end{aligned}$$

Then for any p ≥ 1 there exists c p(μ) > 0 such that \(\mathbb {E} W_p(\mu _n,\mu )\ge c_p(\mu )n^{-1/(2p)}\).

Any nondegenerate finitely discrete measure μ satisfies this condition, and so do “non-pathological” countably discrete ones. (An example of a “pathological” measure is one assigning positive mass to every rational number.)

Proof

Let k ∼ B(n, q = μ(A)) denote the number of points from the sample (X 1, …, X n) that fall in A. Then a mass of |k∕n − q| must travel between A and B, a distance of at least \(d_{\min }\). Thus, \(W_p^p(\mu _n,\mu )\ge d_{\min }^p|k/n-q|\), and the result follows from the central limit theorem for k; see page 112 in the supplement for the full details.

These lower bounds are valid on any separable metric space. On the real line, it is easy to obtain a sufficient condition for the optimal rate n −1∕2 to be attained for W 1: since nF n(t) ∼ B(n, F(t)), the empirical distribution function F n(t) has variance F(t)(1 − F(t))∕n, and we have (by Fubini’s theorem and Jensen’s inequality)

$$\displaystyle \begin{aligned} \mathbb{E} W_1(\mu_n,\mu) = {{\int_{\mathbb{R}} \! {\mathbb{E} |F_n(t) - F(t)|} \, \mathrm{d}t}} \le n^{-1/2} {{\int_{\mathbb{R}} \! {\sqrt{F(t) (1-F(t))}} \, \mathrm{d}t}} , \end{aligned}$$

so that W 1(μ n, μ) is of the optimal order n −1∕2 if

$$\displaystyle \begin{aligned} J_1(\mu) :={{\int_{\mathbb{R}} \! {\sqrt{F(t) (1-F(t))}} \, \mathrm{d}t}} <\infty. \end{aligned}$$

Since the integrand is bounded by 1∕2, this is certainly satisfied if μ is compactly supported. The J 1 condition is essentially a moment condition, since for any δ > 0, we have for X ∼ μ that \(\mathbb {E}|X|{ }^{2+\delta }<\infty \Longrightarrow J_1(\mu )<\infty \Longrightarrow \mathbb {E}|X|{ }^2<\infty \). It turns out that this condition is necessary, and has a more subtle counterpart for any p ≥ 1. Let f denote the density of the absolutely continuous part of μ (so f ≡ 0 if μ is discrete).
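The n −1∕2 rate is easy to see empirically; a minimal Python sketch for μ uniform on [0, 1] (for which J 1(μ) = π∕8 ≈ 0.39), using the identity \(W_1(\mu _n,\mu )={\int \! |F_n(t) - F(t)| \, \mathrm {d}t}\) from Corollary 1.5.3:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 2001)

def w1_uniform(n, rng):
    """W_1(mu_n, mu) for mu uniform on [0,1]: L^1 gap between F_n and F."""
    x = np.sort(rng.uniform(size=n))
    Fn = np.searchsorted(x, t, side="right") / n
    return np.mean(np.abs(Fn - t))   # grid approximation of the integral

for n in [100, 1000, 10000]:
    w1 = np.mean([w1_uniform(n, rng) for _ in range(200)])
    print(n, np.sqrt(n) * w1)        # stabilises near 0.31, below J_1 = pi/8
```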

Theorem 4.7.5 (Rate of Convergence of Empirical Measures)

Let p ≥ 1 and \(\mu \in \mathcal W_p(\mathbb {R})\). The condition

$$\displaystyle \begin{aligned} J_p(\mu) ={{\int_{{\mathbb{R}}} \! {\frac{[F(t) (1 - F(t))]^{p/2}} {[f(t)]^{p-1}}} \, \mathrm{d}t}} <\infty ,\qquad (0^0=1) \end{aligned}$$

is necessary and sufficient for \(\mathbb {E} W_p(\mu _n,\mu )=O(n^{-1/2})\).

See Bobkov and Ledoux [25, Theorem 5.10] for a proof of the J p condition, and Theorems 5.1 and 5.3 therein for the values of the constants and a stronger result.

When p > 1, for J p(μ) to be finite, the support of μ must be connected; this is not needed when p = 1. Moreover, the J p condition is satisfied when f is bounded below (in which case the support of μ must be compact). However, smoothness alone does not suffice, even for measures with positive density on a compact support. More precisely, we have:

Proposition 4.7.6

For any rate 𝜖 n → 0 there exists a measure μ on [−1, 1] with positive C ∞ density there, and such that for all n

$$\displaystyle \begin{aligned} \mathbb{E} W_p(\mu_n,\mu) \ge C(p,\mu) n^{-1/(2p)} \epsilon_n. \end{aligned}$$

The rate n −1∕(2p) from Lemma 4.7.4 is the worst among compactly supported measures on \(\mathbb {R}\). Indeed, by Jensen’s inequality and (2.2), for any μ ∈ P([0, 1]),

$$\displaystyle \begin{aligned} \mathbb{E} W_p(\mu_n,\mu) \le \left[\mathbb{E} W_p^p(\mu_n,\mu)\right]^{1/p} \le [\mathbb{E} W_1(\mu_n,\mu)]^{1/p} \le n^{-1/(2p)}. \end{aligned}$$

The proof of Proposition 4.7.6 is done by “smoothing” the construction in Lemma 4.7.4, and is given on page 113 in the supplement.

Let us now put this in the context of Theorem 4.6.3. In the binomial case, since, conditionally on Λ i, \(\widetilde \varLambda _i\) is the empirical measure of τ n independent points with law Λ i, we have

$$\displaystyle \begin{aligned} \mathbb{E}\left[ W_2(\varLambda_i,\widetilde\varLambda_i) \,\Big|\, \varLambda_i\right] \le \sqrt{2J_2(\varLambda_i)} \frac 1{\sqrt{\tau_n}}. \end{aligned}$$

(In the Poisson case, we need to condition on \(N_i^{(n)}\) and then estimate its inverse square root as is done in the proof of Theorem 4.6.3.) Therefore, a sufficient condition for the rate \(1/\sqrt {\tau _n}\) to hold is that \(\mathbb {E} \sqrt {J_2(\varLambda )}<\infty \), and a necessary condition is that \({\mathbb {P}}(\sqrt {J_2(\varLambda )}<\infty )=1\). These hold if there exists δ > 0 such that with probability one Λ has a density bounded below by δ. Since Λ = T#λ, this will happen provided that λ itself has a density bounded below and T has a derivative bounded below. Bigot et al. [23] show that the rate \(1/\sqrt {\tau _n}\) cannot be improved.

We conclude by proving a lower bound for absolutely continuous measures and stating, without proof, an upper bound.

Proposition 4.7.7

Let \(\mu \in \mathcal W_1(\mathbb {R}^d)\) have an absolutely continuous part with respect to Lebesgue measure, and let ν n be any discrete measure supported on n points (or fewer). Then there exists a constant C(μ) > 0 such that

$$\displaystyle \begin{aligned} W_p(\mu,\nu_n) \ge W_1(\mu,\nu_n) \ge C(\mu) n^{-1/d}. \end{aligned}$$

Proof

Let f be the density of the absolutely continuous part μ c, and observe that for some finite number M,

$$\displaystyle \begin{aligned} 2\delta =\mu_c(\{x:f(x)\le M\}) >0. \end{aligned}$$

Let x 1, …, x n be the support points of ν n and 𝜖 > 0. Let μ c,M be the restriction of μ c to the set where the density is smaller than M. The union of balls B 𝜖(x i) has μ c,M-measure of at most

$$\displaystyle \begin{aligned} M\sum_{i=1}^n\mathrm{Leb}(B_\epsilon(x_i)) =Mn\epsilon^d \mathrm{Leb}_d(B_1(0)) =Mn\epsilon^d C_d =\delta, \end{aligned}$$

if 𝜖 d = δ(nMC d)−1. Thus, a mass 2δ − δ = δ must travel more than 𝜖 from ν n to μ in order to cover μ c,M. Hence

$$\displaystyle \begin{aligned} W_1(\nu_n,\mu) \ge \delta \epsilon =\delta (\delta/MC_d)^{1/d} \ n^{-1/d}. \end{aligned}$$

The lower bound holds because we need of the order of 𝜖 −d balls of radius 𝜖 in order to cover a sufficiently large fraction of the mass of μ. The determining quantity for upper bounds on the empirical Wasserstein distance is the covering number

$$\displaystyle \begin{aligned} N(\mu,\epsilon,\tau) =\text{minimal number of balls of radius } \epsilon \text{ whose union has } \mu\text{-mass} \ge1-\tau. \end{aligned}$$

Since μ is tight, these are finite for all 𝜖, τ > 0, and they increase as 𝜖 and τ approach zero. To put the following bound in context, notice that if μ is compactly supported on \(\mathbb {R}^d\), then N(μ, 𝜖, 0) ≤ K𝜖 −d.

Theorem 4.7.8

If for some d > 2p, N(μ, 𝜖, 𝜖 dp∕(d−2p)) ≤ K𝜖 −d, then \(\mathbb {E} W_p(\mu_n,\mu){\le } C_pn^{{-}1/d}\).

Comparing this with the lower bound in Lemma 4.7.4, we see that in the high-dimensional regime d > 2p, absolutely continuous measures have a worse rate than discrete ones. In the low-dimensional regime d < 2p, the situation is reversed. We also obtain that for d > 2 and a compactly supported absolutely continuous \(\mu \in \mathcal W_1(\mathbb {R}^d)\), \(\mathbb {E} W_1(\mu _n,\mu )\sim n^{-1/d}\).

4.8 Bibliographical Notes

Our exposition in this chapter closely follows the papers Panaretos and Zemel [100] and Zemel and Panaretos [134].

Books on functional data analysis include Ramsay and Silverman [109, 110], Ferraty and Vieu [51], Horváth and Kokoszka [70], and Hsing and Eubank [71], and a recent review is also available (Wang et al. [127]). The specific topic of amplitude and phase variation is discussed in [110, Chapter 7] and [127, Section 5.2]. The next paragraph gives some selective references.

One of the first functional registration techniques employed dynamic programming (Wang and Gasser [128]) and dates back to Sakoe and Chiba [118]. Landmark registration consists of identifying salient features for each curve, called landmarks, and aligning them (Gasser and Kneip [61]; Gervini and Gasser [63]). In pairwise synchronisation (Tang and Müller [122]) one aligns each pair of curves and then derives an estimator of the warp functions by linear averaging of the pairwise registration maps. Another class of methods involves a template curve, to which each observation is registered, minimising a discrepancy criterion; the template is then iteratively updated (Wang and Gasser [129]; Ramsay and Li [108]). James [72] defines a “feature function” for each curve and uses the moments of the feature function to guarantee identifiability. Elastic registration employs the Fisher–Rao metric, which is invariant to warpings, and calculates averages in the resulting quotient space (Tucker et al. [123]). Other techniques include semiparametric modelling (Rønn [115]; Gervini and Gasser [64]) and principal components registration (Kneip and Ramsay [82]). More details can be found in the review article by Marron et al. [90]. Wrobel et al. [131] have recently developed a registration method for functional data with a discrete flavour. It is noteworthy that a version of the Wasserstein metric can also be used in the functional case (Chakraborty and Panaretos [34]).

The literature on the point process case is scarcer; see the review by Wu and Srivastava [133].

A parametric version of Theorem 4.2.4 was first established by Bigot and Klein [22, Theorem 5.1] in \(\mathbb {R}^d\), and extended to a compact nonparametric formulation in Zemel and Panaretos [134]. There is an infinite-dimensional linear version in Masarotto et al. [91]. The current level of generality appears to be new.

Theorem 4.4.1 is a stronger version of Panaretos and Zemel [100, Theorem 1], where it was assumed that τ n must diverge to infinity faster than \(\log n\). An analogous construction under the Bayesian paradigm can be found in Galasso et al. [58]. Optimality of the rates of convergence in Theorem 4.6.3 is discussed in detail by Bigot et al. [23], where finiteness of the functional J 2 (see Sect. 4.7) is assumed and consequently \(O_{\mathbb {P}}(\tau _n^{-1/4})\) is improved to \(O_{\mathbb {P}}(\tau _n^{-1/2})\).

As far as we know, Theorem 4.6.5 (taken from [100]) is the first central limit theorem for Fréchet means in Wasserstein space. When the measures Λ i are observed exactly (no amplitude variation: τ n = ∞ and σ = 0), Kroshnin et al. [84] have recently proven a central limit theorem for random Gaussian measures in arbitrary dimension, extending a previous result of Agueh and Carlier [3]. It seems likely that in a fully nonparametric setting, the rates of convergence (compare Theorem 4.6.3) might be slower than \(\sqrt {n}\); see Ahidar-Coutrix et al. [4].

The magnitude of the amplitude variation in Theorem 4.6.3 pertains to the rates of convergence of \(\mathbb {E} W_p(\mu _n,\mu )\) to zero (Sect. 4.7). This is a topic of intense research, dating back to the seminal paper by Dudley [46], where a version of Theorem 4.7.8 with p = 1 is shown for the bounded Lipschitz metric. The lower bounds proven in this section were adapted from [46], Fournier and Guillin [54], and Weed and Bach [130].

The version of Theorem 4.7.8 given here can be found in [130] and extends Boissard and Le Gouic [27]. Both papers [27, 130] work in a general setting of complete separable metric spaces. An additional \(\log n\) term appears in the limiting case d = 2p, as already noted (for p = 1) by [46] and in the classical work of Ajtai et al. [5] for μ uniform on [0, 1]2. More general results are available in [54]. A longer (but far from complete) bibliography is given in the recent review by Panaretos and Zemel [101, Subsection 3.3.1], including works by Barthe, Dobrić, Talagrand, and coauthors on almost sure results and deviation bounds for the empirical Wasserstein distance.

The J 1 condition is due to del Barrio et al. [43], who showed it to be necessary and sufficient for the empirical process \(\sqrt {n}(F_n - F)\) to converge in distribution to \(\mathbb {B}\circ F\), with \(\mathbb B\) a Brownian bridge. The extension to 1 ≤ p ≤ ∞ (and a lot more) can be found in Bobkov and Ledoux [25], employing order statistics and beta distributions to reduce to the uniform case. Alternatively, one may consult Mason [92], who uses weighted approximations to Brownian bridges.

An important aspect that was not covered here is that of statistical inference of the Wasserstein distance on the basis of the empirical measure. This is a challenging question and results by del Barrio, Munk, and coauthors are available for one-dimensional, elliptical, or discrete measures, as explained in [101, Section 3].