
1 Introduction

Many results in statistical learning theory study the problem of estimating the probability that a hypothesis chosen from a given hypothesis class achieves a small true risk. This probability is often expressed in the form of generalization bounds on the true risk, obtained using concentration inequalities with respect to (w.r.t.) some hypothesis class. Classic generalization bounds assume that training and test data follow the same distribution. This assumption, however, is violated in many real-world applications (e.g., in computer vision, language processing or speech recognition) where training and test data actually follow related but different probability distributions. Consider, for example, a spam filter learned from the abundant annotated data collected for one user and then applied to a newly registered user with different preferences. In this case, the performance of the spam filter deteriorates because it does not take into account the mismatch between the underlying probability distributions. The need for algorithms tackling this problem has led to the emergence of a new field in machine learning called domain adaptation (DA), a subfield of transfer learning [18], in which the source (training) and target (test) distributions are not assumed to be the same. From a theoretical point of view, existing generalization guarantees for DA are expressed in the form of bounds over the target risk involving the source risk, a divergence between domains and a term \(\lambda \) evaluating the capability of the considered hypothesis class to solve the problem, often expressed as the joint error of the ideal hypothesis between the two domains. In this context, minimizing the divergence between distributions is a key factor for the potential success of DA algorithms. Among the most striking results, the generalization bounds based on the H-divergence [3] or the discrepancy distance [15] have the appealing property of measuring the divergence between the probability distributions of the two domains w.r.t. the considered hypothesis class.

Despite their advantages, the above-mentioned divergences do not directly take into account the geometry of the data distribution. Recently, [6, 7] proposed to tackle this drawback by solving the DA problem using ideas from optimal transportation (OT) theory. Their work presents an algorithm that reduces the divergence between two domains by minimizing the Wasserstein distance between their distributions. This idea has a very appealing and intuitive interpretation based on the transport of one domain onto another. The transportation plan solving the OT problem takes into account the geometry of the data through an associated cost function based on the Euclidean distance between examples, and it is naturally defined as an infimum over all feasible couplings. An interesting property of this approach is that the resulting solution, given by a joint probability distribution, allows one to project the instances of one domain onto another directly, without being restricted to a particular hypothesis class. This independence from the hypothesis class means that the solution not only ensures successful adaptation but also influences the capability term \(\lambda \). While showing very promising experimental results, this approach, however, has no theoretical guarantees. This paper aims to bridge this gap by presenting contributions covering three DA settings: (i) classic unsupervised DA, where the learner has access only to labeled source data and unlabeled target instances; (ii) DA where one has access to labeled data from both source and target domains; (iii) multi-source DA, where labeled instances from a set of distinct source domains (more than two) are available. We provide new theoretical guarantees in the form of generalization bounds for these three settings based on the Wasserstein distance, thus justifying its use in DA. According to [26], the Wasserstein distance is rather strong and can be combined with smoothness bounds to obtain convergence in other distances. This important advantage of the Wasserstein distance leads to tighter bounds in comparison to other state-of-the-art results, while the distance itself is more computationally attractive.

The rest of this paper is organized as follows: Sect. 2 presents optimal transport and its application to DA. In Sect. 3, we derive generalization bounds for DA with the Wasserstein distance in the single-source scenario, and Sect. 4 extends them to the multi-source setting. Sect. 5 compares our bounds to existing results, and we conclude in Sect. 6.

2 Definitions and Notations

In this section, we first present the formalization of the Monge-Kantorovich [13] optimization problem and then show how the optimal transportation problem found its application in DA.

2.1 Optimal Transport

Optimal transportation theory was first introduced in [17] to study the problem of resource allocation. Assuming that we have a set of factories and a set of mines, the goal of optimal transportation is to move the ore from the mines to the factories in an optimal way, i.e., by minimizing the overall transport cost. More formally, let \(\varOmega \subseteq \mathbb {R}^d\) be a measurable space and denote by \(\mathcal {P}\left( \varOmega \right) \) the set of all probability measures over \(\varOmega \). Given two probability measures \(\mu _S, \mu _T \in \mathcal {P}\left( \varOmega \right) \), the Monge-Kantorovich problem consists in finding a probabilistic coupling \(\gamma \), defined as a joint probability measure over \(\varOmega \times \varOmega \) with marginals \(\mu _S\) and \(\mu _T\), that minimizes the cost of transport w.r.t. some function \(c: \varOmega \times \varOmega \rightarrow \mathbb {R}_+\):

$$\begin{aligned}&\underset{\gamma }{\arg \min } \int _{\varOmega \times \varOmega } c(\varvec{x},\varvec{y})^p \, d\gamma (\varvec{x},\varvec{y})\\&\text {s.t.} \ \varvec{P}^{1}\# \gamma = \mu _S, \ \varvec{P}^{2}\# \gamma = \mu _T, \end{aligned}$$

where \(\varvec{P}^{i}\) is the projection onto the i-th factor of \(\varOmega \times \varOmega \) and \(\#\) denotes the pushforward measure. This problem admits a solution \(\gamma _0\) which allows us to define the Wasserstein distance of order p between \(\mu _S\) and \(\mu _T\) for any \(p \in [1, +\infty )\) as follows:

$$W_p^p(\mu _S,\mu _T) = \inf _{\gamma \in \varPi (\mu _S, \mu _T)} \int _{\varOmega \times \varOmega } c(\varvec{x},\varvec{y})^pd\gamma (\varvec{x},\varvec{y}),$$

where \(c: \varOmega \times \varOmega \rightarrow \mathbb {R}_+\) is a cost function giving the cost of transporting one unit of mass from \(\varvec{x}\) to \(\varvec{y}\), and \(\varPi (\mu _S, \mu _T)\) is the collection of all joint probability measures on \(\varOmega \times \varOmega \) with marginals \(\mu _S\) and \(\mu _T\).

Remark 1

In what follows, we consider only the case \(p=1\), but all the obtained results can easily be extended to the case \(p>1\) using Hölder's inequality, which implies that \(W_p \le W_q\) for every \(p\le q\).
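For completeness, the inequality behind this remark follows from Jensen's inequality (a special case of Hölder): for any coupling \(\gamma \in \varPi (\mu _S, \mu _T)\) and \(p \le q\),

$$\left( \int _{\varOmega \times \varOmega } c(\varvec{x},\varvec{y})^p \, d\gamma (\varvec{x},\varvec{y})\right) ^{1/p} \le \left( \int _{\varOmega \times \varOmega } c(\varvec{x},\varvec{y})^q \, d\gamma (\varvec{x},\varvec{y})\right) ^{1/q},$$

so that taking \(\gamma \) optimal for \(W_q\) and then the infimum on the left-hand side yields \(W_p(\mu _S,\mu _T) \le W_q(\mu _S,\mu _T)\).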

In the discrete case, when one deals with empirical measures \(\hat{\mu }_S = \frac{1}{N_S}\sum _{i=1}^{N_S}\delta _{x_S^i}\) and \(\hat{\mu }_T = \frac{1}{N_T}\sum _{i=1}^{N_T}\delta _{x_T^i}\), represented by uniformly weighted sums of \(N_S\) and \(N_T\) Diracs with mass at locations \(x_S^i\) and \(x_T^i\) respectively, the Monge-Kantorovich problem is expressed in terms of the inner product between the coupling matrix \(\gamma \) and the cost matrix C:

$$ W_1(\hat{\mu }_S, \hat{\mu }_T) = \min _{\gamma \in \varPi (\hat{\mu }_S, \hat{\mu }_T)}\langle C, \gamma \rangle _F $$

where \(\langle \cdot , \cdot \rangle _F\) is the Frobenius dot product, \(\varPi (\hat{\mu }_S, \hat{\mu }_T) = \lbrace \gamma \in \mathbb {R}^{N_S \times N_T}_+ \vert \gamma \varvec{1} = \hat{\mu }_S, \gamma ^T \varvec{1} = \hat{\mu }_T\rbrace \) is the polytope of admissible couplings, and C is a dissimilarity matrix, i.e., \(C_{ij} = c(x_S^i,x_T^j)\), defining the energy needed to move a probability mass from \(x_S^i\) to \(x_T^j\). Figure 1 shows what the solution of optimal transport between two point clouds can look like.

Fig. 1. Blue points are generated to lie inside a square with a side length equal to 1; red points are generated inside an annulus containing the square. The solution of the regularized optimal transport problem is visualized by plotting dashed and solid lines that correspond to the large and small values given by the optimal coupling matrix \(\gamma \). (Color figure online)

The Wasserstein distance has been successfully used in various applications, for instance computer vision [22], texture analysis [21], tomographic reconstruction [12] and clustering [9]. The huge success of algorithms based on this distance is largely due to [8], who introduced an entropy-regularized version of optimal transport that can be optimized efficiently using a matrix scaling algorithm. We now present the application of OT to DA.
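For illustration, the discrete problem above and its entropy-regularized variant of [8] can be sketched in a few lines of numpy; the toy point clouds below mimic the square and annulus of Fig. 1, and the regularization strength and Euclidean cost are illustrative choices rather than the experimental setup of the cited works.

```python
import numpy as np

def sinkhorn_coupling(a, b, C, reg=0.05, n_iter=1000):
    # Entropy-regularized OT solved by Sinkhorn-Knopp matrix scaling [8]:
    # gamma = diag(u) K diag(v) with K = exp(-C / reg).
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)        # scale columns to match the target marginal
        u = a / (K @ v)          # scale rows to match the source marginal
    return u[:, None] * K * v[None, :]

rng = np.random.RandomState(0)
Xs = rng.rand(50, 2)                                   # points inside the unit square
theta = rng.rand(80) * 2 * np.pi
Xt = np.c_[np.cos(theta), np.sin(theta)] * (1.2 + 0.3 * rng.rand(80))[:, None]  # annulus

a = np.full(50, 1. / 50)                               # uniform empirical weights
b = np.full(80, 1. / 80)
C = np.sqrt(((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1))   # Euclidean cost matrix

gamma = sinkhorn_coupling(a, b, C)
print("regularized transport cost:", (gamma * C).sum())        # approximates W_1
```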

2.2 Domain Adaptation and Optimal Transport

The problem of DA is formalized as follows: we define a domain as a pair consisting of a distribution \(\mu _D\) on \(\varOmega \) and a labeling function \(f_D: \varOmega \rightarrow [0,1]\). A hypothesis class H is a set of functions so that \(\forall h \in H, h : \varOmega \rightarrow \lbrace 0,1\rbrace \).

Definition 1

Given a convex loss function l, the expected disagreement (error) according to the distribution \(\mu _D\) between a hypothesis \(h \in H\) and a labeling function \(f_D\) (which can also be a hypothesis) is defined as

$$\epsilon _D (h,f_D) = \mathbb {E}_{x \sim \mu _D} \left[ l(h(x),f_D(x))\right] .$$

When the source and target error functions are defined w.r.t. h and \(f_S\) or \(f_T\), we use the shorthand \(\epsilon _S (h, f_S) = \epsilon _S (h)\) and \(\epsilon _T (h, f_T) = \epsilon _T (h)\). We further denote by \(\langle \mu _S, f_S \rangle \) the source domain and by \(\langle \mu _T, f_T \rangle \) the target domain. The ultimate goal of DA is then to learn a hypothesis h on \(\langle \mu _S, f_S \rangle \) that performs well on \(\langle \mu _T, f_T \rangle \).

In the unsupervised DA problem, one usually has access to a set of source data instances \(\varvec{X_S} = \{\varvec{x}^{i}_S \in \mathbb {R}^d \}_{i=1}^{N_S}\) associated with labels \(\{ y^i_S\}_{i=1}^{N_S}\), and to a set of unlabeled target data instances \(\varvec{X_T} = \{\varvec{x}^{i}_T \in \mathbb {R}^d \}_{i=1}^{N_T}\). Contrary to the classic learning paradigm, unsupervised DA assumes that the marginal distributions of \(\varvec{X_S}\) and \(\varvec{X_T}\) are different and given by \(\mu _S, \mu _T \in \mathcal {P}\left( \varOmega \right) \), respectively.

The optimal transportation problem was first applied to DA in [6, 7]. The main underlying idea of their work is to find a coupling matrix that efficiently transports source samples to target ones by solving the following optimization problem:

$$\gamma _o = \mathop {\mathrm {arg\,min}}_{\gamma \in \varPi (\hat{\mu }_S,\hat{\mu }_T)}\langle C, \gamma \rangle _F.$$

Once the optimal coupling \(\gamma _o\) is found, the source samples \(\varvec{X_S}\) can be transformed into target-aligned source samples \(\varvec{\hat{X}_S}\) using the following equation:

$$\varvec{\hat{X}_S} = \text {diag}((\gamma _o \varvec{1})^{-1})\gamma _o \varvec{X_T}.$$
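In code, this barycentric mapping amounts to a matrix product followed by a row normalization; the sketch below is a minimal numpy illustration (the coupling used here is arbitrary, chosen only to show the shapes involved).

```python
import numpy as np

def barycentric_map(gamma_o, Xt):
    # hat{X}_S = diag((gamma_o 1)^{-1}) gamma_o X_T: each source point is sent to
    # the gamma-weighted average of the target points it is coupled with.
    row_mass = gamma_o.sum(axis=1, keepdims=True)      # gamma_o 1, shape (N_S, 1)
    return (gamma_o @ Xt) / row_mass

rng = np.random.RandomState(0)
gamma = rng.rand(5, 8)
gamma /= gamma.sum()                 # a joint distribution over 5 source x 8 target points
Xt = rng.rand(8, 2)
Xs_hat = barycentric_map(gamma, Xt)  # (5, 2) target-aligned source samples
print(Xs_hat.shape)
```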

The use of the Wasserstein distance here has an important advantage over other distances used in DA (see Sect. 5), as it preserves the topology of the data and admits a rather efficient estimation, as mentioned above. Furthermore, as shown in [6, 7], it improves current state-of-the-art results on benchmark computer vision data sets and has a very appealing intuition behind it.

3 Generalization Bounds with Wasserstein Distance

In this section, we introduce generalization bounds for the target error when the divergence between tasks’ distributions is measured by the Wasserstein distance.

3.1 A Bound Relating the Source and Target Error

We first consider the case of unsupervised DA where no labeled data are available in the target domain. We start with a lemma that relates the Wasserstein metric to the source and target error functions for an arbitrary pair of hypotheses, working in a Reproducing Kernel Hilbert Space (RKHS). We then show how the target error can be bounded using the Wasserstein distance between empirical measures.

Lemma 1

Let \(\mu _S, \mu _T \in \mathcal {P}\left( \varOmega \right) \) be two probability measures on \(\mathbb {R}^d\). Assume that the cost function is \(c(\varvec{x},\varvec{y}) = \Vert \phi (\varvec{x}) - \phi (\varvec{y}) \Vert _{\mathcal {H}_{k_l}}\), where \(\mathcal {H}_{k_l}\) is a Reproducing Kernel Hilbert Space (RKHS) equipped with kernel \(k_l: \varOmega \times \varOmega \rightarrow \mathbb {R}\) induced by \(\phi : \varOmega \rightarrow \mathcal {H}_{k_l}\) with \(k_l(\varvec{x}, \varvec{y}) = \langle \phi (\varvec{x}), \phi (\varvec{y}) \rangle _{\mathcal {H}_{k_l}}\). Assume further that the loss function \(l_{h,f}:x \longrightarrow l(h(x),f(x))\) is convex, symmetric, bounded, obeys the triangle inequality and has the parametric form \(\vert h(x) - f(x) \vert ^q\) for some \(q > 0\). Assume also that the kernel \(k_l\) in the RKHS \(\mathcal {H}_{k_l}\) is square-root integrable w.r.t. both \(\mu _S,\mu _T\) for all \(\mu _S,\mu _T \in \mathcal {P}(\varOmega )\), where \(\varOmega \) is separable, and that \(0\le k_l(\varvec{x},\varvec{y}) \le K\) for all \(\varvec{x},\varvec{y} \in \varOmega \). Then the following holds

$$ \epsilon _T (h , h')\le \epsilon _S (h , h') + W_1(\mu _S,\mu _T)$$

for every pair of hypotheses \(h, h'\).

Proof

As this lemma plays a key role in the following sections, we give its proof here. We assume that \(l_{h,f}:x \longrightarrow l(h(x),f(x))\) in the definition of \(\epsilon (h)\) is a convex loss function defined \(\forall h,f \in \mathcal {F}\), where \(\mathcal {F}\) is the unit ball in the RKHS \(\mathcal {H}_k\). Considering that \(h,f \in \mathcal {F}\), the loss function l is a nonlinear mapping of the RKHS \(\mathcal {H}_{k}\) for the family of losses \(l(h(x),f(x)) = \vert h(x) - f(x) \vert ^q\). Using results from [23], one may show that \(l_{h,f}\) also belongs to an RKHS \(\mathcal {H}_{k_l}\) admitting the reproducing kernel \(k_l\), and that its norm obeys the following inequality:

$$\vert \vert l_{h,f} \vert \vert _{\mathcal {H}_{k_l}}^2 \le \vert \vert h - f \vert \vert _{\mathcal {H}_k}^{2q}.$$

This result gives us two important properties of \(l_{h,f}\) that we use in what follows:

  • \(l_{h,f}\) belongs to the RKHS that allows us to use the reproducing property;

  • the norm \(\vert \vert l_{h,f} \vert \vert _{\mathcal {H}_{k_l}}\) is bounded.

For simplicity, we assume that \(\vert \vert l_{h,f} \vert \vert _{\mathcal {H}_{k_l}}\) is bounded by 1. This assumption can be satisfied by imposing appropriate bounds on the norms of h and f, and the result extends easily to the case \(\vert \vert l_{h,f} \vert \vert _{\mathcal {H}_{k_l}} \le M\) by scaling, as explained in [15, Proposition 2]. We also note that q does not necessarily appear in the final result, as we seek to bound the norm of l rather than to give an explicit expression for it in terms of \(\Vert h \Vert _{\mathcal {H}_{k}}, \Vert f \Vert _{\mathcal {H}_{k}}\) and q. The error function defined above can now also be expressed in terms of an inner product in the corresponding Hilbert space, i.e.:

$$\epsilon _S (h, f_S) = \mathbb {E}_{x \sim \mu _S} [l(h(x),f_S(x))] = \mathbb {E}_{x \sim \mu _S} [\langle \phi (x),l\rangle _\mathcal {H}].$$

We define the target error in the same manner:

$$\epsilon _T (h, f_T) = \mathbb {E}_{y \sim \mu _T} [l(h(y),f_T(y))] = \mathbb {E}_{y \sim \mu _T} [\langle \phi (y),l\rangle _\mathcal {H}].$$

With the definitions introduced above, the following holds:

$$\begin{aligned} \epsilon _T (h , h')&= \epsilon _T (h , h') + \epsilon _S (h , h') - \epsilon _S (h , h') \\&= \epsilon _S (h , h') + \mathbb {E}_{y \sim \mu _T} [\langle \phi (y),l\rangle _\mathcal {H}] - \mathbb {E}_{x \sim \mu _S} [\langle \phi (x),l\rangle _\mathcal {H}] \\&= \epsilon _S (h , h') + \langle \mathbb {E}_{y \sim \mu _T} [\phi (y)] - \mathbb {E}_{x \sim \mu _S} [ \phi (x)] ,l\rangle _\mathcal {H}\\&\le \epsilon _S (h , h') + \Vert l \Vert _\mathcal {H} \Vert \mathbb {E}_{y \sim \mu _T} [\phi (y)] - \mathbb {E}_{x \sim \mu _S} [ \phi (x)]\Vert _\mathcal {H}\\&\le \epsilon _S (h , h') + \Vert \int _{\varOmega } \phi d(\mu _S - \mu _T) \Vert _\mathcal {H}. \end{aligned}$$

The second line is obtained by applying the reproducing property to l, and the third line follows from the linearity of the expected value. The fourth line is due to the Cauchy-Schwarz inequality for the inner product, while the fifth line uses \(\vert \vert l_{h,f} \vert \vert _{\mathcal {H}} \le 1\). Now, using the definition of the joint distribution \(\gamma \), we have the following:

$$\begin{aligned} \Vert \int _{\varOmega } \phi d(\mu _S - \mu _T) \Vert _\mathcal {H}&= \Vert \int _{\varOmega \times \varOmega } (\phi (\varvec{x}) - \phi (\varvec{y})) d\gamma (\varvec{x},\varvec{y}) \Vert _\mathcal {H}\\&\le \int _{\varOmega \times \varOmega } \Vert \phi (\varvec{x}) - \phi (\varvec{y}) \Vert _{\mathcal {H}} d\gamma (\varvec{x},\varvec{y}). \end{aligned}$$

As the last inequality holds for any \(\gamma \), we obtain the final result by taking the infimum over \(\gamma \) on the right-hand side, i.e.:

$$\begin{aligned} \Vert \int _{\varOmega } \phi \, d(\mu _S - \mu _T) \Vert _\mathcal {H} \le \inf _{\gamma \in \varPi (\mu _S, \mu _T)} \int _{\varOmega \times \varOmega } \Vert \phi (\varvec{x}) - \phi (\varvec{y}) \Vert _{\mathcal {H}} \, d\gamma (\varvec{x},\varvec{y}), \end{aligned}$$

which gives

$$\epsilon _T (h , h') \le \epsilon _S (h , h') + W_1(\mu _S,\mu _T).$$

   \(\Box \)

Remark 2

We note that the functional form of the loss function \(l(h(x),f(x)) = \vert h(x) - f(x) \vert ^q\) is just an example used as the basis for the proof. According to [23, Appendix 2], we may also consider more general nonlinear transformations of h and f that satisfy the assumptions imposed on \(l_{h,f}\) above. These transformations may include a product of the hypothesis and labeling functions, and thus the proposed result is also valid for the hinge loss.

This lemma makes use of the Wasserstein distance to relate the source and target errors. The key assumption here is that the cost function takes the form \(c(\varvec{x},\varvec{y}) = \Vert \phi (\varvec{x}) - \phi (\varvec{y}) \Vert _{\mathcal {H}}\). While it may seem restrictive, this assumption is, in fact, not that strong. Using the properties of the inner product, one has:

$$\begin{aligned} \Vert \phi (\varvec{x}) - \phi (\varvec{y}) \Vert _{\mathcal {H}}&= \sqrt{\langle \phi (\varvec{x}) - \phi (\varvec{y}), \phi (\varvec{x}) - \phi (\varvec{y}) \rangle _{\mathcal {H}}} \\&= \sqrt{k(\varvec{x},\varvec{x}) -2k(\varvec{x},\varvec{y})+k(\varvec{y},\varvec{y})}. \end{aligned}$$

Now, it can be shown that for any given positive-definite kernel k there is a distance c (used as a cost function in our case) that generates it, and vice versa (see Lemma 12 in [24]).
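To make this assumption concrete, the following sketch builds the kernel-induced cost \(c(\varvec{x},\varvec{y}) = \sqrt{k(\varvec{x},\varvec{x}) - 2k(\varvec{x},\varvec{y}) + k(\varvec{y},\varvec{y})}\) for a Gaussian RBF kernel; the kernel choice and bandwidth are illustrative assumptions, and the resulting matrix can be plugged into the OT problem of Sect. 2.1 as the cost matrix C.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_cost(Xs, Xt, kernel=rbf_kernel):
    # c(x, y) = ||phi(x) - phi(y)||_H = sqrt(k(x,x) - 2 k(x,y) + k(y,y)).
    Kst = kernel(Xs, Xt)
    kss = np.diag(kernel(Xs, Xs))[:, None]   # k(x, x) for the source points
    ktt = np.diag(kernel(Xt, Xt))[None, :]   # k(y, y) for the target points
    return np.sqrt(np.maximum(kss - 2 * Kst + ktt, 0.))  # clip tiny negative round-off

rng = np.random.RandomState(0)
Xs, Xt = rng.rand(30, 2), rng.rand(40, 2) + 0.5
C = kernel_cost(Xs, Xt)                      # (30, 40) cost matrix usable in the OT problem
print(C.shape, float(C.min()) >= 0.)
```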

In order to prove our next theorem, we first present an important result showing the convergence of an empirical measure \(\hat{\mu }\) to its true associated measure w.r.t. the Wasserstein metric. This concentration guarantee allows us to state generalization bounds based on the Wasserstein distance between finite samples rather than between the true population measures. Following [4], it can be specialized to the case of \(W_1\) as follows.

Theorem 1

([4], Theorem 1.1). Let \(\mu \) be a probability measure on \(\mathbb {R}^d\) such that for some \(\alpha >0\) we have \(\int _{\mathbb {R}^d} e^{\alpha \Vert x\Vert ^2}d\mu <\infty \), and let \(\hat{\mu } = \frac{1}{N}\sum _{i=1}^{N} \delta _{x_i}\) be its associated empirical measure defined on a sample of independent variables \(\{ x_i \}_{i=1}^N\) drawn from \(\mu \). Then for any \(d'>d\) and \(\varsigma ' < \sqrt{2}\), there exists some constant \(N_0\) depending on \(d'\) and some square-exponential moment of \(\mu \) such that for any \(\varepsilon > 0\) and \(N \ge N_0 \max (\varepsilon ^{-(d'+2)},1)\)

$$\mathbb {P} \left[ W_1(\mu , \hat{\mu }) > \varepsilon \right] \le \exp \left( -\frac{\varsigma '}{2} N \varepsilon ^2\right) ,$$

where \(d', \varsigma '\) can be calculated explicitly.

The convergence guarantee of this theorem can be further strengthened, as shown in [11], but we prefer this version for ease of reading. We can now use it in combination with Lemma 1 to prove the following theorem.

Theorem 2

Under the assumptions of Lemma 1, let \(\mathbf {X_S}\) and \(\mathbf {X_T}\) be two samples of size \(N_S\) and \(N_T\) drawn i.i.d. from \(\mu _S\) and \(\mu _T\) respectively. Let \(\hat{\mu }_S = \frac{1}{N_S}\sum _{i=1}^{N_S} \delta _{x_S^i}\) and \(\hat{\mu }_T = \frac{1}{N_T}\sum _{i=1}^{N_T} \delta _{x_T^i}\) be the associated empirical measures. Then for any \(d'>d\) and \(\varsigma ' < \sqrt{2}\) there exists some constant \(N_0\) depending on \(d'\) such that for any \(\delta > 0\) and \(\min (N_S,N_T) \ge N_0 \max (\delta ^{-(d'+2)},1)\) with probability at least \(1-\delta \) for all h the following holds:

$$\begin{aligned} \epsilon _T (h)\le \epsilon _S (h)&+ W_1(\hat{\mu }_S, \hat{\mu }_T) + \sqrt{2\log \left( \frac{1}{\delta }\right) /\varsigma '}\left( \sqrt{\frac{1}{N_S}}+\sqrt{\frac{1}{N_T}}\right) + \lambda , \end{aligned}$$

where \(\lambda \) is the combined error of the ideal hypothesis \(h^*\) that minimizes \(\epsilon _S(h)+\epsilon _T(h)\).

Proof

$$\begin{aligned} \epsilon _T (h)&\le \epsilon _T (h^*) + \epsilon _T (h^*,h) = \epsilon _T (h^*) + \epsilon _S (h,h^*) + \epsilon _T (h^*,h) - \epsilon _S (h,h^*)\\&\le \epsilon _T (h^*) + \epsilon _S (h,h^*) + W_1(\mu _S, \mu _T) \\&\le \epsilon _T (h^*) + \epsilon _S (h) + \epsilon _S (h^*) + W_1(\mu _S, \mu _T) \\&= \epsilon _S (h) + W_1(\mu _S, \mu _T) + \lambda \\&\le \epsilon _S (h) + W_1(\mu _S, \hat{\mu }_S) + W_1(\hat{\mu }_S, \mu _T) + \lambda \\&\le \epsilon _S (h) + \sqrt{2\log \left( \frac{1}{\delta }\right) /N_S\varsigma '} + W_1(\hat{\mu }_S, \hat{\mu }_T) + W_1(\hat{\mu }_T,\mu _T) + \lambda \\&\le \epsilon _S (h) + W_1(\hat{\mu }_S, \hat{\mu }_T) + \lambda + \sqrt{2\log \left( \frac{1}{\delta }\right) /\varsigma '}\left( \sqrt{\frac{1}{N_S}}+\sqrt{\frac{1}{N_T}}\right) . \end{aligned}$$

The second and fourth lines are obtained using the triangle inequality applied to the error function. The third line is a consequence of Lemma 1. The fifth line follows from the definition of \(\lambda \), while the sixth, seventh and eighth lines use the fact that the Wasserstein metric is a proper distance, together with Theorem 1.     \(\Box \)

A first immediate consequence of this theorem is that it justifies the use of optimal transportation in the DA context. However, we would like to stress that the bound does not suggest that the minimization of the Wasserstein distance can be carried out independently from the minimization of the source error, nor does it say that the joint error given by the \(\lambda \) term automatically becomes small. First, it is clear that minimizing \(W_1\) provides a transport of the source to the target such that \(W_1\) becomes small when computed between the newly transported source instances and the target instances. Under the hypothesis that class labeling is preserved by the transport, i.e. \(P_{\text {source}}(y|x_s)=P_{\text {target}}(y|\text {Transport}(x_s))\), adaptation is possible by minimizing \(W_1\) only. However, this is not a reasonable assumption in practice. Indeed, by minimizing the \(W_1\) distance only, the obtained transformation may transport one positive and one negative source instance to the same target point, in which case the empirical source error cannot be properly minimized. Additionally, the joint error is affected, since no classifier is able to separate these source points. One can also think of an extreme case where the positive source examples are transported onto negative target instances; in that case the joint error \(\lambda \) is dramatically affected. A solution is then to regularize the transport so as to help the minimization of the source error, which can be seen as a kind of joint optimization. This idea was partially implemented as a class-label regularization term added to the original optimal transport formulation in [6, 7] and showed good empirical results. The proposed regularized optimization problem reads

$$\min _{\gamma \in \varPi (\hat{\mu }_S,\hat{\mu }_T)}\langle C, \gamma \rangle _F - \frac{1}{\lambda }E(\gamma ) + \eta \sum _j \sum _c \Vert \gamma (I_c, j)\Vert _q^p.$$

Here, the second term \(E(\gamma ) = -\sum _{i,j}^{N_S,N_T} \gamma _{i,j}\log (\gamma _{i,j})\) is an entropic regularization term that allows one to solve the optimal transportation problem efficiently using the Sinkhorn-Knopp matrix scaling algorithm [25]. The second regularization term \(\eta \sum _j \sum _c \Vert \gamma (I_c, j)\Vert _q^p\), where \(I_c\) denotes the indices of the source examples of class c, prevents source examples of different classes from being transported to the same target examples by promoting group sparsity in the matrix \(\gamma \), thanks to the norm \(\Vert \cdot \Vert ^p_q\) with \(q = 1\) and \(p = \frac{1}{2}\). In some way, this regularization term influences the capability term by encouraging the existence of a good hypothesis that is discriminant on both source and target domain data. Another recent paper [28] also suggests that regularizing the transport is important for the use of OT in domain adaptation tasks. We thus conclude that regularized transport formulations such as the one of [6, 7] can be seen as algorithmic solutions for controlling the trade-off between the terms of the bound.
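As an illustration of this regularized formulation, the following sketch relies on the POT (Python Optimal Transport) package; the estimator name SinkhornLpl1Transport and its parameters reg_e (entropic term) and reg_cl (class-based term) reflect our reading of that library and should be treated as assumptions rather than the authors' original implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

rng = np.random.RandomState(0)
Xs = rng.randn(100, 2)                          # labeled source sample
ys = (Xs[:, 0] > 0).astype(int)
Xt = rng.randn(120, 2) + np.array([1., 1.])     # unlabeled, shifted target sample

# Entropic + class-based (group-sparse) regularization in the spirit of [6, 7];
# reg_e plays the role of 1/lambda and reg_cl the role of eta (assumed parameter names).
mapper = ot.da.SinkhornLpl1Transport(reg_e=1.0, reg_cl=0.1)
mapper.fit(Xs=Xs, ys=ys, Xt=Xt)
Xs_aligned = mapper.transform(Xs=Xs)            # target-aligned source samples

# A classifier trained on (Xs_aligned, ys) can then be applied directly to Xt.
print(Xs_aligned.shape)
```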

Assuming that \(\epsilon _S (h)\) is properly minimized, only \(\lambda \) and the Wasserstein distance between the empirical measures defined on the source and target samples have an impact on the potential success of adaptation. Furthermore, the fact that the Wasserstein distance is defined in terms of the optimal coupling used to solve the DA problem, and is not restricted to any particular hypothesis class, directly influences \(\lambda \), as discussed above. We now proceed to give similar bounds for the case where one has access to some labeled instances in the target domain.
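As a purely numerical illustration of how the terms of Theorem 2 combine, the sketch below evaluates the observable part of the right-hand side for given sample sizes; the value chosen for the constant \(\varsigma '\) is an assumption (Theorem 1 only guarantees \(\varsigma ' < \sqrt{2}\)), and the \(\lambda \) term is omitted since it requires target labels.

```python
import numpy as np

def observable_bound(src_err, w1_hat, N_S, N_T, delta=0.05, varsigma=1.0):
    # Empirical source error + empirical W1 + the confidence term of Theorem 2.
    # varsigma stands for the (unknown) constant varsigma' and is an assumption here.
    conf = np.sqrt(2 * np.log(1 / delta) / varsigma) * (1 / np.sqrt(N_S) + 1 / np.sqrt(N_T))
    return src_err + w1_hat + conf

# e.g. a source error of 0.10 and an empirical W1 of 0.25 estimated on 500/400 samples:
print(observable_bound(0.10, 0.25, N_S=500, N_T=400))
```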

3.2 A Learning Bound for the Combined Error

In semi-supervised DA, where we have access to an additional small set of labeled instances in the target domain, the goal is often to find a trade-off between minimizing the source and the target errors depending on the number of instances available in each domain and their mutual correlation. Let us now assume that we possess \(\beta n\) instances drawn independently from \(\mu _T\) and \((1-\beta )n\) instances drawn independently from \(\mu _S\), labeled by \(f_T\) and \(f_S\), respectively. In this case, the empirical combined error [2] is defined as a convex combination of the errors on the source and target training data:

$$\hat{\epsilon }_{\alpha }(h) = \alpha \hat{\epsilon }_T(h) + (1-\alpha )\hat{\epsilon }_S(h),$$

where \(\alpha \in [0,1]\).
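For concreteness, a toy sketch of this weighted empirical error is given below; the fixed linear scorer, the absolute loss and the data are placeholders used only to show how the trade-off in \(\alpha \) can be scanned.

```python
import numpy as np

def combined_error(h, Xs, ys, Xt_lab, yt_lab, alpha, loss):
    # hat{epsilon}_alpha(h) = alpha * hat{epsilon}_T(h) + (1 - alpha) * hat{epsilon}_S(h)
    err_S = loss(h(Xs), ys).mean()
    err_T = loss(h(Xt_lab), yt_lab).mean()
    return alpha * err_T + (1 - alpha) * err_S

rng = np.random.RandomState(0)
Xs, ys = rng.randn(200, 2), rng.randint(0, 2, 200)               # abundant source data
Xt_lab, yt_lab = rng.randn(20, 2) + 1.0, rng.randint(0, 2, 20)   # few labeled target points
h = lambda X: (X[:, 0] + X[:, 1] > 1.0).astype(float)            # a fixed toy hypothesis
loss = lambda pred, y: np.abs(pred - y)

for alpha in np.linspace(0., 1., 5):
    print(round(float(alpha), 2), combined_error(h, Xs, ys, Xt_lab, yt_lab, alpha, loss))
```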

The use of the combined error is motivated by the fact that if the number of instances in the target sample is small compared to the number of instances in the source domain (which is usually the case in DA), minimizing only the target error may not be appropriate. Instead, one may want to find a suitable value of \(\alpha \) that ensures the minimum of \(\hat{\epsilon }_{\alpha }(h)\) w.r.t. a given hypothesis h. We now prove a theorem for the combined error similar to the one presented in [2].

Theorem 3

Under the assumptions of Theorem 2 and Lemma 1, let D be a labeled sample of size n with \(\beta n\) points drawn from \(\mu _T\) and \((1-\beta ) n\) from \(\mu _S\), with \(\beta \in (0,1)\), labeled according to \(f_T\) and \(f_S\), respectively. If \(\hat{h}\) is the empirical minimizer of \(\hat{\epsilon }_\alpha (h)\) and \(h^*_T = \underset{h}{\arg \min } \ \epsilon _T(h)\), then for any \(\delta \in (0,1)\), with probability at least \(1-\delta \) (over the choice of samples),

$$\epsilon _T(\hat{h}) \le \epsilon _T(h_T^*) + c_1 + 2(1-\alpha )(W_1(\hat{\mu }_S, \hat{\mu }_T) + \lambda + c_2),$$

where

$$\begin{aligned} c_1 =&\ 2 \sqrt{\frac{2K\left( \frac{(1-\alpha )^2}{1-\beta }+\frac{\alpha ^2}{\beta }\right) \log (2/\delta )}{n}} +4 \sqrt{K/n} \left( \frac{\alpha }{n\beta \sqrt{\beta } } + \frac{(1-\alpha )}{n(1-\beta )\sqrt{1-\beta } }\right) , \end{aligned}$$
$$\begin{aligned} c_2 =&\ \sqrt{2\log \left( \frac{1}{\delta }\right) /\varsigma '}\left( \sqrt{\frac{1}{N_S}}+\sqrt{\frac{1}{N_T}}\right) . \end{aligned}$$

Proof

$$\begin{aligned} \epsilon _T(\hat{h})&\le \epsilon _{\alpha }(\hat{h}) + (1-\alpha ) (W_1(\mu _S,\mu _T)+\lambda )\\&\le \hat{\epsilon }_{\alpha }(\hat{h}) + \sqrt{\frac{2K\left( \frac{(1-\alpha )^2}{1-\beta }+\frac{\alpha ^2}{\beta }\right) \log (2/\delta )}{n}}+ (1-\alpha ) (W_1(\mu _S,\mu _T)+\lambda ) \\&\quad +2 \sqrt{K/n} \left( \frac{\alpha }{n\beta \sqrt{\beta } }+ \frac{(1-\alpha )}{n(1-\beta )\sqrt{1-\beta } }\right) \\&\le \hat{\epsilon }_{\alpha }(h_T^*) + \sqrt{\frac{2K\left( \frac{(1-\alpha )^2}{1-\beta }+\frac{\alpha ^2}{\beta }\right) \log (2/\delta )}{n}}+ (1-\alpha )(W_1(\mu _S,\mu _T)+\lambda ) \\&\quad + 2 \sqrt{K/n} \left( \frac{\alpha }{n\beta \sqrt{\beta } } + \frac{(1-\alpha )}{n(1-\beta )\sqrt{1-\beta } }\right) \\&\le \epsilon _{\alpha }(h_T^*) + 2\sqrt{\frac{2K\left( \frac{(1-\alpha )^2}{1-\beta }+\frac{\alpha ^2}{\beta }\right) \log (2/\delta )}{n}}+ (1-\alpha )(W_1(\mu _S,\mu _T)+\lambda )\\&\quad +4 \sqrt{K/n} \left( \frac{\alpha }{n\beta \sqrt{\beta } } + \frac{(1-\alpha )}{n(1-\beta )\sqrt{1-\beta } }\right) \\&\le \epsilon _{T}(h_T^*) + 2\sqrt{\frac{2K\left( \frac{(1-\alpha )^2}{1-\beta }+\frac{\alpha ^2}{\beta }\right) \log (2/\delta )}{n}} +2(1-\alpha ) (W_1(\mu _S,\mu _T)+\lambda ) \\&\quad + 4 \sqrt{K/n} \left( \frac{\alpha }{n\beta \sqrt{\beta } } + \frac{(1-\alpha )}{n(1-\beta )\sqrt{1-\beta } }\right) \\&\le \epsilon _T(h_T^*) + c_1 + 2(1-\alpha )(W_1(\hat{\mu }_S, \hat{\mu }_T) + \lambda + c_2). \end{aligned}$$

The proof follows the standard theory of uniform convergence for empirical risk minimizers. Lines 1 and 5 are obtained by observing that \( \vert \epsilon _{\alpha }(h) - \epsilon _T(h)\vert = \vert \alpha \epsilon _{T}(h)+(1-\alpha )\epsilon _S(h)-\epsilon _T(h) \vert = \vert (1-\alpha )(\epsilon _{S}(h)-\epsilon _T(h)) \vert \le (1-\alpha )(W_1(\mu _T,\mu _{S})+\lambda ) \), where the last inequality comes from line 4 of the proof of Theorem 2; line 3 follows from the definition of \(\hat{h}\) and \(h_T^*\); line 6 is a consequence of Theorem 1. Finally, lines 2 and 4 are obtained from a concentration inequality for \(\epsilon _{\alpha }(h)\). Due to the lack of space, we defer this result to the Supplementary material.    \(\Box \)

This theorem shows that the best hypothesis taking into account both source and target labeled data (i.e., \(0 \le \alpha < 1 \)) performs at least as well as the best hypothesis learned on target data instances alone (\(\alpha = 1\)). This result agrees with the intuition that semi-supervised DA approaches should be at least as good as unsupervised ones.

4 Multi-source Domain Adaptation

We now consider the case where not one but many source domains are available during adaptation. More formally, we define N different source domains (of which the target T may or may not be a part). For each source j, we have a labeled sample \(S_j\) of size \(n_j = \beta _j n\) \(\left( \sum _{j=1}^N \beta _j = 1, \sum _{j=1}^N n_j = n\right) \) drawn from the associated unknown distribution \(\mu _{S_j}\) and labeled by \(f_j\). We now consider the empirical weighted multi-source error of a hypothesis h defined for some vector \(\varvec{\alpha } = \{\alpha _1, \dots , \alpha _N \}\) as follows:

$$\hat{\epsilon }_{\varvec{\alpha }}(h) = \sum _{j=1}^N\alpha _j \hat{\epsilon }_{S_{j}}(h),$$

where \(\sum _{j=1}^N\alpha _j = 1\) and each \(\alpha _j\) represents the weight of the source domain \(S_j\).

In what follows, we show that generalization bounds obtained for the weighted error give some interesting insights into the application of the Wasserstein distance to multi-source DA problems.

Theorem 4

With the assumptions of Theorem 2 and Lemma 1, let S be a sample of size n, where for each \(j \in \{1,\dots ,N\}\), \(\beta _j n\) points are drawn from \(\mu _{S_j}\) and labeled according to \(f_{j}\). If \(\hat{h}_{\varvec{\alpha }}\) is the empirical minimizer of \(\hat{\epsilon }_{\varvec{\alpha }}(h)\) and \(h^*_T = \underset{h}{\arg \min } \ \epsilon _T(h)\), then for any fixed \(\varvec{\alpha }\) and \(\delta \in (0,1)\), with probability at least \(1-\delta \) (over the choice of samples),

$$\epsilon _T(\hat{h}_{\varvec{\alpha }}) \le \epsilon _T(h_T^*) + c_1 + 2\sum _{j=1}^N \alpha _j \left( W_1(\hat{\mu }_j, \hat{\mu }_T)+\lambda _j+c_2\right) ,$$

where

$$c_1 = 2 \sqrt{\frac{2K\sum _{j=1}^N \frac{\alpha _j^2}{\beta _j} \log (2/\delta )}{n}}+2\sqrt{\sum _{j=1}^N\frac{K\alpha _j}{\beta _jn}},$$
$$c_2 = \sqrt{2\log \left( \frac{1}{\delta }\right) /\varsigma '}\left( \sqrt{\frac{1}{N_{S_j}}}+\sqrt{\frac{1}{N_T}}\right) ,$$

where \(\lambda _j = \underset{h}{\min } \ (\epsilon _{S_j}(h)+\epsilon _T(h))\) represents the joint error for each source domain j.

Proof

The proof of this theorem is very similar to that of Theorem 3. The final result is obtained by applying the concentration inequality for \(\epsilon _{\varvec{\alpha }}(h)\) (instead of the one used for \(\epsilon _{\alpha }(\hat{h})\) in the proof of Theorem 3) and by using the following inequality, which can be obtained easily by following the principle of the proof of [2, Theorem 4]:

$$\vert \epsilon _{\varvec{\alpha }}(h) - \epsilon _T(h) \vert \le \sum _{j=1}^N\alpha _j \left( W_1(\mu _j, \mu _T) + \lambda _j \right) ,$$

where \(\lambda _j = \underset{h}{\min } \ (\epsilon _{S_j}(h)+\epsilon _T(h))\). For the sake of completeness, we present the concentration inequality for \(\epsilon _{\varvec{\alpha }}(h)\) in the Supplementary material.   \(\Box \)

While the results for multi-source DA may look like a straightforward extension of the theoretical guarantees obtained for the case of two domains, they provide a fruitful implication of their own. As in the previous case, we consider that the term that should be minimized in this bound by a given multi-source DA algorithm is \(\sum _{j=1}^N \alpha _j W_1(\hat{\mu }_j, \hat{\mu }_T)\).

Let \(\hat{\mu }\) be an arbitrary empirical probability measure on \(\mathbb {R}^d\). Using the triangle inequality and bearing in mind that \(\alpha _j\le 1\) for all j, we can bound this term as follows:

$$\sum _{j=1}^N \alpha _j W_1(\hat{\mu }_j, \hat{\mu }_T) \le (\sum _{j=1}^N \alpha _j W_1(\hat{\mu }_j, \hat{\mu })) + NW_1(\hat{\mu },\hat{\mu }_T).$$

Now, let us consider the following optimization problem

$$\begin{aligned} \inf _{\hat{\mu } \in \mathcal {P}(\varOmega )} \frac{1}{N} \sum _{j=1}^N \alpha _j W_1(\hat{\mu }_j, \hat{\mu })+ W_1(\hat{\mu },\hat{\mu }_T). \end{aligned}$$
(1)

In this formulation, the first term \(\frac{1}{N} \sum _{j=1}^N \alpha _j W_1(\hat{\mu }_j, \hat{\mu })\) corresponds exactly to the problem known in the literature as the Wasserstein barycenters problem [1] that can be defined for \(W_1\) as follows.

Definition 2

For N probability measures \(\mu _1, \mu _2, \dots , \mu _N \in \mathcal {P}(\varOmega )\), an empirical Wasserstein barycenter is a minimizer \(\mu ^*_N \in \mathcal {P}(\varOmega )\) of \(J_N(\mu ) = \frac{1}{N}\sum _{i=1}^N a_iW_1(\mu , \mu _i)\), where \(a_{i}>0\) for all i and \(\sum _{i=1}^N a_i = 1\).

The second term \(W_1(\hat{\mu },\hat{\mu }_T)\) of Eq. 1 finds the probabilistic coupling that transports the barycenter to the target distribution. Altogether, this bound suggests that in order to adapt in the multi-source scenario, one can proceed by finding a barycenter of the source probability distributions and then transporting it to the target probability distribution.
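The two-step strategy suggested by this bound (compute a barycenter of the source measures, then transport it onto the target) can be sketched with the POT package as follows; the routine ot.lp.free_support_barycenter works with the quadratic Euclidean cost discussed below, and its name and signature, as well as the toy data, are assumptions about that library rather than part of the present paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed available)

rng = np.random.RandomState(0)
# Three toy source point clouds and one shifted target cloud.
sources = [rng.randn(60, 2) + np.array(c) for c in ([0., 0.], [2., 0.], [0., 2.])]
Xt = rng.randn(80, 2) + np.array([1.5, 1.5])

src_weights = [np.full(len(X), 1. / len(X)) for X in sources]
alpha = np.full(len(sources), 1. / len(sources))   # source weights alpha_j

# Step 1: free-support Wasserstein barycenter of the empirical source measures
# (POT's routine uses the quadratic Euclidean cost, in line with the discussion below).
X_bar = ot.lp.free_support_barycenter(sources, src_weights,
                                      X_init=rng.randn(60, 2), weights=alpha)

# Step 2: transport the barycenter onto the empirical target measure.
a = np.full(len(X_bar), 1. / len(X_bar))
b = np.full(len(Xt), 1. / len(Xt))
C = ot.dist(X_bar, Xt, metric='euclidean')
gamma = ot.emd(a, b, C)                            # exact OT plan barycenter -> target
print("W1(barycenter, target) ~", float((gamma * C).sum()))
```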

On the other hand, the optimization problem related to Wasserstein barycenters is closely related to the multimarginal optimal transportation problem [19], where the goal is to find a probabilistic coupling that aligns N distinct probability measures. Indeed, as shown in [1], for a quadratic Euclidean cost function the solution \(\mu ^*_N\) of the barycenter problem in the Wasserstein space is given by the following equation:

$$\mu ^*_N = \sum _{k \in \{k_1, \dots , k_N \}} \gamma _k \delta _{A_{k}(x)},$$

where \(A_{k}(x) = \sum _{j=1}^N a_j x_{k_j}\) and \(\varvec{\gamma } \in \mathbb {R}^{\prod _{j=1}^N n_j}\) is an optimal coupling solving, for all \(k \in \{1, \dots , N \}\), the multimarginal optimal transportation problem with the following cost:

$$c_k = \sum _{j=1}^N \frac{a_j}{2} \Vert x_{k_j} - A_{k}(x) \Vert ^2.$$

We note that this reformulation is particularly useful when the source distributions are assumed to be Gaussian: in this case, there exists a closed-form solution for the multimarginal optimal transportation problem [14], and thus for the Wasserstein barycenter problem too. Finally, it is also worth noticing that the optimization problem of Eq. 1 has already been introduced to solve the multi-view learning task [12]. In that formulation, the second term is referred to as a priori knowledge about the barycenter, which, in our case, is given explicitly by the target probability measure.

5 Comparison to Other Existing Bounds

As mentioned in the introduction, numerous papers have proposed DA generalization bounds. The main difference between them lies in the distance used to measure the divergence between the source and target probability distributions. The seminal work of [3] considered a modification of the total variation distance called the H-divergence, given by the following equation:

$$d_H(p,q) = 2\sup _{h \in H} \vert p(h(x)=1) - q(h(x)=1)\vert .$$

On the other hand, [5, 15] proposed to replace it with the discrepancy distance:

$$\text {disc}(p,q) = \max _{h,h' \in H} \vert \epsilon _p(h,h') - \epsilon _q(h,h')\vert .$$

The latter was shown to be tighter in some plausible scenarios. More recent works on generalization bounds using the integral probability metric

$$\text {D}_{\mathcal {F}} (p,q) = \sup _{f \in \mathcal {F}} \vert \int fdp - \int fdq \vert $$

and Rényi divergence

$$D_\alpha (p\Vert q) = \frac{1}{\alpha - 1} \log \left( \sum _{i=1}^n \frac{p_i^\alpha }{q_i^{\alpha - 1}}\right) $$

were presented in [16, 27], respectively. [27] provides a comparative analysis of discrepancy-based and integral-metric-based bounds and shows that the former are less tight. [16] derives domain adaptation bounds in the multi-source scenario by assuming that a good hypothesis can be learned as a weighted convex combination of hypotheses from all the available sources. Given the considerable amount of previous work on the subject, a natural question arises about the tightness of the DA bounds based on the Wasserstein metric introduced above, notwithstanding Theorem 3.

The answer to this question is partially given by the Csiszár-Kullback-Pinsker inequality [20], which holds for any two probability measures \(p, q \in \mathcal {P}(\varOmega )\):

$$W_1(p,q) \le \text {diam}(\varOmega )\Vert p-q \Vert _{\text {TV}} \le \sqrt{2\text {diam}(\varOmega )\text {KL}(p\Vert q)},$$

where \(\text {diam}(\varOmega ) = \sup _{x,y \in \varOmega } \{ d(x,y)\}\) and \(\text {KL}(p\Vert q)\) is the Kullback-Leibler divergence.

A first consequence of this inequality is that the Wasserstein distance not only appears naturally and offers algorithmic advantages in DA, but also gives tighter bounds than the total variation distance (\(L_1\)) used in [2, Theorem 1]. On the other hand, it is also tighter than the bounds presented in [16], as the Wasserstein metric can be bounded by the Kullback-Leibler divergence, which is a special case of the Rényi divergence when \(\alpha \rightarrow 1\), as shown in [10]. Regarding the discrepancy distance, and omitting the hypothesis class restriction, one has \(d_{min}\text {disc}(p,q) \le W_1(p,q)\), where \(d_{min} = \min _{x \ne y \in \varOmega } \{ d(x,y)\}\). This inequality, however, is not very informative, as the minimum distance between two distinct points can be dramatically small, making it impossible to compare the considered distances directly.
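To make these comparisons concrete, the following sketch computes the three quantities of the Csiszár-Kullback-Pinsker chain for two discrete measures supported on [0, 1] (so that \(\text {diam}(\varOmega ) = 1\)); the half-\(L_1\) convention used for the total variation norm is our assumption.

```python
import numpy as np
from scipy.stats import wasserstein_distance

support = np.linspace(0., 1., 20)             # common support, diam(Omega) = 1
rng = np.random.RandomState(0)
p = rng.dirichlet(np.ones(20))                # two strictly positive histograms
q = rng.dirichlet(np.ones(20))

W1 = wasserstein_distance(support, support, p, q)
TV = 0.5 * np.abs(p - q).sum()                # total variation (half-L1 convention)
KL = float(np.sum(p * np.log(p / q)))         # Kullback-Leibler divergence KL(p||q)

# The chain from the text, with diam(Omega) = 1:
print(f"W1 = {W1:.3f} <= TV = {TV:.3f} <= sqrt(2*KL) = {np.sqrt(2 * KL):.3f}")
```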

Regarding computational guarantees, we note that the H-divergence used in [3] is defined as the error of the best hypothesis distinguishing between the source and target domain samples pseudo-labeled with 0's and 1's, and its computation is thus intractable in practice. For the discrepancy distance, the authors provided a linear-time algorithm for its calculation in the 1D case and showed that in other cases it scales as \(O(N_S^2d^{2.5} + N_Td^2)\) when the squared loss is used [15]. In its turn, the Wasserstein distance with entropic regularization can be computed with the linear-time Sinkhorn-Knopp algorithm regardless of the choice of the cost function c, which presents a clear advantage over the other distances considered above.

Finally, none of the distances previously used in generalization bounds for DA takes into account the geometry of the space, which makes the Wasserstein distance a powerful and precise tool to measure the divergence between domains.

6 Conclusion

In this paper, we studied the problem of DA in the optimal transportation context. Motivated by existing algorithmic advances in domain adaptation, we presented generalization bounds for both single- and multi-source learning scenarios where the distance between the source and target probability distributions is measured by the Wasserstein metric. Apart from the distance term, which taken alone justifies the use of optimal transport in domain adaptation, the obtained bounds also include a capability term depicting the existence of a good hypothesis for both source and target domains. A direct consequence of its appearance in the bounds is the need to regularize the optimal transportation plan in a way that ensures efficient learning in the source domain once the interpolation is done. This regularization, achieved in [6, 7] by means of class-based regularization, can thus also be viewed as an implication of the obtained results. Furthermore, it explains the superior performance of both class-based and Laplacian-regularized optimal transport in domain adaptation compared to its simple entropy-regularized form. On the other hand, we also showed that the use of the Wasserstein distance leads to tighter bounds compared to bounds based on the total variation distance and the Rényi divergence, and is more computationally attractive than some other existing approaches. From the analysis of the bounds obtained for multi-source DA, we derived a new algorithmic idea that suggests the minimization of two terms: the first corresponds to the Wasserstein barycenter problem computed on the empirical source measures, while the second solves the optimal transport problem between this barycenter and the empirical target measure.

The future perspectives of this work are numerous and concern both the derivation of new algorithms for domain adaptation and the demonstration of new theoretical results. First of all, we would like to study the extent to which the cost function used in the derivation of the bounds can be applied to real-world DA problems. This cost, defined as the norm of the difference between two feature maps, offers flexibility in the calculation of the optimal transport metric due to its kernel representation. Secondly, we aim to produce new concentration inequalities for the \(\lambda \) term that would allow one to bound the true best joint hypothesis by its empirical counterpart. Such concentration inequalities would make it possible to assess the adaptability of two domains from the given labeled samples, while the speed of convergence may show how many data instances from the source domains are needed to obtain a reliable estimate of \(\lambda \). Finally, the introduction of the Wasserstein distance to the bounds means that new DA algorithms can be designed based on other optimal coupling techniques, including, for instance, the Knothe-Rosenblatt coupling and the Moser coupling.