1 Introduction

Domain adaptation problems arise whenever we need to leverage labeled data in one or more related source domains to learn a classifier for unseen or unlabeled data in a target domain. The domains are assumed to be related, but not identical. The underlying domain shift occurs in multiple real-world applications. Numerous approaches have been proposed in recent years to address textual and visual domain adaptation (we refer the reader to [23, 32, 36] for recent surveys on transfer learning and domain adaptation methods). For text data, domain shift is frequent in named entity recognition, statistical machine translation, opinion mining, speech tagging and document ranking [3, 11, 33, 41]. Domain adaptation has equally received a lot of attention in computer vision [1, 13-15, 17, 20-22, 29, 34, 35], where domain shift is a consequence of changing conditions, such as background, location or pose, or of considering different image types, such as photos, paintings and sketches [4, 9, 25].

In this paper, we build on an approach to domain adaptation based on noise marginalization [5]. In deep learning, a denoising autoencoder (DA) learns a robust feature representation from training examples. In the case of domain adaptation, it takes the unlabeled instances of both source and target data and learns a new feature representation by reconstructing the original features from their noised counterparts. A marginalized denoising autoencoder (MDA) marginalizes the noise at training time; it avoids explicit data corruption and does not require an iterative optimization procedure for learning the model parameters, but instead computes the model in closed form. This makes MDAs scalable and computationally faster than regular denoising autoencoders. The principle of noise marginalization has been successfully extended to learning with corrupted features [30], link prediction and multi-label learning [6], relational learning [7], collaborative filtering [26] and heterogeneous cross-domain learning [27, 40].

Marginalized domain adaptation refers to denoising the source and target instances in such a way that their features become explicitly domain invariant. To achieve this goal, we extend the MDA with a domain regularization term and explore three variants of this regularization. The first uses the maximum mean discrepancy (MMD) measure [24]. The second is inspired by the adversarial learning of deep neural networks [19]. The third is based on preserving accurate classification of the denoised source instances. In all cases, the regularization term belongs to the class of squared loss functions. This guarantees the noise marginalization and the computational efficiency, with the solution obtained either in closed form or by solving a Sylvester linear matrix equation \(\mathbf{A}\mathbf{X}+\mathbf{X}\mathbf{B}=\mathbf{C}\).

2 Feature Denoising for Domain Adaptation

Let \(\mathbf{X}^s=[\mathbf{X}_1,\ldots ,\mathbf{X}_{n_S}]\) denote a set of \(n_S\) source domains with the corresponding labels \(\mathbf{Y}^s=[\mathbf{Y}_1,\ldots ,\mathbf{Y}_{n_S}]\), and let \(\mathbf{X}^t\) denote the unlabeled target domain data. The Marginalized Denoising Autoencoder (MDA) approach [5] reconstructs the input data from partial random corruption [39], with a marginalization that yields the optimal reconstruction weights \(\mathbf{W}\) in closed form. The MDA minimizes the loss:

$$\begin{aligned} \mathcal {L}(\mathbf{W}, \mathbf{X}) = \frac{1}{K} \sum _{k = 1}^K\Vert \mathbf{X}-\tilde{\mathbf{X}}_k \mathbf{W}\Vert ^2 + \omega \Vert \mathbf{W}\Vert ^2, \end{aligned}$$
(1)

where \(\tilde{\mathbf{X}}_k \in \mathrm{I\!R}^{N \times d}\) is the k-th corrupted version of \(\mathbf{X}=[\mathbf{X}^s, \mathbf{X}^t]\) obtained by random feature dropout with probability p, \(\mathbf{W}\in \mathrm{I\!R}^{d \times d}\), and \(\omega \Vert \mathbf W \Vert ^2\) is a regularization term. To avoid explicit feature corruption and an iterative optimization, Chen et al. [5] have shown that in the limiting case \(K\rightarrow \infty \), the weak law of large numbers allows rewriting \(\mathcal {L}(\mathbf{W},\mathbf{X})\) as its expectation. The optimal solution is \(\mathbf{W}=(\mathbf{Q}+\omega \mathbf{I}_d)^{-1} \mathbf{P}\), where \(\mathbf{P}=\mathbb {E}[\mathbf{X}^\top \tilde{\mathbf{X}}]\) and \(\mathbf{Q}=\mathbb {E}[\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}]\) depend only on the covariance matrix \(\mathbf{S}=\mathbf{X}^{\top } \mathbf{X}\) of the uncorrupted data and on the noise level p:

$$\begin{aligned} \mathbf{P}= (1-p)\mathbf{S}\quad \mathrm{and} \quad \mathbf{Q}_{ij} = \begin{cases} \mathbf{S}_{ij} (1-p)^2, & \mathrm{if}\ i \ne j,\\ \mathbf{S}_{ij} (1-p), & \mathrm{if}\ i = j. \end{cases} \end{aligned}$$
(2)
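For concreteness, the closed-form computation above can be sketched in a few lines of NumPy; the function and variable names below are ours, and the snippet only illustrates Eq. (2) together with the solution \(\mathbf{W}=(\mathbf{Q}+\omega \mathbf{I}_d)^{-1}\mathbf{P}\):

```python
import numpy as np

def mda_closed_form(X, p=0.1, omega=0.01):
    """Marginalized denoising: closed-form reconstruction weights W.

    X     : (N, d) data matrix with source and target rows stacked
    p     : feature dropout (corruption) probability
    omega : weight of the ridge term omega * ||W||^2
    """
    S = X.T @ X                                   # covariance matrix of the clean data
    d = S.shape[0]
    Q = S * (1 - p) ** 2                          # off-diagonal entries: S_ij (1-p)^2
    np.fill_diagonal(Q, np.diag(S) * (1 - p))     # diagonal entries:     S_ii (1-p)
    P = (1 - p) * S                               # P = (1-p) S
    return np.linalg.solve(Q + omega * np.eye(d), P)   # W = (Q + omega I)^-1 P
```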

2.1 Domain Regularization

To better address domain adaptation, we extend the feature denoising with a domain regularization that favors the learning of domain invariant features. We explore three versions of the domain regularization, combine each of them with the loss (1), and show how to marginalize the noise in each case while keeping \(\mathbf{W}\) the solution of a linear matrix equation. The three versions of the domain regularization are as follows:

Regularization \(\mathcal {R}_{m}\) Based on the Maximum Mean Discrepancy (MMD) with the Linear Kernel. It aims at reducing the gap between the denoised domain means. The MMD has already been used for domain adaptation with feature transformation learning [2, 31] and as a regularizer for cross-domain classifier learning [13, 28, 38]. In contrast to these works, where the distributions are approximated with MMD using multiple nonlinear kernels, we use MMD with the linear kernel, the only one that allows us to keep a closed-form solution for \(\mathbf{W}\).

The regularization term for K corrupted versions of \(\mathbf{X}\) is given by:

$$\begin{aligned} \mathcal {R}_{m}=\frac{1}{K} \sum _{k=1}^K Tr (\mathbf{W}^\top \tilde{\mathbf{X}}_k^\top \mathbf{N}\tilde{\mathbf{X}}_k \mathbf{W}), \quad \mathrm{where} \quad \mathbf{N}=\begin{bmatrix} \frac{1}{N^2_s} \mathbf{1}^{s,s} & -\frac{1}{N_s N_t} \mathbf{1}^{s,t} \\ -\frac{1}{N_s N_t} \mathbf{1}^{t,s} & \frac{1}{N^2_t} \mathbf{1}^{t,t} \end{bmatrix}, \end{aligned}$$

Here, \(\mathbf{1}^{a,b}\) is a constant matrix of size \(N_a\times N_b\) with all elements equal to 1, and \(N_s, N_t\) are the numbers of source and target examples. After the noise marginalization, we obtain \(\mathbb {E}[\mathcal {R}_{m}] =Tr(\mathbf{W}^\top \mathbf{M}\mathbf{W})\), where \(\mathbf{M}=\mathbb {E}[\tilde{\mathbf{X}}^\top \mathbf{N}\tilde{\mathbf{X}}]\) is computed similarly to \(\mathbf{Q}\) in (2), using \(\mathbf{S}_m=\mathbf{X}^\top \mathbf{N}\mathbf{X}\) instead of the covariance matrix \(\mathbf{S}\).
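A minimal NumPy sketch of this marginalized MMD term, under our own naming and assuming the first \(N_s\) rows of \(\mathbf{X}\) are the source examples and the remaining \(N_t\) rows the target ones:

```python
import numpy as np

def marginalized_mmd_matrix(X, Ns, Nt, p=0.1):
    """Sketch of M = E[X_tilde^T N X_tilde] for the linear-kernel MMD regularizer.

    The first Ns rows of X are source examples, the remaining Nt rows are target.
    """
    N = np.zeros((Ns + Nt, Ns + Nt))
    N[:Ns, :Ns] = 1.0 / Ns**2                 # source-source block
    N[Ns:, Ns:] = 1.0 / Nt**2                 # target-target block
    N[:Ns, Ns:] = -1.0 / (Ns * Nt)            # cross blocks, from the squared
    N[Ns:, :Ns] = -1.0 / (Ns * Nt)            # difference of the two domain means
    Sm = X.T @ N @ X                          # plays the role of S in Eq. (2)
    M = Sm * (1 - p) ** 2                     # same off-diagonal/diagonal rule as Q
    np.fill_diagonal(M, np.diag(Sm) * (1 - p))
    return M
```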

Regularization \(\mathcal {R}_d\) Based on Domain Prediction. It explicitly pushes the denoised source examples toward the target instances. The domain regularizer \(\mathcal {R}_{d}\), proposed in [8], is inspired by [18], where intermediate layers in a deep learning model are regularized using a domain prediction task. The main idea is to learn the denoising while pushing the source towards the target (or vice versa), hence allowing the source classifier to perform better on the target. The regularization term \(\mathcal {R}_d\) can be written as:

$$\begin{aligned} \mathcal {R}_{d}=\frac{1}{K} \sum _{k=1}^K \Vert \mathbf{Y}_\mathcal{T}- {\tilde{\mathbf{X}}_k} \mathbf{W}\mathbf{Z}_\mathcal{D}\Vert ^2, \end{aligned}$$
(3)

where \(\mathbf{Z}_\mathcal{D}\in \mathrm{I\!R}^d\) is a domain classifier trained on the uncorrupted data to distinguish the target from the source, and \(\mathbf{Y}_\mathcal{T}=\mathbf{1}^N\) is a vector containing only ones, as all denoised instances should look like the target. After the noise marginalization, the partial derivatives of the expectation of this term with respect to \(\mathbf{W}\) have the same form as those of \(\mathcal {R}_l\) given below, with \(\mathbf{X}\), \(\mathbf{Y}_\mathcal{T}\), \(\mathbf{Z}_\mathcal{D}\) and \(\mathbf{Q}\) in place of \(\mathbf{X}_l\), \(\mathbf{Y}_l\), \(\mathbf{Z}_l\) and \(\mathbf{Q}_l\).

Classification Regularization \(\mathcal {R}_l\). It encourages the denoised source data to remain well classified by the classifier pre-trained on the source data. The regularizer \(\mathcal {R}_{l}\) is similar to \(\mathcal {R}_{d}\), except that \(\mathbf{Z}_l\) is trained on the uncorrupted source \(\mathbf{X}^s\) and acts only on the labeled source data. Also, instead of \(\mathbf{Y}_\mathcal{T}\), the ground-truth source labels \(\mathbf{Y}_l=\mathbf{Y}^s\) are used. In the marginalized version of \(\mathcal {R}_l\), the partial derivatives with respect to \(\mathbf{W}\) can be written as

$$\begin{aligned} \frac{\partial {{\mathrm{\mathbb {E}}}}[\mathcal {R}_l]}{\partial \mathbf{W}} = -2 (1-p) \mathbf{X}_l^\top \mathbf{Y}_l \mathbf{Z}_l^\top + 2 \mathbf{Q}_l \mathbf{W}\mathbf{Z}_l \mathbf{Z}_l^\top , \end{aligned}$$

where \(\mathbf{X}_l=\mathbf{X}^s\) and \(\mathbf{Q}_l\) is computed similarly to \(\mathbf{Q}\), with \(\mathbf{S}_l=\mathbf{X}_l^{\top } \mathbf{X}_l\).
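As an illustration only, this marginalized gradient can be written in NumPy as follows, assuming \(\mathbf{Y}_l\) is one-hot encoded and \(\mathbf{Z}_l\) is the corresponding \(d\times C\) classifier matrix (the function and argument names are ours):

```python
import numpy as np

def grad_expected_Rl(W, Xs, Ys, Zl, p=0.1):
    """Sketch of dE[R_l]/dW for the classification regularizer.

    Xs : (Ns, d) uncorrupted source features, Ys : (Ns, C) one-hot source labels,
    Zl : (d, C) classifier pre-trained on the uncorrupted source.
    """
    Sl = Xs.T @ Xs
    Ql = Sl * (1 - p) ** 2                      # computed like Q in Eq. (2), with S_l
    np.fill_diagonal(Ql, np.diag(Sl) * (1 - p))
    return -2 * (1 - p) * (Xs.T @ Ys @ Zl.T) + 2 * Ql @ W @ Zl @ Zl.T
```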

Table 1. A summary of our models and corresponding notations.

2.2 Minimizing the Regularized Loss

We extend the noise marginalization framework to the optimization of the data reconstruction loss (1) and minimize the expected loss \(\mathbb {E}[\mathcal {L}+ \gamma _{\phi } \mathcal {R}_{\phi }]\), denoted \({{\mathrm{\mathbb {E}}}}[\mathcal {L}_{\phi }]\), where \(\phi \) refers to the m, d or l version of the regularization term \(\mathcal {R}_{\phi }\). From the marginalized terms presented in the previous sections, it is easy to show that when minimizing these regularized losses, the optimal solution for \(\mathbf{W}\) given by \(\partial {{\mathrm{\mathbb {E}}}}[\mathcal {L}_{\phi }]/\partial \mathbf{W}=\mathbf{0}\) reduces either to a linear matrix system \(\mathbf{A}\mathbf{W}=\mathbf{B}\), for which there exists a closed-form solution, or to a Sylvester linear matrix equation \(\mathbf{A}\mathbf{W}+\mathbf{W}\mathbf{B}=\mathbf{C}\) that can be solved efficiently using the Bartels-Stewart algorithm. Due to the limited space, we report all the details in the full version and summarize the baseline, the three extensions and the corresponding solutions in Table 1.
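As an example of how such a Sylvester system can be obtained and solved, consider the loss regularized with \(\mathcal {R}_d\). Setting its gradient to zero and left-multiplying by \(\mathbf{Q}^{-1}\) (one possible rearrangement, chosen here purely for illustration and not necessarily the one summarized in Table 1) gives a system of the form \(\mathbf{A}\mathbf{W}+\mathbf{W}\mathbf{B}=\mathbf{C}\) that SciPy's Bartels-Stewart solver handles directly:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_mrd(X, Yt, Zd, p=0.1, omega=0.01, gamma=1.0):
    """Sketch: minimize E[L + gamma * R_d] over W via a Sylvester equation.

    Zeroing the gradient gives (Q + omega*I) W + gamma * Q W (Zd Zd^T)
    = P + gamma*(1-p) X^T Yt Zd^T; left-multiplying by Q^{-1} (assumed
    invertible) yields the standard form A W + W B = C.
    """
    d = X.shape[1]
    S = X.T @ X
    Q = S * (1 - p) ** 2
    np.fill_diagonal(Q, np.diag(S) * (1 - p))
    P = (1 - p) * S
    Qinv = np.linalg.inv(Q)

    A = np.eye(d) + omega * Qinv                      # A = Q^{-1}(Q + omega*I)
    B = gamma * np.outer(Zd, Zd)                      # B = gamma * Zd Zd^T
    C = Qinv @ (P + gamma * (1 - p) * np.outer(X.T @ Yt, Zd))
    return solve_sylvester(A, B, C)                   # W such that A W + W B = C
```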

Similarly to stacked MDAs, we can stack several layers together using only forward learning, where the denoised features of one layer serve as the input to the next layer, with nonlinear functions such as the hyperbolic tangent or rectified linear units applied between the layers.
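A sketch of this forward stacking, reusing the mda_closed_form sketch above (any of the regularized variants could be substituted); applying the nonlinearity before concatenating the layer outputs with the input is our choice here:

```python
import numpy as np

def stack_denoising_layers(X, n_layers=3, p=0.1, omega=0.01):
    """Forward stacking: the denoised, tanh-transformed output of one layer
    is the input of the next; all layer outputs are concatenated with the
    original features."""
    outputs, H = [X], X
    for _ in range(n_layers):
        W = mda_closed_form(H, p=p, omega=omega)  # or a regularized variant
        H = np.tanh(H @ W)                        # denoise, then nonlinearity
        outputs.append(H)
    return np.hstack(outputs)
```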

3 Experimental Results

Datasets. We run experiments on the popular OFF31 [34] and OC10 [22] datasets, both with the full training protocol [21], where all source data is used for training, and with the sampling protocol [22, 34]. We evaluate our models both with the provided SURFBOV features and with the DECAF6 [12] features. In addition, we run experiments with the full training protocol on the Testbed Cross-Dataset [37] (TB) using both the provided SIFTBOV and the DECAF7 features.

Table 2. Single source domain adaptation with a single layer (\(r=1\)) and with 3 stacked layers (\(r=3\)). Bold indicates the best result per column; underline indicates the best single-layer result.

Parameter Setting. To compare the different models, we run all experiments with the same preprocessing and parameter values. Features are L2 normalized and their dimensionality is reduced to 200 with PCA (BOV features are in addition power normalized). The parameter values are \(\omega =0.01\), \(\gamma _{\phi }=1\) and \(p=0.1\). Between layers we apply hyperbolic tangent nonlinearities, and we concatenate the outputs of all layers with the original features (as in [5]).
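A sketch of this preprocessing; the exact order of the power normalization, L2 normalization, centering and PCA steps is our choice and may differ from the original implementation:

```python
import numpy as np

def preprocess(X, dim=200, power_norm=False):
    """L2 normalization and PCA reduction to `dim` dimensions
    (power normalization applied first for BOV features)."""
    if power_norm:
        X = np.sign(X) * np.sqrt(np.abs(X))
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    Xc = X - X.mean(axis=0)                       # center before PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T                        # keep the top `dim` components
```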

We evaluate how the optimal denoising matrix \(\mathbf{W}\) influences three different classification methods: a regularized multi-class ridge classifier trained on the source (\(\mathbf{Z}= (\mathbf{X}_l^\top \mathbf{X}_l + \delta \mathbf{I}_d)^{-1} \mathbf{X}_l^\top \mathbf{Y}_l\)), the nearest neighbor classifier (NN), and the Domain Specific Class Means (DSCM) classifier [10], where a target test example is assigned to a class based on a soft-max distance to the domain-specific class means. The last two classifiers are selected for their non-linearity. Moreover, NN is related to retrieval and DSCM to clustering, so the impact of \(\mathbf{W}\) on these two additional tasks is indirectly assessed as well.
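For instance, the ridge classifier above can be sketched as follows, with one-hot encoded labels and a regularization weight \(\delta\) whose value below is a placeholder of ours:

```python
import numpy as np

def ridge_classifier(Xl, Yl, delta=0.01):
    """Regularized multi-class ridge: Z = (X^T X + delta*I)^{-1} X^T Y.

    Xl : (N, d) (denoised) source features, Yl : (N, C) one-hot labels.
    Prediction for a test matrix Xt: np.argmax(Xt @ Z, axis=1).
    """
    d = Xl.shape[1]
    return np.linalg.solve(Xl.T @ Xl + delta * np.eye(d), Xl.T @ Yl)
```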

Table 2 shows the domain adaptation results with a single source and Table 3 shows the multi-source results, both under the full training protocol. For each dataset, we consider all possible source-target pairs as domain adaptation tasks. Hence we average over 9 tasks on OFF31 (3 domains: A, D, W), and over 12 tasks each on OC10 (4 domains: A, C, D, W) and TB (4 domains: B, C, I, S).

Table 3. Multi-source adaptation results without stacking. Bold indicates best result per column.

Table 2 shows the results on L2 normalized DECAF features. It compares the domain regularization extensions to the baselines (BL) obtained with the L2 normalized features (full) and with the PCA reduced features, as well as to the MDA. As the table shows, the best results are often obtained with MRl, except on OC10, where MRd performs better. On the other hand, the \(\mathcal {R}_m\) regularizer (MRm) does not improve the M1 performance. Stacking several layers can further improve the results. When comparing these results to the literature, we see that on OC10 we perform comparably to DAM [14] (84 %) and DDC [38] (84.6 %) but worse than more complex methods such as JDA [29] (87.5 %), TTM [16] (87.5 %) or DAN [28] (87.3 %). On OFF31, the deep adaptation method DAN [28] (72.9 %) significantly outperforms our results. On the TB dataset, in order to compare our results on DECAF6 to CORAL+SVM [35] (40.2 %), we average over six source-target pairs (excluding the domain B) and obtain 43.6 % with MRd+DSCM and 43.1 % with MRl+DSCM. We also outperform CORAL+SVM [35] (64 %) with our MRd+Ridge (65.2 %) when using the sampling protocol on OFF31.

Concerning the BOV features, the best results (using 3 layers) with the full training protocol are obtained on OFF31 with MRl+NN (29.7 %) and on OC10 with MRd+Ridge (48.2 %). The latter is comparable to CORAL+SVM [35] (48.8 %), but below LSSA [1] (52.3 %), which first selects landmarks before learning the transformation. The landmark selection is complementary to our approach and could boost our results as well.

In Table 3, we report the averaged results for the multi-source cases, obtained with BOV features under the full training protocol. For each dataset, all configurations with at least 2 source domains are considered. This yields 6 such configurations for OFF31 and 16 each for OC10 and TB. The results clearly indicate that taking the domain regularization into account improves the performance.

4 Conclusion

In this paper we extended the marginalized denoising autoencoder (MDA) framework with a domain regularization to enforce domain invariance. We studied three versions of the regularization, based on the maximum mean discrepancy measure, on domain prediction, and on the class predictions on the source. We showed that in all these cases the noise marginalization reduces to a closed-form solution or to a Sylvester linear matrix equation, for which efficient and scalable solvers exist. This furthermore makes it easy to stack several layers at low cost. We studied the effect of these domain regularizations and ran single-source and multi-source experiments on three benchmark datasets, showing that adding the new regularization terms allows us to outperform the baselines. Compared to the state of the art, our method performs better than classical feature transformation methods but is outperformed by more complex deep domain adaptation methods. Compared to the latter, the main advantage of the proposed approach, beyond its low computational cost, is that, as we learn an unsupervised feature transformation, we can also boost the performance of other tasks such as retrieval or clustering in the target space.