1 Introduction

Many machine learning algorithms assume that the training and test data are independent and identically distributed (i.i.d.). However, this assumption rarely holds in practice, as the data is likely to change over time and space. Even though state-of-the-art Deep Convolutional Neural Network features are invariant to low-level cues to some degree [15, 16, 19], Donahue et al. [3] showed that they are still susceptible to domain shift. Instead of collecting labeled data and training a new classifier for every possible scenario, unsupervised domain adaptation methods [4, 6, 17, 18, 20, 21] try to compensate for the degradation in performance by transferring knowledge from labeled source domains to unlabeled target domains. The recently proposed CORAL method [18] aligns the second-order statistics of the source and target distributions with a linear transformation. Despite being easy to implement, it works well for unsupervised domain adaptation. However, it relies on a linear transformation and is not end-to-end trainable: it needs to first extract features, apply the transformation, and then train an SVM classifier in a separate step.

In this work, we extend CORAL to incorporate it directly into deep networks by constructing a differentiable loss function that minimizes the difference between source and target correlations–the CORAL loss. Compared to CORAL, our proposed Deep CORAL approach learns a non-linear transformation that is more powerful and also works seamlessly with deep CNNs. We evaluate our method on standard benchmark datasets and show state-of-the-art performance.

2 Related Work

Previous techniques for unsupervised adaptation consisted of re-weighting the training point losses to more closely reflect those in the test distribution [9, 11], or finding a transformation in a lower-dimensional manifold that brings the source and target subspaces closer together [4, 6-8]. Re-weighting-based approaches often assume a restricted form of domain shift (selection bias) and are thus not applicable to more general scenarios. Geodesic methods [6, 7] bridge the source and target domains by projecting them onto points along a geodesic path [7], or by finding a closed-form linear map that transforms source points to the target domain [6]. Subspace alignment methods [4, 8] align the subspaces by computing the linear map that minimizes the Frobenius norm of the difference between the top n eigenvectors. In contrast, CORAL [18] minimizes domain shift by aligning the second-order statistics of the source and target distributions.

Adaptive deep neural networks have recently been explored for unsupervised adaptation. DLID [1] trains a joint source and target CNN architecture with two adaptation layers. DDC [23] applies a single linear kernel to one layer to minimize Maximum Mean Discrepancy (MMD) while DAN [13] minimizes MMD with multiple kernels applied to multiple layers. ReverseGrad [5] and DomainConfusion [22] add a binary classifier to explicitly confuse the two domains.

Our proposed Deep CORAL approach is similar to DDC, DAN, and ReverseGrad in the sense that a new loss (CORAL loss) is added to minimize the difference in learned feature covariances across domains, which is similar to minimizing MMD with a polynomial kernel. However, it is more powerful than DDC (which aligns sample means only), much simpler to optimize than DAN and ReverseGrad, and can be integrated into different layers or architectures seamlessly.

3 Deep CORAL

We address the unsupervised domain adaptation scenario where there are no labeled training data in the target domain, and propose to leverage both the deep features pre-trained on a large generic domain (e.g., ImageNet [2]) and the labeled source data. At the same time, we also want the final learned features to work well on the target domain. The first goal can be achieved by initializing the network parameters from the generic pre-trained network and fine-tuning it on the labeled source data. For the second goal, we propose to minimize the difference in second-order statistics between the source and target feature activations, which we call the CORAL loss. Figure 1 shows a sample Deep CORAL architecture using our proposed correlation alignment layer for deep domain adaptation. We refer to any deep network incorporating the CORAL loss for domain adaptation as Deep CORAL.

Fig. 1. Sample Deep CORAL architecture based on a CNN with a classifier layer. For generalization and simplicity, here we apply the CORAL loss to the fc8 layer of AlexNet [12]. Integrating it into other layers or network architectures is also possible.

3.1 CORAL Loss

We first describe the CORAL loss between two domains for a single feature layer. Suppose we are given source-domain training examples \(D_S=\{\mathbf {x}_i\}\), \(\mathbf {x}\in \mathbb {R}^d\), with labels \(L_S=\{y_i\}\), \(y_i \in \{1,\dots ,L\}\), and unlabeled target data \(D_T=\{\mathbf {u}_i\}\), \(\mathbf {u}\in \mathbb {R}^d\). Suppose the numbers of source and target data examples are \(n_{S}\) and \(n_{T}\), respectively. Here both \(\mathbf {x}\) and \(\mathbf {u}\) are the d-dimensional deep layer activations \(\phi (I)\) of input I that we are trying to learn. Let \(D_S^{ij}~(D_T^{ij})\) denote the j-th dimension of the i-th source (target) data example and \(C_{S}~(C_{T})\) denote the feature covariance matrices.

We define the CORAL loss as the distance between the second-order statistics (covariances) of the source and target features:

$$\begin{aligned} \mathcal {L}_{CORAL}= \frac{1}{4d^2}\,\Vert C_{S} - C_{T} \Vert ^2_F \end{aligned}$$
(1)

where \({\Vert \cdot \Vert }^2_F\) denotes the squared matrix Frobenius norm. The covariance matrices of the source and target data are given by:

$$\begin{aligned} C_{S}= \frac{1}{n_{S}-1}\left( D_S^{\top } D_S - \frac{1}{n_{S}}(\mathbf {1}^{\top }D_S)^{\top }(\mathbf {1}^{\top }D_S)\right) \end{aligned}$$
(2)
$$\begin{aligned} C_{T}= \frac{1}{n_{T}-1}\left( D_T^{\top } D_T - \frac{1}{n_{T}}(\mathbf {1}^{\top }D_T)^{\top }(\mathbf {1}^{\top }D_T)\right) \end{aligned}$$
(3)

where \(\mathbf 1 \) is a column vector with all elements equal to 1.
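
To make Eqs. (1)-(3) concrete, the following is a minimal Python (PyTorch-style) sketch of the CORAL loss. The function name `coral_loss` and the tensor shapes are our own choices, and the authors' implementation is in Caffe, so this should be read as an assumed re-implementation rather than the original code.

```python
import torch

def coral_loss(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """CORAL loss of Eq. (1); source is (n_S, d), target is (n_T, d) activations."""
    d = source.size(1)

    def covariance(x: torch.Tensor) -> torch.Tensor:
        # Eqs. (2)/(3): C = (D^T D - (1^T D)^T (1^T D) / n) / (n - 1)
        n = x.size(0)
        ones = torch.ones(1, n, dtype=x.dtype, device=x.device)  # the row vector 1^T
        col_sum = ones @ x                                        # 1^T D, shape (1, d)
        return (x.t() @ x - col_sum.t() @ col_sum / n) / (n - 1)

    c_s = covariance(source)
    c_t = covariance(target)
    # Eq. (1): squared Frobenius norm of the covariance difference, scaled by 1/(4 d^2)
    return ((c_s - c_t) ** 2).sum() / (4 * d * d)
```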

The gradient with respect to the input features can be calculated using the chain rule:

$$\begin{aligned} \frac{\partial \mathcal {L}_{CORAL}}{\partial D_S^{ij}}=\frac{1}{d^2(n_S-1)}\left( \left( D_S^{\top }-\frac{1}{n_{S}}(\mathbf {1}^{\top }D_S)^{\top }\mathbf {1}^{\top }\right) ^{\top }(C_{S} - C_{T})\right) ^{ij} \end{aligned}$$
(4)
$$\begin{aligned} \frac{\partial \mathcal {L}_{CORAL}}{\partial D_T^{ij}}=-\frac{1}{d^2(n_T-1)}\left( \left( D_T^{\top }-\frac{1}{n_{T}}(\mathbf {1}^{\top }D_T)^{\top }\mathbf {1}^{\top }\right) ^{\top }(C_{S} - C_{T})\right) ^{ij} \end{aligned}$$
(5)

We use batch covariances and the network parameters are shared between the two networks.
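
Since the loss in Eqs. (1)-(3) is differentiable, an autograd framework recovers the analytic gradients of Eqs. (4) and (5) automatically. The short check below is an assumed illustration that relies on the `coral_loss` sketch above; the batch size of 128 and feature dimension of 31 are only placeholders.

```python
import torch

# Placeholder batches of d = 31 dimensional activations (e.g. fc8), n_S = n_T = 128.
source_feats = torch.randn(128, 31, requires_grad=True)
target_feats = torch.randn(128, 31, requires_grad=True)

loss = coral_loss(source_feats, target_feats)  # coral_loss from the sketch above
loss.backward()
# source_feats.grad corresponds to Eq. (4); target_feats.grad corresponds to Eq. (5).
```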

3.2 End-to-end Domain Adaptation with CORAL Loss

We describe our method using a multi-class classification problem as the running example. As mentioned before, the final deep features need to be both discriminative enough to train a strong classifier and invariant to the difference between the source and target domains. Minimizing the classification loss alone is likely to lead to overfitting to the source domain, causing reduced performance on the target domain. On the other hand, minimizing the CORAL loss alone might lead to degenerate features: for example, the network could project all of the source and target data to a single point, making the CORAL loss trivially zero, yet no strong classifier can be constructed on such features. Joint training with both the classification loss and the CORAL loss is likely to learn features that work well on the target domain:

$$\begin{aligned} \mathcal {L}= \mathcal {L}_{CLASS} + \sum _{i=1}^{t}\lambda _{i}\,\mathcal {L}_{CORAL} \end{aligned}$$
(6)

where t denotes the number of CORAL loss layers in a deep network and \(\lambda _i\) is a weight that trades off adaptation against classification accuracy on the source domain. As we show below, the two losses counterbalance each other and reach an equilibrium at the end of training, where the final features are discriminative and generalize well to the target domain.
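
As an illustration of Eq. (6) with a single CORAL layer (t = 1), one joint training step might look like the sketch below. The `model` interface, the data tensors, and the returned values are assumptions on our part rather than the authors' code; because the same network processes both domains, its parameters are shared, and target labels are never used.

```python
import torch.nn.functional as F

def train_step(model, optimizer, source_x, source_y, target_x, coral_weight):
    """One joint update of Eq. (6) with a single CORAL loss layer (t = 1)."""
    optimizer.zero_grad()
    source_out = model(source_x)  # fc8 activations, also used as class scores
    target_out = model(target_x)  # same network, so parameters are shared; target labels unused

    class_loss = F.cross_entropy(source_out, source_y)
    coral = coral_loss(source_out, target_out)      # coral_loss from the sketch above
    (class_loss + coral_weight * coral).backward()  # Eq. (6)
    optimizer.step()
    return class_loss.item(), coral.item()
```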

4 Experiments

We evaluate our method on a standard domain adaptation benchmark: the Office dataset [17]. The Office dataset contains 31 object categories from an office environment in three image domains: Amazon, DSLR, and Webcam.

We follow the standard protocol of [3, 5, 6, 13, 23] and use all the labeled source data and all the target data without labels. Since there are 3 domains, we conduct experiments on all 6 shifts (5 runs per shift), taking one domain as the source and another as the target.

In this experiment, we apply the CORAL loss to the last classification layer as it is the most general case–most deep classifier architectures (e.g., convolutional neural networks, recurrent neural networks) contain a fully connected layer for classification. Applying the CORAL loss to other layers or other network architectures is also possible.

The dimension of the last fully connected layer (fc8) was set to the number of categories (31) and initialized with \(\mathcal {N}(0,0.005)\). The learning rate of fc8 was set to 10 times that of the other layers, as it is trained from scratch. We initialized the other layers with the parameters pre-trained on ImageNet [2] and kept the original layer-wise parameter settings. In the training phase, we set the batch size to 128, the base learning rate to \(10^{-3}\), the weight decay to \(5\times 10^{-4}\), and the momentum to 0.9. The weight of the CORAL loss (\(\lambda \)) is set in such a way that at the end of training the classification loss and the CORAL loss are roughly the same. This seems to be a reasonable choice, as we want a feature representation that is both discriminative and minimizes the distance between the source and target domains. We used Caffe [10] and the BVLC Reference CaffeNet for all of our experiments.
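
The reported settings could be reproduced roughly as in the sketch below. The attribute `model.fc8` and the use of PyTorch parameter groups are assumptions on our part; the original experiments were run with a Caffe solver.

```python
import torch

base_lr = 1e-3
fc8_params = list(model.fc8.parameters())          # newly initialized layer (assumed attribute)
other_params = [p for p in model.parameters()
                if all(p is not q for q in fc8_params)]

optimizer = torch.optim.SGD(
    [{"params": other_params, "lr": base_lr},
     {"params": fc8_params, "lr": 10 * base_lr}],   # 10x learning rate for fc8
    momentum=0.9,
    weight_decay=5e-4,
)
# The batch size of 128 is handled by the data loader; lambda is chosen so that the
# classification loss and the CORAL loss are roughly equal at the end of training.
```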

We compare to 7 recently published methods: CNN [12] (no adaptation), GFK [6], SA [4], TCA [14], CORAL [18], DDC [23], and DAN [13]. GFK, SA, and TCA are manifold-based methods that project the source and target distributions into a lower-dimensional manifold and are not end-to-end deep methods. DDC adds a domain confusion loss to AlexNet [12] and fine-tunes it on both the source and target domains. DAN is similar to DDC but utilizes a multi-kernel selection method for better mean embedding matching and adapts multiple layers. For direct comparison, DAN in this paper uses the hidden layer fc8. For GFK, SA, TCA, and CORAL, we use the fc7 feature fine-tuned on the source domain (FT7 in [18]), as it achieves better performance than generic pre-trained features, and train a linear SVM [4, 18]. For a fair comparison, we use accuracies reported by other authors under exactly the same setting, or conduct experiments using the source code provided by the authors.

From Table 1 we can see that Deep CORAL (D-CORAL) achieves better average performance than CORAL and the other 6 baseline methods. In 3 out of 6 shifts, it achieves the highest accuracy. For the other 3 shifts, the margin between D-CORAL and the best baseline method is very small (\(\leqslant \!\!0.7\)).

Table 1. Object recognition accuracies for all 6 domain shifts on the standard Office dataset with deep features, following the standard unsupervised adaptation protocol.
Fig. 2. Detailed analysis of shift A\(\rightarrow \)W for training w/ vs. w/o CORAL loss. (a) Training and test accuracies for training w/ vs. w/o CORAL loss. Adding the CORAL loss helps achieve much better performance on the target domain while maintaining strong classification accuracy on the source domain. (b) Classification loss and CORAL loss for training w/ CORAL loss. As the last fully connected layer is randomly initialized with \(\mathcal {N}(0,0.005)\), the CORAL loss is very small while the classification loss is very large at the beginning. After training for a few hundred iterations, the two losses are about the same. (c) CORAL distance for training w/o CORAL loss (setting the weight to 0). The distance grows much larger (\(\geqslant \!\!100\) times larger compared to training w/ CORAL loss).

To get a better understanding of Deep CORAL, we generate three plots for the domain shift A\(\rightarrow \)W. In Fig. 2(a) we show the training (source) and test (target) accuracies for training with vs. without the CORAL loss. We can clearly see that adding the CORAL loss helps achieve much better performance on the target domain while maintaining strong classification accuracy on the source domain.

In Fig. 2(b) we visualize both the classification loss and the CORAL loss for training w/ CORAL loss. As the last fully connected layer is randomly initialized with \(\mathcal {N}(0,0.005)\), in the beginning the CORAL loss is very small while the classification loss is very large. After training for a few hundred iterations, the two losses are about the same and reach an equilibrium. In Fig. 2(c) we show the CORAL distance between the domains for training w/o CORAL loss (setting the weight to 0). We can see that the distance grows much larger (\(\geqslant \!\!100\) times larger compared to training w/ CORAL loss). Comparing Fig. 2(b) and (c), we can see that even though the CORAL loss is not always decreasing during training, setting its weight to 0 makes the distance between the source and target domains much larger. This is reasonable, as fine-tuning without domain adaptation is likely to overfit the features to the source domain. Our CORAL loss constrains the distance between the source and target domains during the fine-tuning process and helps maintain an equilibrium where the final features work well on the target domain.

5 Conclusion

In this work, we extended CORAL, a simple yet effective unsupervised domain adaptation method, to perform end-to-end adaptation in deep neural networks. Experiments on standard benchmark datasets show state-of-the-art performance. Deep CORAL works seamlessly with deep networks and can be easily integrated into different layers or network architectures.