1 Introduction

In recent years, deep learning has been exceptionally successful in supervised object recognition tasks [1, 2]. Despite its effectiveness in the supervised regime, object recognition in the unsupervised regime remains an open problem because the lack of labels complicates training. Off-the-shelf networks pre-trained on one domain do not work well when transferred to a novel but related domain, a problem known as domain shift [3]. To mitigate domain shift among datasets, numerous Unsupervised Domain Adaptation (UDA) methods [4,5,6,7,8,9,10] have been proposed, which leverage unlabeled target data together with labeled source data to learn a predictor for the target samples.

UDA methods can be roughly divided into two broad categories. The first includes Generative Adversarial Network (GAN) based methods [10,11,12] that learn a cross-domain mapping to generate target-like source images, which are then used to train a target classifier. The second category aims to reduce the discrepancy between source and target domains by leveraging first order statistics [13, 14] or second order statistics [15, 25]. Some methods from this category align feature distributions by directly embedding batch normalization (BN) based [9, 16, 17] domain alignment (DA) layers into the network.

Fig. 1. Visualization of 2D features with different normalization transformations: (a) Feature standardisation; and (b) Feature whitening.

BN based methods align feature distributions by setting the variance of each feature to 1 and its mean to 0, but they leave the feature correlations intact (see Fig. 1a), leading to sub-optimal alignment. We argue instead that, to completely eliminate the discrepancy between domains, the source and target features should have the same covariance matrix. This can be ensured by projecting the feature distributions onto a canonical unit hyper-sphere through full-feature whitening (see Fig. 1b), such that both source and target domain features have an identity covariance matrix. Although Roy et al. [8] proposed to align feature distributions with domain-specific grouped-feature whitening (DWT), their approach suffers from imperfect alignment due to partial feature whitening (see Sect. 2).

To overcome the drawbacks of previous DA layers, we propose to first whiten the feature representations and then apply colouring. The whitening operation is domain specific, while the colouring operation is domain agnostic and re-projects the whitened features to a distribution having an arbitrary covariance matrix. Inspired by [18], we realize these transformations through Full-Feature Whitening and Colouring Transform (\(\mathbf {{F}^{2}{WCT}}\)) blocks embedded inside the network, replacing the BN-based and DWT-based DA layers. Unlike [18], which uses these operations for conditional image generation, we propose this technique for UDA. We also extend it to the multi-source unsupervised DA (MSDA) setting, where multiple source domains are available during training. Finally, we evaluate our proposed method on the digits datasets in both the single source UDA and MSDA settings and set new state-of-the-art results.

Fig. 2. Covariance matrices of features undergoing different normalization transformations: (a) BN [9]; (b) DWT [8]; and (c) Full-Feature Whitening. Black pixels denote value 1, white pixels denote value 0 and gray denotes intermediate values.

2 Related Works

Single Source UDA. Several UDA methods proposed in recent years operate under the assumption that there is only a single source domain. Many of them use GANs [10,11,12] to learn a mapping between the source and target domains in order to generate synthetic data in the target domain. SBADA-GAN [10] and CyCADA [11] are trained with adversarial and cycle-consistent losses to generate labeled target-like source samples, which are used for training a classifier for the target domain. Although very effective, GAN based methods require a large amount of data from each domain to capture the inherent data distributions.

Fig. 3. Visualization of 2D features after whitening and different feature re-projection techniques: (a) whitening with scale and shift as in DWT [8]; and (b) proposed whitening with colouring for aligning feature distributions.

Another family of UDA methods aims to reduce the discrepancy between source and target domains by leveraging first and second order statistics. Maximum Mean Discrepancy based methods [13, 14] reduce the discrepancy between domains by minimizing the difference between the means (i.e., first order statistics) of their respective feature representations. Correlation alignment methods [5, 15, 25] leverage second order statistics by minimizing a loss derived from the covariance matrices of source and target feature representations. Carlucci et al. [9] and Roy et al. [8] showed that the discrepancy between domains can be reduced efficiently by directly embedding BN-based and DWT-based DA layers into the network, respectively. Although effective, BN-based and DWT-based DA layers produce features that remain correlated and are therefore imperfectly aligned. As can be observed from the covariance matrices in Fig. 2a and b, the variances of the features are 1, but the features are still correlated, as shown by the non-zero off-diagonal elements. Ideally, we would like an identity covariance matrix, \(\varSigma = I\) (see Fig. 2c), to achieve complete alignment of features. This is achieved with our proposed \(\text {F}^{2}\text {WCT}\). Moreover, DWT [8] re-projects partially whitened features with the scale and shift transforms of [9], which is sub-optimal because it reduces the capacity of the network [18]. Hence, we propose to re-project whitened features with a colouring operation, as shown in Fig. 3b. Unlike the scale and shift operation of DWT (see Fig. 3a), which only allows axis-aligned re-projection of features, our colouring operation can re-project the whitened features to an arbitrary orientation, which the network is free to choose through training.

Multi Source UDA. In practical scenarios source data can possess different underlying marginal distributions and therefore multiple domain shifts need to be addressed coherently while adapting to the target domain. MSDA was first addressed in [19] which showed the necessity to borrow knowledge from nearest source domains to avoid negative transfer. Xu et al. [20] adapted to the distribution-weighted combining rule in [21] with an adversarial framework. More recently, Peng et al. [22] proposed a Moment Matching Network for reducing domain shift from multiple sources to the target domain. Departing from the above methods, we easily extend \(\text {F}^{2}\text {WCT}\) to simultaneously align feature distributions of multiple source and target domains to a reference distribution.

3 Method

In this section we present our proposed method for UDA and MSDA. Specifically, first we will discuss some preliminaries and then introduce the proposed \(\text {F}^{2}\text {WCT}\).

3.1 Preliminaries

Let \(\mathcal{S} = \{ (I_j^s, y_j^s) \}_{j=1}^{N_s}\) be the labeled source dataset, where \(I_j^s\) is the \(j^{th}\) source image and \(y_j^s \in \mathcal{Y} = \{1, 2 \dots , C \}\) is its associated label. Also, let \(\mathcal{T} = \{ I_i^t \}_{i=1}^{N_t}\) be the unlabeled target dataset, where \(I_i^t\) is the \(i^{th}\) target image without any associated label. The aim of UDA is to train a target domain predictor by jointly utilizing samples from \(\mathcal{S}\) and \(\mathcal{T}\).

A fairly common technique to bridge domain shift is to use DA layers, which can either be BN-based [9] or DWT-based [8], that project source and target feature distributions onto a canonical distribution through per feature standardisation and grouped feature whitening, respectively. As mentioned in Sect. 1, we propose to replace these feature alignment techniques with domain specific full-feature whitening and domain agnostic colouring. Before introducing the proposed \(\text {F}^{2}\text {WCT}\) we will briefly recap BN [23] below.

A BN layer takes as input a mini-batch \(B = \{ \mathbf {x}_1, \ldots , \mathbf {x}_m \}\) of m samples, where \(\mathbf {x}_i\) is the \(i^{th}\) element in the batch B and \(\mathbf {x}_i \in \mathbb {R}^d\). As the name suggests, given a batch B the BN layer transforms each \(\mathbf {x}_i \in B\) in the following way:

$$\begin{aligned} \textit{BN}(x_{i,k}) = \gamma _k \frac{x_{i,k} - \mu _{B,k}}{\sqrt{\sigma _{B,k}^2 + \epsilon }} + \beta _k, \end{aligned}$$
(1)

where k (\(1 \le k \le d\)) signifies the k-th dimension of input data, \(\mu _{B,k}\) and \(\sigma _{B,k}\) are, respectively, the mean and the standard deviation corresponding to the k-th dimension of the samples in B and \(\epsilon \) is used to prevent division by zero. Finally, \(\gamma _k\) and \( \beta _k\) are learnable scaling and shifting parameters. In essence, BN transforms a batch of features into having zero mean and unit variance and then re-projects the features with \(\mathbf {\gamma }\) and \(\mathbf {\beta }\).
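
The per-dimension nature of Eq. (1) can be illustrated with a minimal pure-Python sketch. The toy data, the helper names `batch_norm` and `covariance`, and the choice of \(\gamma _k = 1\), \(\beta _k = 0\) are assumptions made for illustration only. The sketch suggests that standardisation fixes each variance to roughly 1 while leaving the cross-covariance between dimensions far from 0.

```python
import math
import random


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply Eq. (1) independently to each of the d feature dimensions."""
    m, d = len(batch), len(batch[0])
    mu = [sum(x[k] for x in batch) / m for k in range(d)]
    var = [sum((x[k] - mu[k]) ** 2 for x in batch) / m for k in range(d)]
    return [
        [gamma[k] * (x[k] - mu[k]) / math.sqrt(var[k] + eps) + beta[k]
         for k in range(d)]
        for x in batch
    ]


def covariance(batch, a, b):
    """Biased covariance between feature dimensions a and b."""
    m = len(batch)
    mu_a = sum(x[a] for x in batch) / m
    mu_b = sum(x[b] for x in batch) / m
    return sum((x[a] - mu_a) * (x[b] - mu_b) for x in batch) / m


random.seed(0)
# Toy batch of strongly correlated 2D features (illustrative data only).
batch = []
for _ in range(512):
    z = random.gauss(0.0, 1.0)
    batch.append([3.0 * z + 1.0, 2.0 * z + random.gauss(0.0, 0.5) - 4.0])

out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print("variances:", covariance(out, 0, 0), covariance(out, 1, 1))
print("cross-covariance:", covariance(out, 0, 1))
```

Because each dimension is normalised on its own, the off-diagonal entry of the covariance matrix is untouched up to rescaling, which is the behaviour shown in Fig. 2a.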

In Sect. 3.2 we present our proposed \(\text {F}^{2}\text {WCT}\) for UDA, while in Sect. 3.3 we extend the proposed \(\text {F}^{2}\text {WCT}\) for MSDA.

3.2 Full-Feature Whitening and Colouring Transform for UDA

As stated in Sect. 2, BN-based per-dimension feature standardization and DWT-based grouped feature whitening are sub-optimal for aligning the marginal source and target distributions because the resulting features remain correlated. To alleviate domain shift, we propose to replace BN and DWT with \(\text {F}^{2}\text {WCT}\), which is derived from [18] and defined as follows:

$$\begin{aligned} \textit{F}^{\textit{2}}{} \textit{WCT}(\varvec{\mathrm {x}}_{i}; \varOmega )&= \varvec{\mathrm {\Gamma }} \hat{\varvec{\mathrm {x}}}_{i} + \varvec{\mathrm {\beta }}, \end{aligned}$$
(2)
$$\begin{aligned} \hat{\mathbf {x}}_i&= W_B (\mathbf {x}_i - \varvec{\mu }_B). \end{aligned}$$
(3)

In Eq. (3), \(\varvec{\mu }_B\) is the mean of B, while \(W_B\) is the whitening matrix such that \(W_B^\top W_B = \varSigma _B^{-1}\), where \(\varSigma _B\) is the covariance matrix computed from B. \(\varOmega = (\varvec{\mu }_B, \varSigma _B)\) denotes the batch-specific first and second-order statistics. Equation (3) whitens each \(\mathbf {x}_i \in B\), and the resulting elements of \(\hat{B} = \{ \hat{\mathbf {x}}_1, \ldots , \hat{\mathbf {x}}_m \}\) follow a hyper-spherical distribution, i.e., one with a covariance matrix equal to the identity matrix (see Fig. 2c). Additionally, and differently from [8], in Eq. (2) the colouring operation uses a learnable d-dimensional vector \(\varvec{\beta }\) and a learnable \(d \times d\) matrix \(\varvec{\mathrm {\Gamma }}\) to project the whitened \(\hat{B}\) back to a multivariate Gaussian distribution having an arbitrary covariance matrix. Implementation-wise, Eq. (2) can be realized with a convolutional layer having kernel size 1 \(\times \) 1.
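
Equations (2) and (3) can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitening matrix is taken as \(W_B = L^{-1}\), where \(L\) is the Cholesky factor of \(\varSigma _B + \epsilon I\) (one valid choice satisfying \(W_B^\top W_B \approx \varSigma _B^{-1}\); the text defers the actual computation to [18]), and the function names, the regulariser `eps`, the toy data, and the fixed values standing in for the learnable \(\varvec{\mathrm {\Gamma }}\) and \(\varvec{\beta }\) are all hypothetical.

```python
import math
import random


def mean_and_cov(batch):
    """Return the mean vector and biased covariance matrix of a batch."""
    m, d = len(batch), len(batch[0])
    mu = [sum(x[k] for x in batch) / m for k in range(d)]
    cov = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in batch) / m
            for b in range(d)] for a in range(d)]
    return mu, cov


def cholesky(sigma, eps=1e-5):
    """Lower-triangular L with L L^T = sigma + eps * I."""
    d = len(sigma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] + eps - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def whiten(batch, eps=1e-5):
    """Eq. (3) with the assumed choice W_B = L^{-1}."""
    mu, cov = mean_and_cov(batch)
    L = cholesky(cov, eps)
    out = []
    for x in batch:
        y = []
        for i in range(len(x)):  # forward substitution solves L y = x - mu
            s = sum(L[i][k] * y[k] for k in range(i))
            y.append((x[i] - mu[i] - s) / L[i][i])
        out.append(y)
    return out


def colour(batch_hat, gamma, beta):
    """Eq. (2): Gamma x_hat + beta with a full d x d matrix Gamma."""
    return [[sum(g * v for g, v in zip(row, x)) + b
             for row, b in zip(gamma, beta)] for x in batch_hat]


random.seed(0)
# Toy batch of correlated 2D features (illustrative data, not from the paper).
batch = []
for _ in range(512):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    batch.append([2.0 * z1 + 5.0, 1.5 * z1 + 0.5 * z2 - 3.0])

gamma = [[2.0, 0.0], [1.0, 1.0]]  # stands in for the learnable d x d matrix
beta = [0.5, -0.5]                # stands in for the learnable d-dim vector

whitened = whiten(batch)
coloured = colour(whitened, gamma, beta)
print("covariance after whitening:", mean_and_cov(whitened)[1])
print("covariance after colouring:", mean_and_cov(coloured)[1])
```

In this sketch the whitened batch has a covariance close to the identity, and the coloured batch has a covariance close to \(\varvec{\mathrm {\Gamma }} \varvec{\mathrm {\Gamma }}^\top \), so a non-diagonal \(\varvec{\mathrm {\Gamma }}\) can reintroduce correlations that a per-dimension scale and shift cannot.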

Our network, at any intermediate layer, takes as input two batches of samples, \(B^s = \{ \mathbf {x}_1^s, \ldots , \mathbf {x}_m^s \}\) and \(B^t = \{ \mathbf {x}_1^{t}, \ldots , \mathbf {x}_m^{t} \}\), from the source and target domain, respectively. Every \(\mathbf {x}^{s}_{i} \in B^s\) and \(\mathbf {x}^t_{i} \in B^t\) is transformed through the \(\text {F}^{2}\text {WCT}\) block, where the whitening operation is domain specific but the colouring operation is domain agnostic. In detail, using Eqs. (2)–(3), the outputs of the \(\text {F}^{2}\text {WCT}\) blocks for the source and target samples are given respectively by:

$$\begin{aligned} \textit{F}^{\textit{2}}{} \textit{WCT}(\varvec{\mathrm {x}}^{s}_{i}; \varOmega ^s)&= \varvec{\mathrm {\Gamma }} W_{B^s} (\mathbf {x}^{s}_{i} - \varvec{\mu }_{B^s}) + \varvec{\mathrm {\beta }},\end{aligned}$$
(4)
$$\begin{aligned} \textit{F}^{\textit{2}}{} \textit{WCT}(\varvec{\mathrm {x}}^{t}_{i}; \varOmega ^t)&= \varvec{\mathrm {\Gamma }} W_{B^t} (\mathbf {x}^{t}_{i} - \varvec{\mu }_{B^t}) + \varvec{\mathrm {\beta }}. \end{aligned}$$
(5)

Separate statistics, \(\varOmega ^s = (\varvec{\mu }_{B^s}, \varSigma _{B^s})\) and \(\varOmega ^t = (\varvec{\mu }_{B^t}, \varSigma _{B^t})\), are estimated for \(B^s\) and \(B^t\). They are used to whiten the corresponding activations, after which the colouring operation maps the spherical distribution to an arbitrary one (see Fig. 3b). Details about the computation of \(W_B\) can be found in [18]. In addition, the \(\text {F}^{2}\text {WCT}\) blocks maintain a moving average \(\varOmega ^t_{avg}\) of the target domain statistics, which is used during inference.
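
The moving average of the target statistics can be sketched as follows. The text does not specify the form of the average, so this sketch assumes an exponential moving average with a hypothetical `momentum` of 0.1, in the style commonly used for BN running statistics; the class name `RunningStats` and the toy batch are likewise illustrative.

```python
def batch_stats(batch):
    """Mean vector and biased covariance matrix of one mini-batch."""
    m, d = len(batch), len(batch[0])
    mu = [sum(x[k] for x in batch) / m for k in range(d)]
    sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in batch) / m
              for b in range(d)] for a in range(d)]
    return mu, sigma


class RunningStats:
    """Exponential moving average of (mu, Sigma) for one domain.

    The momentum value is an assumed hyper-parameter, not taken from the paper.
    """

    def __init__(self, d, momentum=0.1):
        self.momentum = momentum
        self.mu = [0.0] * d
        # Start from the identity so that early whitening is a no-op.
        self.sigma = [[1.0 if a == b else 0.0 for b in range(d)]
                      for a in range(d)]

    def update(self, batch):
        """Blend the current batch statistics into the running estimate."""
        mu_b, sigma_b = batch_stats(batch)
        a = self.momentum
        self.mu = [(1 - a) * old + a * new
                   for old, new in zip(self.mu, mu_b)]
        self.sigma = [[(1 - a) * old + a * new
                       for old, new in zip(row_old, row_new)]
                      for row_old, row_new in zip(self.sigma, sigma_b)]


target_avg = RunningStats(d=2)
for _ in range(200):
    # The same toy target batch is reused so that convergence is easy to see.
    target_avg.update([[0.0, 0.0], [2.0, 2.0]])
print(target_avg.mu, target_avg.sigma)
```

At inference time, the stored mean and covariance would take the place of the batch statistics in Eq. (5), so that a prediction for one target sample does not depend on the other samples in its batch.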

3.3 Full-Feature Whitening and Colouring Transform for MSDA

In the MSDA scenario we have access to P labeled source datasets \(\{\mathcal{S}_{j} \}_{j=1}^P\), where \(\mathcal{S}_j = \{(I_i, y_i)\}_{i=1}^{N_j}\), and an unlabeled target dataset \(\mathcal{T} = \{I_i \}_{i=1}^{N_t}\). Since we address closed-set DA, all the datasets share the same categories, and each of them is associated with a domain \(\mathbf {D}_1^s, \ldots , \mathbf {D}_P^s, \mathbf {D}^t\), respectively. Our end goal is to learn a predictor for the target domain \(\mathbf {D}^t\) by exploiting the data in \(\{\mathcal{S}_{j} \}_{j=1}^P \cup \mathcal{T}\).

Unlike many UDA methods [10, 11], the proposed \(\text {F}^{2}\text {WCT}\) can be extended to the MSDA setting in a straightforward way by keeping dedicated whitening statistics in the \(\text {F}^{2}\text {WCT}\) blocks for every domain \(\mathbf {D}\), while the colouring parameters are shared among the P + 1 domains. In detail:

$$\begin{aligned} \textit{F}^{\textit{2}}{} \textit{WCT}(\varvec{\mathrm {x}}^{\mathbf {D}_1^s}_{i}; \varOmega ^{\mathbf {D}_1^s})&= \varvec{\mathrm {\Gamma }} W_{B^{\mathbf {D}_1^s}} (\mathbf {x}^{\mathbf {D}_1^s}_{i} - \varvec{\mu }_{B^{\mathbf {D}_1^s}}) + \varvec{\mathrm {\beta }},\end{aligned}$$
(6)
$$\begin{aligned}&\;\;\vdots \nonumber \\ \textit{F}^{\textit{2}}{} \textit{WCT}(\varvec{\mathrm {x}}^{\mathbf {D}_P^s}_{i}; \varOmega ^{\mathbf {D}_P^s})&= \varvec{\mathrm {\Gamma }} W_{B^{\mathbf {D}_P^s}} (\mathbf {x}^{\mathbf {D}_P^s}_{i} - \varvec{\mu }_{B^{\mathbf {D}_P^s}}) + \varvec{\mathrm {\beta }},\end{aligned}$$
(7)
$$\begin{aligned} \textit{F}^{\textit{2}}{} \textit{WCT}(\varvec{\mathrm {x}}^{\mathbf {D}^t}_{i}; \varOmega ^{\mathbf {D}^t})&= \varvec{\mathrm {\Gamma }} W_{B^{\mathbf {D}^t}} (\mathbf {x}^{\mathbf {D}^t}_{i} - \varvec{\mu }_{B^{\mathbf {D}^t}}) + \varvec{\mathrm {\beta }}. \end{aligned}$$
(8)

The whitening operation of \(\text {F}^{2}\text {WCT}\) projects the marginal feature distributions of all P + 1 domains onto a hyper-spherical reference distribution, thereby minimizing the multiple domain discrepancies in a coherent fashion. As in Sect. 3.2, the moving average of target statistics \(\varOmega ^{\mathbf {D}^t}_{avg}\) is maintained during training and is used during inference.
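
The reason a shared colouring yields a common output distribution can be made explicit with a short derivation. It is a sketch that assumes the batch covariance \(\varSigma _{B^{\mathbf {D}}}\) is invertible, ignores any numerical regulariser, and takes means and covariances empirically over the batch of an arbitrary domain \(\mathbf {D}\) among the P + 1 domains. Using \(W_B^\top W_B = \varSigma _B^{-1}\) from Sect. 3.2:

$$\begin{aligned} \hat{\mathbf {x}}&= W_{B^{\mathbf {D}}} (\mathbf {x} - \varvec{\mu }_{B^{\mathbf {D}}}), \\ \mathrm {Cov}(\hat{\mathbf {x}})&= W_{B^{\mathbf {D}}} \varSigma _{B^{\mathbf {D}}} W_{B^{\mathbf {D}}}^\top = W_{B^{\mathbf {D}}} (W_{B^{\mathbf {D}}}^\top W_{B^{\mathbf {D}}})^{-1} W_{B^{\mathbf {D}}}^\top = I, \\ \mathrm {Cov}(\varvec{\mathrm {\Gamma }} \hat{\mathbf {x}} + \varvec{\mathrm {\beta }})&= \varvec{\mathrm {\Gamma }} \, \mathrm {Cov}(\hat{\mathbf {x}}) \, \varvec{\mathrm {\Gamma }}^\top = \varvec{\mathrm {\Gamma }} \varvec{\mathrm {\Gamma }}^\top . \end{aligned}$$

Since \(\hat{\mathbf {x}}\) has zero mean, the block output has mean \(\varvec{\mathrm {\beta }}\) and covariance \(\varvec{\mathrm {\Gamma }} \varvec{\mathrm {\Gamma }}^\top \) for every domain, independent of \(\mathbf {D}\).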

3.4 Training

Let \(B^s = \{ \mathbf {x}_1^s, \ldots , \mathbf {x}_m^s \}\) and \(B^t = \{ \mathbf {x}_1^{t}, \ldots , \mathbf {x}_m^{t} \}\) be two batches of the network’s last-layer activations from the source and target domain, respectively. Since the source samples are associated with labels, the standard cross-entropy loss (\(L^s\)) can be used for \(B^s\):

$$\begin{aligned} L^s(B^s) = - \frac{1}{m} \sum _{i=1}^m \log p(y_i^s | \mathbf {x}_i^s), \end{aligned}$$
(9)

For the target samples, which have no labels, an entropy loss is computed as in [9] and acts as a regularizer. The entropy loss forces the network to be more confident in its predictions by producing a peaked probability distribution at the output:

$$\begin{aligned} L^t(B^t) = - \frac{1}{m} \sum _{i=1}^m p(\mathbf {x}_i^t) \log p(\mathbf {x}_i^t), \end{aligned}$$
(10)

Finally, the network is trained with a weighted sum of \(L^s\) and \(L^t\):

$$\begin{aligned} L(B^s, B^t) = L^s(B^s) + \lambda L^t(B^t) \end{aligned}$$
(11)
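
The combined objective in Eqs. (9)–(11) can be sketched in plain Python. Several points are assumptions made for illustration: the inputs are treated as unnormalised class scores passed through a softmax, labels are 0-based rather than the 1-based \(\mathcal{Y}\) of Sect. 3.1, Eq. (10) is read as the Shannon entropy summed over the C classes (an interpretation of its compact notation), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def source_loss(logits_batch, labels):
    """Eq. (9): mean negative log-probability of the true class."""
    losses = [-math.log(softmax(z)[y]) for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


def target_loss(logits_batch, eps=1e-12):
    """Eq. (10), read as the mean entropy of the predicted distributions."""
    total = 0.0
    for z in logits_batch:
        total -= sum(p * math.log(p + eps) for p in softmax(z))
    return total / len(logits_batch)


def total_loss(src_logits, src_labels, tgt_logits, lam=0.1):
    """Eq. (11); lam = 0.1 is the value reported in Sect. 4.2."""
    return source_loss(src_logits, src_labels) + lam * target_loss(tgt_logits)


# A confident target prediction is penalised far less than an uncertain one.
print("uncertain:", target_loss([[0.0, 0.0]]))
print("confident:", target_loss([[20.0, 0.0]]))
```

The entropy term is smallest when the predicted distribution is peaked, which is how it pushes target predictions toward confident outputs without using target labels.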

4 Experimental Results

In this section we describe the datasets and provide details about the experimental protocols adopted. We also report our experimental evaluation on the considered datasets and compare our proposed method with state-of-the-art methods in both UDA and MSDA.

4.1 Datasets

We conduct all our experiments on the Digits-Five dataset, which is built for digit recognition and consists of five distinct domains containing numerical digits between 0 and 9. It includes the USPS, MNIST, MNIST-M, SVHN and Synthetic numbers (SYN) datasets. SVHN contains images of real-world house numbers acquired from Google Street View. SYN includes about 500K computer generated digits with varying orientation, position, color, etc. USPS and MNIST are datasets of handwritten digits with different resolutions; the USPS digits were scanned from U.S. envelopes. Finally, MNIST-M is the colored counterpart of MNIST.

4.2 Experimental Setup

To ensure a fair comparison with other UDA and MSDA methods, we adopt the base networks from [8] and [22] for the UDA and MSDA experiments, respectively. We plug \(\text {F}^{2}\text {WCT}\) blocks in right after each of the first two convolutional layers. We reason that strong alignment of low level features (e.g., colour and texture) is very important to bridge the domain gap. Consequently, we act on the early convolutional layers of the network, which deal with low level features, by fully aligning intermediate feature distributions with \(\text {F}^{2}\text {WCT}\) blocks. A typical block in the network is given by (Conv Layer \(\rightarrow \) \(\text {F}^{2}\text {WCT}\) \(\rightarrow \) ReLU). For the remaining layers we use BN based DA layers as in [9].

We trained the networks with Adam for 150 epochs with an initial learning rate of 1e−3, which we dropped by a factor of 10 after 50 and 90 epochs. To ensure well-conditioned covariance matrices, we used mini-batch sizes of 128 and 512 for the UDA and MSDA settings, respectively. The source and target samples are drawn randomly such that each domain is well represented in a mini-batch. The value of \(\lambda \) in Eq. (11) is set to 0.1 as in [9].
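
The step schedule described above can be written as a small helper. This is a sketch under one assumption: epochs are counted from 0 and the reduced rate takes effect from epoch index 50 (and again from 90), which the text does not state explicitly. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(50, 90), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Print the rate at the start, around each milestone, and at the last epoch.
for epoch in (0, 49, 50, 89, 90, 149):
    print(epoch, learning_rate(epoch))
```

In a deep learning framework the same behaviour would normally come from a built-in multi-step scheduler rather than a hand-written function.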

4.3 Results and Discussion

In this section we analyze the impact of the proposed components on the final classification accuracy and compare \(\text {F}^{2}\text {WCT}\) with the state-of-the-art methods.

Ablation Study. We conduct ablation studies on the digits dataset for single source UDA to demonstrate the benefits of performing full-whitening followed by a colouring transformation. We consider the following models: (i) \(\text {F}^{2}\text {WCT}\), our full model, is composed of full-feature whitening and colouring; (ii) \(\text {F}^{2}\text {WT}\) where the colouring operation is replaced by scale-shift operation. This will validate the importance of colouring transform over scaling and shifting; and (iii) DWT [8] which considers grouped whitening. This comparison allows us to determine the necessity of full-feature whitening as opposed to grouped whitening.

Table 1. Ablation study of full-feature whitening and colouring transform versus relevant normalization techniques on Digits-Five. The target domain is shown in italics. The best numbers are highlighted in bold and the second best numbers are underlined.
Table 2. Classification accuracy (%) on the Digits-Five for single source UDA settings in comparison with the state-of-the-art methods. The target domain is shown in italics. The best numbers are highlighted in bold and the second best numbers are underlined.

As can be observed from Table 1, our proposed \(\text {F}^{2}\text {WCT}\) outperforms all other baselines. The comparison with \(\text {F}^{2}\text {WT}\) shows that the need for colouring is particularly evident in more complicated adaptation settings such as SVHN \(\rightarrow \) MNIST and MNIST \(\rightarrow \) MNIST-M, whereas in the simpler MNIST \(\leftrightarrow \) USPS settings the network already has enough capacity. DWT [8] is notably worse than \(\text {F}^{2}\text {WCT}\) in the MNIST \(\rightarrow \) MNIST-M setting because grouped feature whitening cannot align the source and target feature distributions optimally (see Sect. 2). Conversely, \(\text {F}^{2}\text {WCT}\) enables strong alignment of low level features through full whitening.

Comparison with State-of-the-Art Results. We compare our proposed \(\text {F}^{2}\text {WCT}\) with state-of-the-art methods, in both single source UDA and MSDA settings.

Table 3. Classification accuracy (%) on Digits-Five for multi-source domain adaptation settings. The target domain is shown in italics. Best number is in bold and second best is underlined.

Single-Source Unsupervised Domain Adaptation. In Table 2 we consider single-source adaptation settings where we adapt from a single source domain to a target domain. We consider four adaptation settings: MNIST \(\rightarrow \) USPS, USPS \(\rightarrow \) MNIST, SVHN \(\rightarrow \) MNIST and MNIST \(\rightarrow \) MNIST-M. The entire labeled train set of the source domain and the unlabeled train set of the target domain are used for training a network, whereas the dedicated test set of the target domain is used for evaluating performance. We consider the baselines reported in [8]. Note that we have chosen baselines that do not use data augmentation; the variant of SE [24] that does not use data augmentation is therefore reported for fair comparison with other methods. For some baselines we could not report all the numbers because they are not available for the corresponding adaptation settings.

From Table 2 we observe that, on average, our proposed \(\text {F}^{2}\text {WCT}\) outperforms all considered state-of-the-art methods by a considerable margin. Individually, \(\text {F}^{2}\text {WCT}\) has the best accuracy in the MNIST \(\leftrightarrow \) USPS settings and is the second best in the SVHN \(\rightarrow \) MNIST and MNIST \(\rightarrow \) MNIST-M settings. In particular, SBADA-GAN performs best in the MNIST \(\rightarrow \) MNIST-M setting due to the implicit data augmentation obtained by generating synthetic data. Surprisingly, \(\text {F}^{2}\text {WCT}\) overall performs on par with the target only setting without having access to any target label, demonstrating the effectiveness of our method.

Multi-source Unsupervised Domain Adaptation. In Table 3 we report results for the MSDA setting, where we adapt from multiple source domains to a single target domain. We consider all possible combinations of the 5 domains in Digits-Five for the experiments. For a fair comparison with the baseline methods, we follow the training protocol used in [22]. According to this protocol, we randomly sample 25000 training images from each domain and 9000 images for evaluation. For USPS, the entire train and test sets are used instead. We compare our method with DWT [8], AutoDIAL (automatic domain alignment layers) [9] and other baselines taken from [22]. We observe similar behaviour in the MSDA setting, where our proposed \(\text {F}^{2}\text {WCT}\) also outperforms all the baselines in average accuracy, thereby obtaining state-of-the-art results. Notably, for the adaptation setting where MNIST-M is the target domain, the proposed full-feature whitening and colouring provides a boost of 12.79% over the grouped whitening and scale-shifting of [8]. This validates our hypothesis that complete alignment of source and target feature distributions with full-feature whitening, followed by colouring of the whitened features, is more beneficial for tackling domain shift.

5 Conclusions

In this work we address UDA and MSDA by proposing domain alignment layers based on domain specific full-feature whitening and domain agnostic colouring with \(\text {F}^{2}\text {WCT}\) blocks. On the one hand, full-feature whitening of intermediate features allows optimal alignment of source and target feature distributions by guaranteeing the same covariance matrix for both source and target features. On the other hand, the colouring transform helps restore the capacity of the network. The proposed \(\text {F}^{2}\text {WCT}\) blocks can be easily incorporated in any standard CNN. Our experiments on the digits datasets show consistently improved performance over other state-of-the-art methods in both UDA and MSDA settings. As future work, we plan to adapt the proposed feature alignment technique to large scale benchmarks with deeper networks.