
1 Introduction

In computer vision, domain adaptation (DA) has become a very popular topic. It addresses the setting in which the same learning task must be solved across different domains [2, 20]. Domain adaptation is generally divided into two categories: unsupervised DA, in which the target domain data are completely unlabeled, and semi-supervised DA, in which a small number of instances in the target domain are labeled. We focus on the unsupervised scenario, which is especially challenging because the target domain provides no explicit information on how to optimize classifiers. The goal of unsupervised domain adaptation is to derive a classifier for the unlabeled target domain data by extracting the information that is invariant across the source and target domains.

Canonical correlation analysis (CCA) is often used for DA problems since it can obtain two projection matrices that maximize the correlation between two different domains [9]. The derived correlation subspace preserves the common features of both domains well.

In this work, we develop an efficient unsupervised domain adaptation algorithm based on CCA. Specifically, we first use CCA to derive the correlation subspace. To further exploit the target domain data, we train an SVM classifier on the source domain data and obtain pre-labels for the target domain. Considering that the label spaces of the source and target domains may be different or even disjoint, we introduce a class adaptation matrix to adapt them. Taking all of these factors into consideration, we design an objective function, and a refined classifier is finally obtained by iterative optimization.

The rest of the paper is organized as follows. Section 2 introduces related work on DA and CCA. Section 3 discusses the proposed unsupervised domain adaptation algorithm based on canonical correlation analysis in detail. Section 4 presents experimental results on a cross-domain action recognition dataset. The last section gives some concluding remarks.

2 Related Work

We now review some state-of-the-art domain adaptation methods, including recent work based on deep learning. Finally, we introduce the main idea of CCA.

2.1 Domain Adaptation Methods

Generally speaking, domain adaptation problems can be solved by instance-based and feature-based approaches.

The goal of instance-based approaches is to re-weight the source domain instances by making full use of the information in the target domain. For example, Dai et al. [3] proposed an algorithm based on AdaBoost, which iteratively reinforces useful samples to help train classifiers. Shi et al. [21] attempted to find a new representation for the source domain that reduces the negative effect of misleading samples. In [11], a heuristic algorithm was developed to remove misleading instances from the source domain. Li et al. [13] proposed a framework that iteratively learns a common space for both domains. Several methods [15, 16, 26,27,28] proposed by Wu et al. and Liu et al. can also solve the domain adaptation problem effectively.

The purpose of feature-based approaches is to discover common latent features. For instance, a method integrating subspaces on the Grassmann manifold was developed in [7] to learn a feature projection matrix for both domains. Zhang et al. [34] introduced a novel feature extraction algorithm that efficiently encodes the discriminative information from limited training data and the sample distribution information from unlimited test data. In [5], a projection aligning the subspaces of both domains was designed. The distributions of the feature space and the label space are considered in [8] to learn conditional transferable components. In [22,23,24], three subspace extraction methods were proposed, which provide new ways to find the common subspaces of both domains. The method in [19] projects both domains into a Reproducing Kernel Hilbert Space (RKHS) and then obtains transfer components based on the Maximum Mean Discrepancy (MMD). In [30], the independence between the learned features of samples and the domain features is maximized to reduce the discrepancy between domains.

The discrepancies between domains [32] can be reduced through deep networks, which learn feature representations that disentangle the factors of variation behind the data [1]. Recent works have demonstrated that deep neural networks are powerful for learning transferable features [6, 17, 18, 25]. Specifically, these methods embed DA modules into deep networks to improve performance; they mainly correct shifts in the marginal distributions, assuming the conditional distributions remain unchanged after marginal distribution adaptation. However, recent research also finds that the features extracted in higher layers depend on the specific dataset [33].

2.2 Canonical Correlation Analysis

We briefly review canonical correlation analysis (CCA) as follows.

Suppose that \(X^{s}=\{x_{1}^{s},\dots ,x_{n}^{s}\}\in \mathbb {R}^{d_{s}\times n}\) and \(X^{t}=\{x_{1}^{t},\dots ,x_{n}^{t}\}\in \mathbb {R}^{d_{t}\times n}\) are the source and target domain datasets respectively, where n denotes the number of samples. CCA obtains two projection vectors \(u^{s}\in \mathbb {R}^{d_{s}}\) and \(u^{t}\in \mathbb {R}^{d_{t}}\) that maximize the correlation coefficient \(\rho \):

$$\begin{aligned} \max \limits _{u^{s},u^{t}}\rho =\frac{{u^{s}}^{\top }\varSigma _{st}u^{t}}{\sqrt{{u^{s}}^{\top }\varSigma _{ss}u^{s}}\sqrt{{u^{t}}^{\top }\varSigma _{tt}u^{t}}}, \end{aligned}$$
(1)

where \(\varSigma _{st}= X^{s}{X^{t}}^{\top }\), \(\varSigma _{ss}= X^{s}{X^{s}}^{\top }\), \(\varSigma _{tt}= X^{t}{X^{t}}^{\top }\), and \(\rho \in \left[ 0,1 \right] \). According to [9], (1) can be regarded as a generalized eigenvalue decomposition problem:

$$\begin{aligned} \varSigma _{st}\varSigma _{tt}^{-1}\varSigma _{st}^{\top }u^{s}=\eta \varSigma _{ss}u^{s} \end{aligned}$$
(2)

Then, \(u^{t}\) can be calculated as \(\varSigma _{tt}^{-1}\varSigma _{st}^{\top }u^{s}/\eta \) once \(u^{s}\) is obtained. To avoid overfitting and singularity problems, two regularization terms \(\lambda _{s}I\) and \(\lambda _{t}I\) are added to \(\varSigma _{ss}\) and \(\varSigma _{tt}\) respectively. We have

$$\begin{aligned} \varSigma _{st}\left( \varSigma _{tt}+\lambda _{t}I\right) ^{-1}\varSigma _{st}^{\top }u^{s}=\eta \left( \varSigma _{ss}+\lambda _{s}I\right) u^{s} \end{aligned}$$
(3)

Generally speaking, we can obtain more than one pair of projection vectors \(\left\{ u_{i}^{s} \right\} _{i=1}^{L}\) and \(\left\{ u_{i}^{t} \right\} _{i=1}^{L}\), where L denotes the dimension of the CCA subspace. CCA thus determines projection matrices \(P_{s}=[u_{1}^{s},\dots ,u_{L}^{s}]\in \mathbb {R}^{d_{s}\times L}\) and \(P_{t}=[u_{1}^{t},\dots ,u_{L}^{t}]\in \mathbb {R}^{d_{t}\times L}\), which project the source and target domain data (\(X^{s}\) and \(X^{t}\)) onto the correlation subspace. Once the correlation subspace spanned by \(\left\{ u_{i}^{s,t} \right\} _{i=1}^{L}\) is derived, the target domain data can be recognized by the model trained on the source domain data.
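To make this concrete, the following is a minimal NumPy/SciPy sketch of regularized CCA following Eqs. (1)-(3); the function name, the default regularization values and the use of scipy.linalg.eigh to solve the generalized eigenproblem are assumptions of this sketch rather than details taken from [9].

```python
import numpy as np
from scipy.linalg import eigh

def cca_projections(Xs, Xt, L, lam_s=1e-3, lam_t=1e-3):
    """Xs: (d_s, n) source data, Xt: (d_t, n) target data (one sample per column).
    Returns Ps (d_s, L) and Pt (d_t, L), the CCA projection matrices."""
    Sst = Xs @ Xt.T                                   # Sigma_st
    Sss = Xs @ Xs.T + lam_s * np.eye(Xs.shape[0])     # Sigma_ss + lambda_s I
    Stt = Xt @ Xt.T + lam_t * np.eye(Xt.shape[0])     # Sigma_tt + lambda_t I
    # Generalized eigenvalue problem of Eq. (3)
    A = Sst @ np.linalg.solve(Stt, Sst.T)
    eta, U = eigh(A, Sss)                             # eigenvalues in ascending order
    idx = np.argsort(eta)[::-1][:L]                   # keep the L leading directions
    Ps, eta = U[:, idx], eta[idx]
    # Target-side directions: u_t = (Sigma_tt + lambda_t I)^{-1} Sigma_st^T u_s / eta
    Pt = np.linalg.solve(Stt, Sst.T @ Ps) / np.maximum(eta, 1e-12)
    return Ps, Pt
```

The source and target data can then be projected onto the correlation subspace as Xs.T @ Ps and Xt.T @ Pt, which corresponds to Eqs. (4)-(5) in Sect. 3.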

3 Our Method

Our approach consists of four main steps. First, we use CCA to find the projection matrices of the source and target domains and project the data of both domains onto the correlation subspace. Second, we train an SVM classifier to obtain the pre-label matrix of the target domain data. Third, we apply a sigmoid function to the projected data on the correlation subspace. Finally, by minimizing the norm of the classification errors, we obtain a class adaptation matrix and a classification matrix simultaneously.

3.1 The Correlation Subspace

We denote \(X_{S}=(x_{1},x_{2},\dots ,x_{N_{S}})^{\top },x_{i}\in \mathbb {R}^{d}\) as the source domain data and \(X_{Tu}=(x_{1},x_{2},\dots ,x_{N_{Tu}})^{\top },x_{i}\in \mathbb {R}^{d}\) as the target domain data. We then use the CCA procedure described above to find the projection matrices \(P_{S} \in \mathbb {R}^{d\times L}\) and \(P_{Tu} \in \mathbb {R}^{d\times L}\) for the labeled source domain and the unlabeled target domain data respectively, where L denotes the dimension of the correlation subspace. Moreover, we denote \(X_{S}^{P}\in \mathbb {R}^{N_{S}\times L}\) and \(X_{Tu}^{P}\in \mathbb {R}^{N_{Tu}\times L}\) as the data matrices of the source and target domains projected onto the correlation subspace. Then, we have

$$\begin{aligned} X_{S}^{P} = X_{S}P_{S} \end{aligned}$$
(4)
$$\begin{aligned} X_{Tu}^{P} = X_{Tu}P_{Tu} \end{aligned}$$
(5)

3.2 The Pre-label of Target Domain

Let \(Y_{S}=(y_{1},y_{2},\dots ,y_{N_{S}})^{T}\in \mathbb {R}^{N_{S}\times c}\) be the label matrix of the source domain with c classes. In our algorithm, we obtain the pre-labels of the target domain by training an SVM classifier on the CCA correlation subspace, and we denote \(Y_{Tu}=(y_{1},y_{2},\dots ,y_{N_{Tu}})^{T}\in \mathbb {R}^{N_{Tu}\times c}\) as the resulting pre-label matrix.
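As an illustration, a minimal sketch of this pre-labeling step is given below, using scikit-learn's LinearSVC as a stand-in for the SVM classifier; the concrete classifier choice and the one-hot encoding of the predictions are assumptions of this sketch rather than details fixed by the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pre_label_target(Xs_proj, ys, Xtu_proj, c):
    """Xs_proj: (N_S, L) projected source data, ys: (N_S,) labels in {0, ..., c-1},
    Xtu_proj: (N_Tu, L) projected target data. Returns the pre-label matrix Y_Tu."""
    clf = LinearSVC().fit(Xs_proj, ys)    # SVM trained on the projected source domain
    y_pre = clf.predict(Xtu_proj)         # pre-labels of the target domain data
    return np.eye(c)[y_pre]               # one-hot pre-label matrix, shape (N_Tu, c)
```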

3.3 The Sigmoid Function

Furthermore, a sigmoid function \(G(\cdot )\) is introduced to process the data of both domains on the correlation subspace. The role of \(G(\cdot )\) is to perform a non-linear mapping, which further improves the generalization ability of our model. Specifically, we have

$$\begin{aligned} R_{S}= G(X_{S}^{P})=G(X_{S}P_{S}) \end{aligned}$$
(6)
$$\begin{aligned} R_{Tu}= G(X_{Tu}^{P})=G(X_{Tu}P_{Tu}) \end{aligned}$$
(7)
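A possible instantiation of \(G(\cdot )\), assuming the standard logistic sigmoid (the exact form is not fixed above), is sketched below.

```python
import numpy as np

def G(Z):
    """Element-wise logistic sigmoid used as the non-linear mapping."""
    return 1.0 / (1.0 + np.exp(-Z))

# R_S  = G(X_S  @ P_S)    # Eq. (6)
# R_Tu = G(X_Tu @ P_Tu)   # Eq. (7)
```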

3.4 The Classification Matrix and Class Adaptation Matrix

We first define a classification matrix \(\beta \in \mathbb {R}^{L\times c}\). It aims to classify the data of both domains into the correct classes as accurately as possible; that is, \(R_{S}\beta \) and \(R_{Tu}\beta \) should be close to \(Y_{S}\) and \(Y_{Tu}\) respectively. Specifically, we define the objective function as

$$\begin{aligned} \min _{\beta }F(\beta )=\left\| \beta \right\| _{q,p}+C_{S}\left\| R_{S}\beta -Y_{S} \right\| _{F}^{2}+ C_{Tu}\left\| R_{Tu}\beta -Y_{Tu} \right\| _{F}^{2} \end{aligned}$$
(8)

where \(\left\| \cdot \right\| _{q,p}\) and \(\left\| \cdot \right\| _{F}^{2}\) are the \(l_{q,p}\)-norm and the squared Frobenius norm respectively, and \(C_{S}\) and \(C_{Tu}\) are the penalty coefficients for the source and target domain data. Specifically, \(\left\| \beta \right\| _{q,p}\) can be written as

$$\begin{aligned} \left\| \beta \right\| _{q,p}=(\sum _{i=1}^{m}(\sum _{j=1}^{n}\left| \beta _{ij} \right| ^{q})^{p/q})^{1/p} \end{aligned}$$
(9)

\(q\ge 2\) and \(0\le p\le 2\) are set to impose sparsity on \(\beta \). Since it is difficult to solve the objective function when \(p=0\), we let \(p=1\). As the classification accuracy does not improve with larger q [10], we set \(q=2\). Finally, the objective function can be written as

$$\begin{aligned} \min _{\beta }F(\beta )=\left\| \beta \right\| _{2,1}+C_{S}\left\| R_{S}\beta -Y_{S} \right\| _{F}^{2}+ C_{Tu}\left\| R_{Tu}\beta -Y_{Tu} \right\| _{F}^{2} \end{aligned}$$
(10)
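For reference, the objective value of (10) can be computed as follows; this small helper is our own addition for monitoring the optimization and is not part of the original formulation.

```python
import numpy as np

def objective_value(beta, R_S, Y_S, R_Tu, Y_Tu, C_S, C_Tu):
    l21 = np.sum(np.linalg.norm(beta, axis=1))               # ||beta||_{2,1}
    err_s = np.linalg.norm(R_S @ beta - Y_S, 'fro') ** 2     # source-domain fitting error
    err_t = np.linalg.norm(R_Tu @ beta - Y_Tu, 'fro') ** 2   # target-domain fitting error
    return l21 + C_S * err_s + C_Tu * err_t
```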

We also introduce a class adaptation matrix \(\varTheta \in \mathbb {R}^{c\times c}\) to perform adaptation in the label space, because the label spaces of the source and target domains may be different [29], and label adaptation can help obtain a better classification model. To incorporate label adaptation into our method, we redefine the objective function as

$$\begin{aligned} \min _{\beta ,\varTheta }&F(\beta ,\varTheta )=\left\| \beta \right\| _{2,1}+C_{S}\left\| R_{S}\beta -Y_{S} \right\| _{F}^{2}+ \nonumber \\&C_{Tu}\left\| R_{Tu}\beta -Y_{Tu}\circ \varTheta \right\| _{F}^{2}+ \gamma \left\| \varTheta - I \right\| _{F}^{2} \end{aligned}$$
(11)

\(\left\| \varTheta - I \right\| _{F}^{2}\) is a term that controls the class distortion, and \(\gamma \) is the trade-off parameter. The symbol \(\circ \) denotes a multiplication operator that performs label adaptation between domains. The importance of unlabeled data has been emphasized in [4]: the large amount of unlabeled target domain data contains meaningful information for classification that may otherwise not be fully explored. We therefore minimize the error between \(R_{Tu}\beta \) and \(Y_{Tu}\circ \varTheta \) to further exploit the unlabeled data.

The problem in our method thus becomes how to find the optimal classification matrix \(\beta \) and class adaptation matrix \(\varTheta \) simultaneously.

3.5 Optimization Algorithm

We solve the objective function (11) by alternately optimizing \(\beta \) and \(\varTheta \), since it is differentiable with respect to each variable when the other is fixed.

First, with \(\varTheta \) fixed (initialized as \(\varTheta = I\)), the derivative of (11) with respect to \(\beta \) is

$$\begin{aligned} \frac{\partial F(\beta ,\varTheta ) }{\partial \beta }=2Q\beta +2C_{S}R_{S}^{T}(R_{S}\beta -Y_{S})+2C_{Tu}R_{Tu}^{T}(R_{Tu}\beta -Y_{Tu}\circ \varTheta ) \end{aligned}$$
(12)

in which \(Q\in \mathbb {R}^{L\times L}\) is a diagonal matrix whose i-th diagonal element is

$$\begin{aligned} Q_{ii}=\frac{1}{2\left\| \beta _{i} \right\| _{2}} \end{aligned}$$
(13)

in which \(\beta _{i}\) denotes the i-th row of \(\beta \).

In our algorithm, to avoid division by zero when \(\beta _{i}=0\), we incorporate a small constant \(\epsilon >0\) into (13). Specifically, we use \(\left\| \beta _{i} \right\| _{2}+ \epsilon \) to update Q, so Eq. (13) is rewritten as

$$\begin{aligned} Q_{ii}=\frac{1}{2(\left\| \beta _{i} \right\| _{2}+\epsilon )},\epsilon >0 \end{aligned}$$
(14)

Setting Eq. (12) to zero, i.e. \(\frac{\partial F(\beta ,\varTheta ) }{\partial \beta }=0\), yields the optimal \(\beta \):

$$\begin{aligned} \beta =(Q+C_{S}R_{S}^{T}R_{S}+C_{Tu}R_{Tu}^{T}R_{Tu})^{-1}(C_{S}R_{S}^{T}Y_{S} +C_{Tu}R_{Tu}^{T}Y_{Tu}\circ \varTheta ) \end{aligned}$$
(15)
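A sketch of this closed-form update is given below, where the adapted pre-label matrix \(Y_{Tu}\circ \varTheta \) is passed in as a single argument; the helper name and the default value of \(\epsilon \) are assumptions of this sketch.

```python
import numpy as np

def update_beta(beta_prev, R_S, Y_S, R_Tu, Y_Tu_adapted, C_S, C_Tu, eps=1e-6):
    """One beta update following Eqs. (14)-(15); Y_Tu_adapted stands for Y_Tu ∘ Theta."""
    q = 1.0 / (2.0 * (np.linalg.norm(beta_prev, axis=1) + eps))   # diagonal of Q, Eq. (14)
    Q = np.diag(q)                                                # (L, L) diagonal matrix
    A = Q + C_S * R_S.T @ R_S + C_Tu * R_Tu.T @ R_Tu
    b = C_S * R_S.T @ Y_S + C_Tu * R_Tu.T @ Y_Tu_adapted
    return np.linalg.solve(A, b)                                  # Eq. (15)
```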

Second, with \(\beta \) fixed to the value given by (15) and substituted into the objective function, the optimization problem (11) reduces to

$$\begin{aligned} \min _{\varTheta }F(\varTheta )=C_{Tu}\left\| R_{Tu}\beta -Y_{Tu}\circ \varTheta \right\| _{F}^{2}+\gamma \left\| \varTheta -I \right\| _{F}^{2} \end{aligned}$$
(16)

Then, we can obtain the derivative of (16) with respect to \(\varTheta \). Specifically, we have

$$\begin{aligned} \frac{\partial F(\beta ,\varTheta )}{\partial \varTheta }=-2C_{Tu}Y_{Tu}^{T}(R_{Tu}\beta -Y_{Tu}\circ \varTheta )+2\gamma (\varTheta -I) \end{aligned}$$
(17)

Similarly, setting (17) to zero yields

$$\begin{aligned} \varTheta =(C_{Tu}Y_{Tu}^{T}Y_{Tu}+\gamma I)^{-1}(C_{Tu}Y_{Tu}^{T}R_{Tu}\beta +\gamma I) \end{aligned}$$
(18)
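A corresponding sketch of the \(\varTheta \) update is shown below; here the operator \(\circ \) is read as the matrix product \(Y_{Tu}\varTheta \), which is consistent with the dimensions in Sect. 3.4 but is an assumption of this sketch.

```python
import numpy as np

def update_theta(beta, R_Tu, Y_Tu, C_Tu, gamma):
    """One Theta update following Eq. (18)."""
    c = Y_Tu.shape[1]
    A = C_Tu * Y_Tu.T @ Y_Tu + gamma * np.eye(c)
    b = C_Tu * Y_Tu.T @ R_Tu @ beta + gamma * np.eye(c)
    return np.linalg.solve(A, b)   # Eq. (18)
```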
Algorithm 1. The iterative optimization procedure of the proposed model.

The result is obtained by iteratively optimizing \(\beta \) and \(\varTheta \). The optimization procedure of our model is summarized in Algorithm 1, where \(T_{max}\) denotes the maximum number of iterations. In this paper, we set \(T_{max}\) to 50; once the number of iterations reaches \(T_{max}\), the iterative update procedure is terminated. A sketch of this alternating loop is given below.
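As a rough outline of Algorithm 1, the alternating updates can be combined as follows, reusing the update_beta and update_theta helpers sketched above; the initialization of \(\beta \) and \(\varTheta \) is an assumption of this sketch, since the pseudocode is not reproduced here.

```python
import numpy as np

def alternating_optimization(R_S, Y_S, R_Tu, Y_Tu, C_S, C_Tu, gamma, T_max=50, eps=1e-6):
    L, c = R_S.shape[1], Y_S.shape[1]
    beta = np.ones((L, c)) / c        # simple initialization (assumption)
    Theta = np.eye(c)                 # start from the identity, i.e. no label adaptation
    for _ in range(T_max):
        beta = update_beta(beta, R_S, Y_S, R_Tu, Y_Tu @ Theta, C_S, C_Tu, eps)
        Theta = update_theta(beta, R_Tu, Y_Tu, C_Tu, gamma)
    return beta, Theta

# Target predictions: each row of R_Tu @ beta is scored over the c classes,
# and the class with the largest score is taken as the final label.
```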

Fig. 1. Example actions of the IXMAS dataset. Each row represents an action at five different views.

4 Experimental Results

4.1 Experimental Setting

Dataset. The Inria Xmas Motion Acquisition Sequences (IXMAS) dataset records 11 actions, each of which is treated as a category. Twelve actors perform each action three times, so 396 instances are captured by each camera in total. As shown in Fig. 1, five cameras (domains) capture the actions simultaneously. To extract features from each image, we follow the procedure in [14]; each image is finally represented as a 1000-dimensional vector. This dataset serves as a standard benchmark for human action recognition.

Implementation Details. We follow the procedure in [31] to obtain the CCA projection matrices for both domains. Specifically, two thirds of each domain's samples in each category are selected. The training set consists of 30 labeled samples per category from the source domain and all unlabeled samples from the target domain, and the test set consists of all unlabeled target domain data. We then follow the procedure described in Sect. 3 to train a classifier and compute the classification accuracy. The above procedure is repeated ten times, and the average classification accuracies are reported in Table 1.

4.2 Comparison Methods

We compare our framework with a baseline and several classic unsupervised domain adaptation methods.

SVM [12]. We regard SVM as the baseline; it is a classic method for classification problems. To apply it to the DA problem, we use the original features of both domains directly: we build a prediction model on the source domain data and then classify the instances in the target domain. Since SVM is not designed for the DA problem, its result on the target domain is expected to be the worst among the compared methods.

Subspace Alignment (SA) [5]. This algorithm is very simple: it first learns PCA subspaces for both domains and then derives a linear mapping that aligns them. Models trained on the source domain can then classify the target domain data in the common subspace.

Table 1. The classification accuracies and standard errors (%) for all methods on the IXMAS dataset

Transfer Component Analysis (TCA) [19]. This algorithm is based on the maximum mean discrepancy (MMD), which measures the distance between two distributions. By minimizing the MMD, a projection matrix narrowing the distance between the two domains can be obtained. This method can also map the data of both domains into a kernel space; in our experiments, Gaussian RBF kernels are used.

Geodesic Flow Subspaces (GFK) [7]. This method applies the Grassmann manifold to solve DA problems. First, the PCA or PLS subspaces of both domains are computed. These subspaces are then embedded into the Grassmann manifold and used to transform the original features into super-vectors. Finally, low-dimensional feature vectors are derived, on which a prediction model is trained.

Maximum Independence Domain Adaptation (MIDA) [30]. MIDA introduces the Hilbert-Schmidt independence criterion to adapt different domains. Specifically, to reduce the difference across domains, it maximizes the independence between the learned features and the domain features.

4.3 Parameter Tuning

In our method, there are four parameters in total: \(C_{S}\), \(C_{Tu}\), \(\epsilon \) and \(\gamma \). Tuning all four parameters at the same time is impractical, and in practice it is unnecessary: we can find a good solution by freezing two of them. Specifically, we set \(\epsilon =1\) and \(\gamma =0.1\), and then search for the best values of \(C_{S}\) and \(C_{Tu}\) within the ranges \(\left\{ 4^{0},4^{1},4^{2},4^{3},4^{4},4^{5},4^{6} \right\} \) and \(\left\{ 10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2},10^{3} \right\} \) respectively. Finally, the best performance of our model is reported. A sketch of this grid search is given below.
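The search described above amounts to a simple grid search; in the sketch below, train_and_evaluate is a hypothetical helper that runs the full pipeline of Sect. 3 for one parameter setting and returns the target-domain accuracy.

```python
import itertools

C_S_grid = [4 ** k for k in range(7)]            # 4^0, 4^1, ..., 4^6
C_Tu_grid = [10.0 ** k for k in range(-3, 4)]    # 10^-3, ..., 10^3

best_acc, best_params = -1.0, None
for C_S, C_Tu in itertools.product(C_S_grid, C_Tu_grid):
    # train_and_evaluate is a hypothetical helper, not defined in the paper
    acc = train_and_evaluate(C_S=C_S, C_Tu=C_Tu, eps=1.0, gamma=0.1)
    if acc > best_acc:
        best_acc, best_params = acc, (C_S, C_Tu)
```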

For SVM and the other four state-of-the-art DA methods, we follow the procedures in the corresponding papers to tune the parameters and then report the best classification results.

4.4 Experimental Results and Comparisons

The classification accuracies and standard errors are summarized in Table 1. Cam0–cam4 represent the different domains, and the form A\(\rightarrow \)B indicates that A is the source domain and B is the target domain. For example, cam0\(\rightarrow \)cam1 means that images captured by cam0 are used as the source domain and images captured by cam1 as the target domain. The classification accuracies of SVM are shown in the second column of Table 1, the results of the classic unsupervised DA methods in the third to sixth columns, and the results of our proposed method in the last column. In total, 20 domain pairs are evaluated, and the best result for each pair is shown in bold. From Table 1, we can conclude that

  • The classification model trained by SVM does not perform well. As can be seen from the table, the average accuracy is around 11% and most of the results are no higher than 15%. In real applications, such a model is of little use.

  • Better prediction models can be obtained by training classifiers with the classic unsupervised DA methods (SA, TCA, GFK, MIDA); the average classification accuracy of each method is above 50%. It is worth noting that SA achieves the highest average accuracy (71.4%) compared with TCA, GFK and MIDA, which suggests that SA is more suitable for the IXMAS dataset.

  • The classification result is further improved by our model. Specifically, the average accuracy of our proposed algorithm is 88.1%, an improvement of around 77 percentage points over SVM.

5 Conclusion

A new unsupervised domain adaptation algorithm based on canonical correlation analysis is proposed in this paper. Our method shows competitive performance compared with the SVM baseline and several state-of-the-art unsupervised DA methods, namely SA, TCA, GFK and MIDA.