Restoring latent factors against negative transfer using partial-adaptation nonnegative matrix factorization

  • Ming He
  • Jiuling Zhang
  • Jiang Zhang
Regular Paper


Collaborative filtering usually suffers from limited performance because of data sparsity. Transfer learning offers a way to alleviate this issue by transferring useful knowledge from an auxiliary (source) domain to a target domain. However, the situation becomes complicated when the source and target domains share only part of their knowledge. Transferring the unshared part across domains causes negative transfer and may degrade the prediction accuracy in the target domain. To address this issue, we present a novel model that exploits the latent factors in the target domain against negative transfer. First, we transfer rating patterns from the source domain to approximate and reconstruct the target rating matrix. Second, we propose a partial-adaptation nonnegative matrix factorization method to correct the transfer learning result and restore the latent factors in the target domain. Experiments on real-world datasets demonstrate that the proposed approach effectively addresses negative transfer and significantly outperforms the state-of-the-art transfer learning model.


Transfer learning; Cross-domain recommendation; Negative transfer

1 Introduction

Recommendation systems help users faced with an overwhelming selection of items by identifying particular items that are likely to match each user’s tastes or preferences. Increasingly, people are turning to recommender systems to help them find the information that is most valuable to them. One of the most successful technologies in recommender systems research and practice is collaborative filtering.

The collaborative filtering (CF) approach gathers users and their ratings and then predicts how users will rate items based on their similarity to other users. Collaborative methods can be divided into two models: the neighbourhood-based model (NBM) (Alqadah et al. 2015; Xiaojun 2017) and the latent factor model (LFM) (Langseth and Nielsen 2015). Some of the most successful realizations of LFMs are based on matrix factorization (MF) (Yu et al. 2017; Abdollahi and Nasraoui 2016; Bokde et al. 2015). However, pure latent factor models suffer from several problems, such as poor prediction accuracy, sparsity and limited scalability. In real-world recommender systems, users rate only a very limited number of items, so the rating matrix is often extremely sparse. As a result, the available rating data that can be used for K-NN searches, probabilistic modelling, or matrix factorization are radically insufficient. The sparsity problem has become a major bottleneck for most CF methods.

To alleviate the sparsity problem, one promising approach is to gather the rating data from multiple rating matrices in related domains for knowledge transfer and sharing. The data sparsity problem has been addressed by transfer learning from various perspectives in different research areas. The purpose of cross-domain CF is to transfer rating patterns from the source domain to the target domain to alleviate the sparsity problem in the target domain (Bokde et al. 2015). Codebook-based knowledge transfer (CBT) (Li et al. 2009) is a widely used algorithm in cross-domain collaborative filtering. The codebook consists of user clusters and item clusters and is constructed by simultaneously clustering the users (rows) and items (columns) of the source rating matrix, using the orthogonal non-negative matrix tri-factorization (ONMTF) clustering algorithm (Ding et al. 2006). This method is equivalent to the two-way K-means clustering algorithm. Figure 1 shows the whole learning procedure for CBT, a two-step cross-domain CF algorithm.
Fig. 1

Two-step-based CBT algorithm

Figure 1 shows the two steps of the CBT algorithm, where the unrated items in the target domain \(X_{tgt}\) are marked as “?”. In step 1, CBT applies the ONMTF (Ding et al. 2006) clustering algorithm to the related dense domain \(X_{src}\) to extract the latent common rating patterns, which are referred to as the codebook S (\(X_{src} \approx U_{src} S{V_{src}^T}\)). In step 2, it provides recommendations for the sparse target domain \(X_{tgt}\) by sharing this knowledge in a latent space. The codebook S is thus constructed by simultaneously clustering the users (rows) and items (columns) of \(X_{src}\), and each of its entries indicates the rating that a user belonging to a specific user cluster \(u_{src}\) gives to an item belonging to a specific item cluster \(v_{src}\). The missing values in the target domain \(X_{tgt}\) can then be learned by duplicating the rows and columns of the codebook using \(U_{tgt} S{V_{tgt}^T}\). This approximation can be achieved by the following matrix norm optimization:
$$\begin{aligned}&\mathop {\min }\limits _{{U_{tgt}} \ge 0,{V_{tgt}} \ge 0,S \ge 0} \left\| {{X_{tgt}} - {U_{tgt}}SV_{tgt}^T} \right\| _F^2\nonumber \\&s.t.\quad U_{tgt}^T{U_{tgt}} = I,\, \;V_{tgt}^T{V_{tgt}} = I \end{aligned}$$
where \({X_{tgt}} \in R_ + ^{p \times q}\), \({U_{tgt}} \in R_ + ^{p \times k}\), \({V_{tgt}} \in R_ + ^{q \times l}\), \({S} \in R_ + ^{k \times l}\) and \({\left\| \cdot \right\| _F}\) denotes the Frobenius norm of a matrix.

In our previous studies, we used context information to restore the target domain specific characteristics and validated the effectiveness of the context restoration model. However, that model requires two conditions. First, we must obtain a reference standard for the ratings in the target domain to construct a rating bias matrix. Second, the target domain needs to contain context knowledge that can help group target items or users. In this way, we can obtain both the user-context and the item-context relationship matrices. In the real world, some datasets satisfy these two conditions well, such as MovieLens, while others have difficulty meeting them, such as BookCrossing. Therefore, in this study, we propose a model that restores the target specific characteristics after the transfer learning process without requiring any additional conditions on the target domain.

2 Related work

2.1 Transfer learning

Since Li et al. (2009) pioneered the rating-pattern sharing approach in cross-domain recommendation systems with their CBT algorithm, CBT-based rating-pattern transfer models have been developed extensively to alleviate the sparsity problem of traditional CF approaches. Figure 2 shows the underlying correspondence of the user-item rating patterns between two toy domain matrices. The different colors in the permuted matrices indicate which parts of each matrix are clustered into which entries of the cluster-level codebooks. In contrast, the parts that are not shared between the movie and book domains are colored grey in the target matrix. Figure 2 illustrates the common circumstance in practice in which the source and target share only part of their rating knowledge: no two datasets in the real world are exactly alike. On the one hand, CBT-based models are widely applicable because they impose loose requirements, assuming no overlap of users or items between source and target. On the other hand, sharing cluster-level rating patterns across domains inevitably brings about some degree of negative transfer, which is caused by transferring the parts that differ between source and target.
Fig. 2

Movie and book domains partly share cluster-level rating patterns by CBT

In manifold theory, data are assumed to be sampled from a submanifold embedded in a high-dimensional ambient space (Lin and Zha 2018). Zhou et al. (2005) and Cai et al. (2010) employ graph regularization to maintain the intrinsic geometrical and discriminating structure of the data manifold. Furthermore, Long et al. (2013) propose to combat the negative effects of semantic features via a graph regularized collective matrix factorization model. The graph regularization method, however, still needs to explicitly calculate the similarity between users and items, \(sim\left( x_j,x_l\right)\) in Eq. (2), which is very hard to achieve in an extremely sparse high-dimensional data space.
$$\begin{aligned} \begin{aligned} R&=\frac{1}{2}\sum _{j,l=1}^{N}\left\| z_{j}-z_{l} \right\| ^{2}W_{jl} \\ W_{jl}&=\left\{ \begin{aligned}&sim\left( x_{j},x_{l} \right) \quad for\quad x_{j} \in N_{p}\left( x_{l} \right) \vee \ x_{l} \in N_{p}\left( x_{j} \right) \\&0 \\ \end{aligned} \right. \end{aligned} \end{aligned}$$
where R is the graph regularization function and W is the weight matrix on the graph. \({{W_{jl}}}\) measures the closeness of two points \({x_l}\) and \({x_j}\). \({{N_p}\left( {{x_j}} \right) }\) is the set of the p nearest neighbors of \({x_j}\). \({z_j}\) and \({z_l}\) are the low-dimensional representations of \({x_j}\) and \({x_l}\), respectively.
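As an illustration of Eq. (2), the following minimal sketch builds the weight matrix W from p-nearest-neighbour relations and evaluates R for a given low-dimensional representation. It assumes cosine similarity for \(sim\), which the text does not fix, and the function names `knn_weight_matrix` and `graph_regularizer` are hypothetical.

```python
import numpy as np

def knn_weight_matrix(X, p):
    """W[j, l] = sim(x_j, x_l) when x_j is among the p nearest neighbours
    of x_l or vice versa, and 0 otherwise (Eq. 2).
    Assumption: sim is cosine similarity and p < number of rows."""
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0              # avoid dividing all-zero rows by zero
    unit = X / norms
    sim = unit @ unit.T
    ranking = sim.copy()
    np.fill_diagonal(ranking, -np.inf)   # a point is not its own neighbour
    nearest = np.argsort(-ranking, axis=1)[:, :p]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), p), nearest.ravel()] = True
    mask = mask | mask.T                 # the "or" in Eq. (2) makes W symmetric
    return np.where(mask, sim, 0.0)

def graph_regularizer(Z, W):
    """R = 1/2 * sum_{j,l} ||z_j - z_l||^2 * W[j, l]."""
    diff = Z[:, None, :] - Z[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return 0.5 * float(np.sum(sq_dist * W))
```

For a symmetric W this value equals \(\mathrm{Tr}(Z^T L Z)\) with \(L = D - W\) and D the diagonal degree matrix. The sketch also makes the cost visible: every pairwise similarity has to be computed explicitly, which is the step the text describes as hard for extremely sparse rating data.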

As for the research works of context-aware and adaptive models, tensor factorization (Yao et al. 2015) and context-based splitting methods (Zheng et al. 2014) have been proved useful in the single domain context-aware collaborative filtering. However, the methods used to restore a fulfilled matrix and enhance the transfer learning results are still a topic for further investigation. In addition, Fenza et al. (2011) present a hybrid context-aware system by combining fuzzy clustering and rule mining. All of these approaches still need additional target domain context knowledge.

2.2 Non-negative matrix factorization

Non-negative matrix factorization (NMF) is a matrix decomposition algorithm that focuses on the analysis of data matrices whose elements are nonnegative. In general, NMF factorizes an input nonnegative data matrix X into two nonnegative matrices:
$$\begin{aligned} X \approx U{V^T} \end{aligned}$$
where \(X = \left[ {{x_1},\ldots,{x_N}} \right] \in R_ + ^{M \times N}\) (\(R_{+}^{M \times N}\) is the set of all M-by-N matrices whose elements are nonnegative) and each column of X is a sample vector. NMF aims to find two non-negative matrices \(U \in R_{+}^{M \times K}\) and \(V \in R_{+}^{N \times K}\) whose product approximates the original matrix X well. The cost function of the approximation is defined in Eq. (3).
$$\begin{aligned} \mathop {\min }\limits _{U \ge 0,V \ge 0} {\left\| {X - U{V^T}} \right\| ^2} \end{aligned}$$
When X contains unobserved entries that are stored as zeros, we have the Incomplete-NMF cost function in Eq. (4), as follows:
$$\begin{aligned} \mathop {\min }\limits _{U \ge 0,V \ge 0} {\left\| {W \circ \left( {X - U{V^T}} \right) } \right\| ^2} \end{aligned}$$
where \(\circ\) is the Hadamard product and W is a mask matrix with \(W_{ij}=1\) if \(X_{ij}\ne 0\) and \(W_{ij}=0\) otherwise. Equations (3) and (4) can be optimized iteratively with the update rules in Eqs. (5) and (6), respectively:
$$\begin{aligned}&{u_{ik}} \leftarrow {u_{ik}}\frac{{{{\left( {XV} \right) }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}} \quad {v_{jk}} \leftarrow {v_{jk}}\frac{{{{\left( {{X^T}U} \right) }_{jk}}}}{{{{\left( {V{U^T}U} \right) }_{jk}}}} \end{aligned}$$
$$\begin{aligned}&{u_{ik}} \leftarrow {u_{ik}}\frac{{{{\left[ {\left( {W \circ X} \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}} \qquad {v_{jk}} \leftarrow {v_{jk}}\frac{{{{\left[ {{{\left( {W \circ X} \right) }^T}U} \right] }_{jk}}}}{{{{\left[ {\left( {{W^T} \circ \left( {V{U^T}} \right) } \right) U} \right] }_{jk}}}} \end{aligned}$$
The update rules in Eqs. (5) and (6) have been proved to converge to a local minimum (Lee and Seung 2001; Langville et al. 2014; Chen and Plemmons 2010).
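The multiplicative updates in Eqs. (5) and (6) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the small constant `EPS` added to each denominator is an assumption introduced here to avoid division by zero.

```python
import numpy as np

EPS = 1e-9  # assumption: guards against zero denominators, not part of Eqs. (5)-(6)

def nmf_step(X, U, V):
    """One pass of Eq. (5): update U, then V, for X ~ U V^T."""
    U = U * (X @ V) / (U @ V.T @ V + EPS)
    V = V * (X.T @ U) / (V @ U.T @ U + EPS)
    return U, V

def masked_nmf_step(X, W, U, V):
    """One pass of Eq. (6): only entries with W == 1 drive the update."""
    U = U * ((W * X) @ V) / ((W * (U @ V.T)) @ V + EPS)
    V = V * ((W * X).T @ U) / ((W.T * (V @ U.T)) @ U + EPS)
    return U, V

def masked_cost(X, W, U, V):
    """Cost of Eq. (4); with W all ones it reduces to Eq. (3)."""
    return float(np.sum((W * (X - U @ V.T)) ** 2))
```

Because each update multiplies a nonnegative entry by a nonnegative ratio, U and V stay nonnegative as long as they are initialized that way. Tracking `masked_cost` across iterations is a simple way to observe the convergence behaviour described above.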

The non-negative constraints on U and V only allow additive combinations among different bases. This is the most significant difference between NMF and the other matrix factorization methods, e.g., SVD. NMF can learn parts-based representation of the data and significantly improves interpretability as compared to SVD. The advantages of this parts-based representation have been observed in many real-world problems such as sound analysis (Helen and Virtanen 2005), face recognition (Long et al. 2014), image annotation (Kalayeh et al. 2014), visual tracking (Wu et al. 2013), document clustering (Qin et al. 2017), cancer clustering (Wang et al. 2013) and DNA gene expression analysis (Gaujoux and Seoighe 2012).

In practice, we have \(K \ll M\) and \(K \ll N\). Thus, NMF essentially tries to find a compressed approximation of the original data matrix. We can view this approximation column by column as follows:
$$\begin{aligned} {X_{ * j}} \approx \sum \limits _{k = 1}^K {{U_{ * k}}{v_{jk}}} \end{aligned}$$
where each column \({X_{ * j}}\) is approximated by a linear combination of the K columns of U, and each column \({{U_{ * k}}}\) is weighted by the corresponding entry \({v_{jk}}\) of V. Therefore, U can be viewed as containing a basis that is optimized for the linear approximation of the data in X. Since relatively few basis vectors are used to represent many data vectors, a good approximation can only be achieved if the basis vectors discover latent factors in the data (Huang et al. 2013). In this study, we propose a novel PA-NMF method to find and restore the latent factors in the target domain.

3 Proposed model

Our model consists of two stages. First, we transfer rating patterns from the source domain to alleviate the target domain sparsity. Second, we employ PA-NMF to learn the domain specific knowledge and restore latent factors against negative transfer in the target matrix. The restoration stage is based on the result of transfer learning.

3.1 Transfer rating patterns

In the transfer learning stage, we first learn a codebook \({B_n}\) from each source domain n using the ONMTF (Ding et al. 2006) method. Then, we take \({B_n}\) as the medium to transfer rating patterns from sources to the target. Equation (7) is the cost function of target matrix approximation.
$$\begin{aligned} \begin{aligned}&\mathop {\min }\limits _{\begin{array}{c} \scriptstyle {U_n} \in {\left\{ {0,1} \right\} ^{{p_n} \times {k_n}}}\\ \scriptstyle {V_n} \in {\left\{ {0,1} \right\} ^{{q_n} \times {l_n}}}\\ \scriptstyle {\lambda _n} \in R\forall n \in N \end{array}} \left\| {\left[ {{X_{tgt}} - \sum \limits _{n = 1}^N {{\lambda _n}\left( {{{\left[ {{U_{tgt}}} \right] }_n}{B_n}{{\left[ {V_{tgt}^T} \right] }_n}} \right) } } \right] \circ W} \right\| _F^2\\&s.t. \quad {\left[ {{U_{tgt}}} \right] _n}1 = 1, \quad {\left[ {{V_{tgt}}} \right] _n}1 = 1,\quad \sum \limits _n {{\lambda _n} = 1,\quad } {\lambda _n} \ge 0 \end{aligned} \end{aligned}$$
where \({X_{tgt}} \in R_{\mathrm{+ }}^{p \times q}\) is the target matrix with p users and q items. By introducing multiple source domains, we can mitigate the under-fitting problem of the single-source CBT model. Furthermore, we constrain the relatedness coefficients \(\lambda _n\) in Eq. (7) to overcome the over-fitting problem in multi-source cross-domain models (Moreno et al. 2012). Finally, we obtain the fully filled target matrix \({\tilde{X}_{tr}}\) by Eq. (8).
$$\begin{aligned} {\tilde{X}_{tr}} = W \circ {X_{tgt}} + \left[ {1 - W} \right] \circ \left[ {\sum \limits _{n = 1}^N {{\lambda _n}\left( {{{\left[ {{U_{tgt}}} \right] }_n}{B_n}{{\left[ {V_{tgt}^T} \right] }_n}} \right) } } \right] \end{aligned}$$
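The filling step in Eq. (8) keeps every observed target rating and takes the remaining entries from the weighted codebook reconstructions. The sketch below assumes that the membership matrices \([U_{tgt}]_n\), \([V_{tgt}]_n\), the codebooks \(B_n\) and the coefficients \(\lambda_n\) have already been learned from Eq. (7); the function name `fill_target` and its argument names are hypothetical.

```python
import numpy as np

def fill_target(X_tgt, W, U_list, B_list, V_list, lambdas):
    """Eq. (8): observed ratings (W == 1) are kept, unobserved ones are
    taken from sum_n lambda_n * U_n B_n V_n^T."""
    recon = sum(lam * (U @ B @ V.T)
                for lam, U, B, V in zip(lambdas, U_list, B_list, V_list))
    return W * X_tgt + (1 - W) * recon

# Toy usage with a single source domain (lambda_1 = 1).
X_tgt = np.array([[5.0, 0.0], [0.0, 3.0]])       # 0 marks an unrated entry
W = (X_tgt != 0).astype(float)
U = np.array([[1.0, 0.0], [0.0, 1.0]])           # user -> user-cluster membership
V = np.array([[1.0, 0.0], [0.0, 1.0]])           # item -> item-cluster membership
B = np.array([[5.0, 1.0], [2.0, 4.0]])           # codebook of cluster-level ratings
X_tr = fill_target(X_tgt, W, [U], [B], [V], [1.0])
```

In this toy case the observed value 3 at position (2, 2) is kept even though the codebook would predict 4, which mirrors the split of \({\tilde{X}_{tr}}\) into a fixed observed part and a transferred part.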

3.2 Latent factors restoration

The rating-pattern transfer learning method is also known as sharing cluster-level latent factors across domains, in which we view the rating patterns as the cluster-level latent factors of the matrix (Gao et al. 2013). The cluster-level structures hidden across domains can be extracted by learning the rating patterns of user groups on the item clusters in the source domains. The state-of-the-art transfer learning models (Li et al. 2009; Moreno et al. 2012; Gao et al. 2013) treat sparsity in the target domain as the most important cause of low prediction accuracy, so they simply assume that the source and target domains share most of their latent factors. They then use all the source domain latent factors to approximate and reconstruct the target domain matrix as in Eqs. (7) and (8). However, as discussed above, the assumption that all latent factors are shared across domains does not always hold in real-world circumstances, where the ratings from multiple domains cannot share all their correspondence at the cluster level. No two real-world datasets are identical, nor do any two share their latent factors entirely. Therefore, the transfer learning stage inevitably leads to some degree of negative transfer across domains. For this reason, we first formulate the latent factors sharing assumption and then propose a method to correct the transfer learning result \({\tilde{X}_{tr}}\) and restore the latent factors of the target domain \(X_{tgt}\).

3.2.1 Problem definition

After the transfer learning stage, we obtain the fully filled target matrix \({\tilde{X}_{tr}}\). In Eq. (8), \({\tilde{X}_{tr}}\) is composed of two terms. The first term \(W\circ {X_{tgt}}\) contains the original ratings in the target domain, from which we know that \(W\circ {X_{tgt}}=W\circ {\tilde{X}_{tr}}\). The second term \(\left( {1 - W} \right) \circ {\tilde{X}_{tr}}\) is reconstructed from the latent factors transferred from the source domain. If we assume that the transfer learning stage is adaptive and accurate, then we can infer that the transfer learning result \({\tilde{X}_{tr}}\) and the original ratings in the target domain \(W\circ {\tilde{X}_{tr}}\) should share the same set of latent factors. Therefore, we can formulate this description using matrix decomposition in Eq. (9).
$$\begin{aligned} {\tilde{X}_{tr}} \approx U{V^T} \cup W \circ {\tilde{X}_{tr}} \approx W \circ \left( {U{V^T}} \right) \end{aligned}$$
The cost functions in Eqs. (3) and (4) should be equivalent for \({\tilde{X}_{tr}}\) when Eq. (9) holds. We then obtain Eq. (10) as follows:
$$\begin{aligned} \min {\left\| {\tilde{X}_{tr} - U{V^T}} \right\| ^2} \Leftrightarrow \min {\left\| {W \circ \left( {\tilde{X}_{tr} - U{V^T}} \right) } \right\| ^2} \end{aligned}$$
where U and V represent the latent factors shared by \({\tilde{X}_{tr}}\) and \(W\circ {\tilde{X}_{tr}}\). Since \(W\circ {\tilde{X}_{tr}}\) is the fixed part of \({\tilde{X}_{tr}}\) in Eq. (10), we state our objective in Eq. (11):
$$\begin{aligned} {\arg _{U,V,(1 - W)\circ {\tilde{X}_{re}}}}\left\{ {\min {{\left\| {\tilde{X}_{tr} - U{V^T}} \right\| }^2} \Leftrightarrow \min {{\left\| {W \circ \left( {\tilde{X}_{tr} - U{V^T}} \right) } \right\| }^2}} \right\} \end{aligned}$$
where we try to find the latent factor matrices U and V and adjust the variable part \(\left( {1 - W} \right) \circ {\tilde{X}_{tr}}\) so that Eq. (9) holds.

3.2.2 Partial-adaptation NMF

The state-of-the-art NMF methods find latent factors for all valued entries and approximate the whole matrix \(\tilde{X}_{tr}\). Thus, we cannot use this kind of NMF to fix one part of the matrix, \(W\circ {\tilde{X}_{tr}}\), and adjust the other part, \(\left( {1 - W} \right) \circ {\tilde{X}_{tr}}\). Therefore, we introduce the partial adaptation concept to NMF, by which we can adjust part of the matrix and make it more consistent with the latent factors of the fixed part of the matrix.

Based on the standard NMF updating rules in Eqs. (5) and (6), we solve the formulation in Eq. (11) iteratively. The partial-adaptation NMF is composed of two steps. In the first step, we apply the updating functions in Eq. (5) to the transfer learning result \(\tilde{X}_{tr}\) to roughly estimate the latent factor matrices U and V. In the second step, we fix part of the target matrix, \(W \circ {\tilde{X}_{tr}}\), and update and adjust the other part, \(\left( {1 - W} \right) \circ {\tilde{X}_{tr}}\). To do so, we replace X with \(W \circ {\tilde{X}_{tr}} + \left( {1 - W} \right) \circ U{V^T}\) in the iteration functions in Eq. (5). The first term of the replacement, \(W \circ {\tilde{X}_{tr}}\), is the fixed part of the matrix, and the second term, \(\left( {1 - W} \right) \circ U{V^T}\), contains the transferred ratings, which are iteratively adjusted based on the latent factor matrices U and V in the target domain. The updating functions for PA-NMF are given in Eqs. (12) and (13).
$$\begin{aligned}&{u_{ik}} \leftarrow {u_{ik}}\frac{{{{\left[ {\left( {W \circ \tilde{X}_{tr} + \left( {1 - W} \right) \circ U{V^T}} \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}} \end{aligned}$$
$$\begin{aligned}&{v_{jk}} \leftarrow {v_{jk}}\frac{{{{\left[ {\left( {{W^T} \circ {\tilde{X}_{tr}^T} + {{\left( {1 - W} \right) }^T} \circ {{\left( {U{V^T}} \right) }^T}} \right) U} \right] }_{jk}}}}{{{{\left( {V{U^T}U} \right) }_{jk}}}} \end{aligned}$$
In this way, specific knowledge U and V of the target latent factors can be restored by adjusting \(\left( {1 - W} \right) \circ {\tilde{X}_{tr}}\) during the iterations. Finally, we can get the restored matrix \(\tilde{X}_{re}\) by Eq. (14).
$$\begin{aligned} {\tilde{X}_{re}} = W \circ {\tilde{X}_{tr}} + (1 - W) \circ \left( {U{V^T}} \right) \end{aligned}$$
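Equations (12) to (14) can be read as the standard update of Eq. (5) applied to a matrix whose unobserved part is refreshed from the current factors. The following sketch shows one such step and the final restoration. It is an illustration under stated assumptions: the names `pa_nmf_step` and `restore` are hypothetical, `EPS` is added to avoid division by zero, and recomputing the mixed matrix with the freshly updated U before updating V is one possible reading of Eq. (13).

```python
import numpy as np

EPS = 1e-9  # assumption: guards against zero denominators

def pa_nmf_step(X_tr, W, U, V):
    """One PA-NMF pass: Eq. (12) for U, then Eq. (13) for V."""
    X_mix = W * X_tr + (1 - W) * (U @ V.T)   # fixed part + adjustable part
    U = U * (X_mix @ V) / (U @ V.T @ V + EPS)
    X_mix = W * X_tr + (1 - W) * (U @ V.T)   # assumption: refresh with new U
    V = V * (X_mix.T @ U) / (V @ U.T @ U + EPS)
    return U, V

def restore(X_tr, W, U, V):
    """Eq. (14): keep observed ratings, take the rest from U V^T."""
    return W * X_tr + (1 - W) * (U @ V.T)
```

Since the unobserved entries of the mixed matrix always equal the current \(UV^T\), they contribute no residual, so only the observed ratings pull U and V away from their current values.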

3.2.3 Convergence analysis

The updating formula Eq. (6) can be transformed to
$$\begin{aligned} u_{ik}^{t + 1}&= u_{ik}^t\frac{{{{\left[ {\left( {W \circ X} \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}\\ u_{ik}^{t + 1} - u_{ik}^t&= u_{ik}^t\frac{{{{\left[ {\left( {W \circ X} \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}} - u_{ik}^t\\&= u_{ik}^t\left( {\frac{{{{\left[ {\left( {W \circ X} \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}} - 1} \right) \\&= u_{ik}^t\left( {\frac{{{{\left[ {\left( {W \circ X} \right) V - \left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}} \right) \\&= u_{ik}^t\left( {\frac{{{{\left[ {W \circ \left( {X - \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}} \right) \end{aligned}$$
Since the update in Eq. (6) has been proved to be convergent, we can infer that
$$\begin{aligned} \mathop {\lim }\limits _{t \rightarrow \infty } u_{ik}^t\left( {\frac{{{{\left[ {W \circ \left( {X - \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left[ {\left( {W \circ \left( {U{V^T}} \right) } \right) V} \right] }_{ik}}}}} \right) = 0 \end{aligned}$$
Then, we can prove that the PA-NMF updating formula Eq. (12) is convergent
$$\begin{aligned}&\frac{{{{\left[ {\left( {W \circ X + \left( {1 - W} \right) \circ U{V^T}} \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}}\\&\quad = \frac{{{{\left[ {\left( {W \circ X + U{V^T} - W \circ U{V^T}} \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}}\\&\quad = \frac{{{{\left[ {\left( {U{V^T} + W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}}\\&\quad = \frac{{{{\left( {U{V^T}V} \right) }_{ik}} + {{\left[ {\left( {W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}}\\&\quad = 1 + \frac{{{{\left[ {\left( {W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}}\\&\Rightarrow u_{ik}^{t + 1} - u_{ik}^t = u_{ik}^t\frac{{{{\left[ {\left( {W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}} \\&\quad \le u_{ik}^t\frac{{{{\left[ {\left( {W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {W \circ \left( {U{V^T}} \right) V} \right) }_{ik}}}}\\&\mathop {\lim }\limits _{t \rightarrow \infty } u_{ik}^t\frac{{{{\left[ {\left( {W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {W \circ \left( {U{V^T}} \right) V} \right) }_{ik}}}} = 0 \\&\quad \Rightarrow \mathop {\lim }\limits _{t \rightarrow \infty } u_{ik}^t\frac{{{{\left[ {\left( {W \circ \left( {X - U{V^T}} \right) } \right) V} \right] }_{ik}}}}{{{{\left( {U{V^T}V} \right) }_{ik}}}} = 0 \end{aligned}$$
The convergence of Eq. (13) can be proved in the same way as that of Eq. (12). Alternately updating U and V for the target matrix monotonically reduces the value of the two equivalent cost functions in Eq. (10) and makes them converge to a local minimum.

3.2.4 Algorithm

Algorithm 1 demonstrates the first (lines 7–9) and second (lines 10–20) steps of PA-NMF. In practice, it is crucial to deal with overfitting in the latent factors restoration stage. This problem can be overcome by dividing the training data into two parts and using the validation set to decide whether to stop the iteration. In addition, we combine several parameters to stop the iteration early. The K in Eq. (3) can be set to a relatively large value to prevent the restoration stage from introducing new noise in the PA-NMF process. Setting \(M \ll T\) can also speed up the learning process. The values of the iteration count T and the error threshold \(\psi\) help us to stop the iteration under the appropriate circumstances. All these parameters are specific to a given target domain and can be determined by cross validation.

As our restoration model runs at the tipping point between overfitting and underfitting, the rule of thumb in practice is to choose a relatively low value of T or a relatively high value of \(\psi\) so that the iteration stops early and the results U and V slightly underfit Eq. (10) (lines 10–16). Moreover, we run the PA-NMF algorithm N times in the outermost loop (lines 3–21) and keep the \({U_f}\) and \({V_f}\) that produce the smallest value of the cost function in Eq. (10), which compensates for the accuracy lost to the underfitting caused by ending the iteration early (lines 18–20).
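The procedure described above can be sketched as follows. This is not a transcription of Algorithm 1, whose listing is not reproduced here, but a reading of the prose under explicit assumptions: M is taken to be the number of warm-up iterations of Eq. (5), the iteration stops when the decrease of the masked cost falls below \(\psi\), U and V are randomly initialized in each of the N restarts, and `EPS` is added to avoid division by zero. The function name `pa_nmf` and all default values are hypothetical.

```python
import numpy as np

EPS = 1e-9  # assumption: guards against zero denominators

def masked_cost(X, W, U, V):
    """Right-hand cost of Eq. (10), evaluated on the observed entries."""
    return float(np.sum((W * (X - U @ V.T)) ** 2))

def pa_nmf(X_tr, W, K, M=5, T=50, psi=1e-3, N=3, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _restart in range(N):                 # outermost loop: N attempts
        U = rng.random((X_tr.shape[0], K)) + EPS
        V = rng.random((X_tr.shape[1], K)) + EPS
        for _warm in range(M):                # step 1: Eq. (5) on X_tr
            U = U * (X_tr @ V) / (U @ V.T @ V + EPS)
            V = V * (X_tr.T @ U) / (V @ U.T @ U + EPS)
        prev = masked_cost(X_tr, W, U, V)
        for _it in range(T):                  # step 2: Eqs. (12) and (13)
            X_mix = W * X_tr + (1 - W) * (U @ V.T)
            U = U * (X_mix @ V) / (U @ V.T @ V + EPS)
            X_mix = W * X_tr + (1 - W) * (U @ V.T)
            V = V * (X_mix.T @ U) / (V @ U.T @ U + EPS)
            cost = masked_cost(X_tr, W, U, V)
            if prev - cost < psi:             # assumed stopping rule on psi
                break
            prev = cost
        final = masked_cost(X_tr, W, U, V)
        if best is None or final < best[0]:   # keep the best U_f, V_f
            best = (final, U, V)
    cost_f, U_f, V_f = best
    X_re = W * X_tr + (1 - W) * (U_f @ V_f.T)  # Eq. (14)
    return X_re, U_f, V_f, cost_f
```

A lower T or a higher `psi` ends step 2 sooner, which corresponds to the deliberate slight underfitting discussed above, while the restarts select the factors with the smallest cost among the attempts.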

4 Experiments

We validated our model on multiple publicly available real-world datasets. To the best of our knowledge, this is the first work that introduces the concept of target domain restoration into cross-domain transfer learning. Therefore, we compared the prediction accuracy with and without the latent factors restoration stage after the first transfer learning stage, which confirmed the effectiveness of our model against negative transfer and its improved prediction performance. For each experiment, we first employed transfer learning to transfer rating patterns into the target domain and recorded the prediction accuracy. We then restored the latent factors based on the results of the transfer learning stage. We used the mean absolute error (MAE) as the evaluation metric:
$$\begin{aligned} MAE = \frac{{\sum \nolimits _{i = 1}^T {\left| {{P_i} - {R_i}} \right| } }}{T} \end{aligned}$$
where \({P_i}\) is the predicted rating, \({R_i}\) is the actual rating and T is the number of test ratings. Smaller MAE values indicate a higher accuracy.
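For reference, the MAE metric above amounts to the following short function. The name `mean_absolute_error` and the toy ratings are illustrative only.

```python
def mean_absolute_error(predicted, actual):
    """MAE over paired predicted and actual test ratings."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, actual))
    return total / len(predicted)

# Toy usage: absolute errors are 1, 0 and 2, so the mean is 1.0.
example_mae = mean_absolute_error([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
```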

4.1 Datasets setup

We used Netflix and Jester as the source domains in the transfer learning stage and predicted the missing ratings on MovieLens and BookCrossing. We extracted a relatively dense part from the huge Netflix dataset with 38,934 ratings and \(97.3\%\) density. The Jester dataset we used is \(100\%\) dense. As for the target domains, MovieLens is \(3.8\%\) dense with 89,132 ratings and BookCrossing is \(2.9\%\) dense with 11,003 ratings.

We repeated the experiment multiple times on each target domain. In each experiment, we randomly selected \(80\%\) of the ratings to train the model and kept the remaining \(20\%\) for testing. All ratings in BookCrossing were divided by 2 to normalize their rating scale from 1–10 to 1–5. Moreover, in order not to introduce a rating bias into the target matrix, the odd ratings in the original BookCrossing matrix retained the fractional part ‘.5’ produced by the division by 2 in the normalized matrix. In addition, we used \(X_{tgt}^T \in R_ + ^{q \times p}\) instead of \({X_{tgt}} \in R_ + ^{p \times q}\) in the latent factors restoration stage for the BookCrossing dataset because the number of users p is considerably larger than the number of items q. The setup for all the datasets is shown in Table 1.
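The preprocessing described above can be sketched as follows. The sketch assumes that unrated entries are stored as zeros, as in the mask definition of Eq. (4), and that the split is drawn over observed ratings; the function names `halve_ratings` and `split_observed` are hypothetical.

```python
import numpy as np

def halve_ratings(X):
    """Divide a 1-10 scale by 2; odd ratings keep the fractional part .5."""
    return np.asarray(X, dtype=float) / 2.0

def split_observed(X, train_fraction=0.8, seed=0):
    """Randomly assign observed (non-zero) ratings to train and test masks."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(X)
    order = rng.permutation(rows.size)
    n_train = int(round(train_fraction * rows.size))
    train_idx, test_idx = order[:n_train], order[n_train:]
    W_train = np.zeros(X.shape)
    W_test = np.zeros(X.shape)
    W_train[rows[train_idx], cols[train_idx]] = 1.0
    W_test[rows[test_idx], cols[test_idx]] = 1.0
    return W_train, W_test
```

The training mask plays the role of W in the transfer and restoration stages, while the test mask selects the held-out ratings on which the prediction error is measured.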
Table 1

Datasets setup




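Dataset        Role     Size (users × items)   Density (%)
Netflix        Source   200 × 200              97.3
Jester         Source   200 × 100              100
MovieLens      Target   671 × 3473             3.8
BookCrossing   Target   1157 × 325             2.9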



4.2 Experiment results

Table 2 and Fig. 3 compare the prediction accuracy of 7 experiments performed on the MovieLens and BookCrossing data. As expected, our latent factors restoration (LFR) model clearly and consistently improves the accuracy of the transfer learning (TL) results. For MovieLens, the MAE decreases by between 0.037 and 0.055 (0.043 on average). For BookCrossing, the MAE decreases by between 0.032 and 0.068 (0.057 on average). Furthermore, in Fig. 4, we can clearly distinguish the latent factors restoration stage from the initial transfer learning stage at the point where the curves drop again after flattening out. Figure 4 demonstrates that the transfer learning stage converges after 5–12 iterations on both the MovieLens and BookCrossing data. As for the difference between the two target datasets, Fig. 4 shows that the test error curves fluctuate more for BookCrossing than for MovieLens.

We can also observe in Fig. 4 and Table 2 that the differences in MAE among the experiments tend to become smaller after the latent factors in the target domains are restored. Moreover, the curves in Fig. 4 converge toward each other at the end. This observation is confirmed by the standard deviation (STDEV) column in Table 2. We view this tendency as the result of overcoming negative transfer by restoring latent factors on both the MovieLens and BookCrossing datasets. Across the experiments, the prediction accuracy of the transfer learning stage varies considerably because of the different degrees of negative transfer. However, our model can compensate for these different degrees of negative transfer, which leaves only minor variation in the prediction accuracy across all the experiments.
Table 2

MAE values of the 7 experiments on MovieLens and BookCrossing

Bold values highlight the experimental results of the model proposed in this paper

Fig. 3

MAE results of the 7 experiments on MovieLens and BookCrossing

Fig. 4

Convergence curves of the test error MAE on MovieLens and BookCrossing



The work is supported by the Beijing Natural Science Foundation (No. 4192008) and the General Project of Beijing Municipal Education Commission (No. KM201710005023).


  1. Abdollahi, B., Nasraoui, O.: Explainable matrix factorization for collaborative filtering. In: Proceedings of the 25th International Conference Companion on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 5–6 (2016)
  2. Alqadah, F., Reddy, C.K., Hu, J., Alqadah, H.F.: Biclustering neighborhood-based collaborative filtering method for top-n recommender systems. Knowl. Inf. Syst. 44(2), 475–491 (2015)
  3. Bokde, D., Girase, S., Mukhopadhyay, D.: Matrix factorization model in collaborative filtering algorithms: a survey. Procedia Comput. Sci. 49, 136–146 (2015)
  4. Cai, D., He, X., Han, J.: Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1548–1560 (2010)
  5. Chen, D., Plemmons, R.J.: Nonnegativity constraints in numerical analysis. In: Bultheel, A. (ed.) The Birth of Numerical Analysis, pp. 109–139. World Scientific, Singapore (2010)
  6. Ding, C., Li, T., Peng, W.: Orthogonal nonnegative matrix t-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135 (2006)
  7. Fenza, G., Fischetti, E., Furno, D.: A hybrid context aware system for tourist guidance based on collaborative filtering. In: IEEE International Conference on Fuzzy Systems, pp. 131–138 (2011)
  8. Gao, S., Luo, H., Chen, D.: Cross-domain recommendation via cluster-level latent factor model. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 161–176. Springer, Berlin (2013)
  9. Gaujoux, R., Seoighe, C.: Semi-supervised nonnegative matrix factorization for gene expression deconvolution: a case study. Infect. Genet. Evolut. 12(5), 913–921 (2012)
  10. Helen, M., Virtanen, T.: Separation of drums from polyphonic music using non-negative matrix factorization and support vector machine. In: 13th European Signal Processing Conference, pp. 1–4 (2005)
  11. Huang, K., Sidiropoulos, N.D., Swami, A.: Non-negative matrix factorization revisited: uniqueness and algorithm for symmetric decomposition. IEEE Trans. Signal Process. 62(1), 211–224 (2013)
  12. Kalayeh, M.M., Idrees, H., Shah, M.: NMF-KNN: image annotation using weighted multi-view non-negative matrix factorization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 184–191 (2014)
  13. Langseth, H., Nielsen, T.D.: Scalable learning of probabilistic latent models for collaborative filtering. Decis. Support Syst. 74, 1–11 (2015)
  14. Langville, A.N., Meyer, C.D., Albright, R.: Algorithms, initializations, and convergence for the nonnegative matrix factorization. arXiv:1407.7299 (2014)
  15. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp. 556–562 (2001)
  16. Li, B., Yang, Q., Xue, X.: Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction. In: Twenty-First International Joint Conference on Artificial Intelligence (2009)
  17. Lin, T., Zha, H.: Riemannian manifold learning. IEEE Trans. Pattern Anal. Mach. Intell. 30(5), 796–809 (2008)
  18. Long, X., Lu, H., Peng, Y.: Graph regularized discriminative non-negative matrix factorization for face recognition. Multimed. Tools Appl. 72(3), 2679–2699 (2014)
  19. Long, M., Wang, J., Ding, G.: Adaptation regularization: a general framework for transfer learning. IEEE Trans. Knowl. Data Eng. 26(5), 1076–1089 (2013)
  20. Moreno, O., Shapira, B., Rokach, L.: TALMUD: transfer learning for multiple domains. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 425–434 (2012)
  21. Qin, A., Shang, Z., Tian, J.: Maximum correntropy criterion for convex and semi-nonnegative matrix factorization. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1856–1861 (2017)
  22. Wang, J.J.Y., Wang, X., Gao, X.: Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinform. 14(1), 107 (2013)
  23. Wu, Y., Shen, B., Ling, H.: Visual tracking via online nonnegative matrix factorization. IEEE Trans. Circuits Syst. Video Technol. 24(3), 374–383 (2013)
  24. Xiaojun, L.: An improved clustering-based collaborative filtering recommendation algorithm. Clust. Comput. 20(2), 1281–1288 (2017)
  25. Yao, L., Sheng, Q.Z., Qin, Y.: Context-aware point-of-interest recommendation using tensor factorization with social regularization. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1007–1010 (2015)
  26. Yu, Y., Wang, C., Wang, H., Gao, Y.: Attributes coupling based matrix factorization for item recommendation. Appl. Intell. 46(3), 521–533 (2017)
  27. Zheng, Y., Burke, R., Mobasher, B.: Splitting approaches for context-aware recommendation: an empirical study. In: Proceedings of the 29th Annual ACM Symposium on Applied Computing, pp. 274–279 (2014)
  28. Zhou, D., Hofmann, T., Schölkopf, B.: Semi-supervised learning on directed graphs. In: Advances in Neural Information Processing Systems, pp. 1633–1640 (2005)

Copyright information

© China Computer Federation (CCF) 2019

Authors and Affiliations

  1. Faculty of Information Technology, Beijing University of Technology, Beijing, China
  2. State Grid YINGDA International Holdings Co., Ltd., Beijing, China
