
1 Introduction

Multi-view clustering has achieved great development and has been successfully applied in many applications, such as image retrieval [9], webpage classification [1, 25], and speech recognition [12]. Recently, many methods have been proposed, such as multi-view k-means clustering [2], multi-view spectral clustering via bipartite graph [10], and co-regularized multi-view spectral clustering [8]. Compared with single-view clustering, multi-view clustering can exploit the complementary information among multiple views and thus has the potential to achieve better performance [29].

Conventional multi-view clustering methods commonly require that every sample has all of the views. However, in real-world applications it often happens that some views are missing for part of the samples [18]. For example, blood test results and magnetic resonance images can be regarded as two necessary views for diagnosing a disease, yet we often have only one of the two views for some individuals because they took only one of the two tests. In this case, the conventional methods fail. In this paper, we refer to the clustering task with incomplete views as incomplete multi-view clustering (IMC).

For IMC, a few methods have been proposed, which can be roughly categorized into two groups. The first group is based on completing the incomplete views. For example, Trivedi et al. proposed a kernel CCA based method, which tries to recover the kernel matrix of the incomplete view and then learns two projections for the two views, respectively [18]. However, it requires at least one complete view for reference; in other words, it is not applicable to the case where all views are incomplete. To address this issue, Gao et al. proposed a two-step approach, which first fills in the missing views with the corresponding view averages of all samples and then learns the common representation for the two views based on spectral graph theory [5]. The shortcoming of this approach is that it introduces useless or even noisy information into the data. For data with a small incomplete percentage, this approach may be effective; however, for data with a large incomplete percentage, it hampers the learning of the common representation since the useless information may dominate the representation learning [17]. The second group focuses on directly learning the common latent subspace or representation for all views, in which the most representative works are partial multi-view clustering (PVC) [30], multi-incomplete-view clustering (MIC) [17], and incomplete multi-modality grouping (IMG) [28]. Based on non-negative matrix factorization (NMF), PVC directly learns a common latent representation for two views by simply regularizing different views of the same sample to have the same representation [30]. MIC jointly learns the latent representation of each view and the consensus representation by utilizing the weighted NMF algorithm, in which the missing views are assigned a small weight or even a weight of 0 during learning [17]. IMG can be viewed as an extension of PVC, which further embeds an adaptively learned graph on the latent representation [28].

Although some methods have been proposed to address the IMC problem, several problems still exist that limit their performance. First, these methods all ignore the geometric structure of the data, which means that the intrinsic geometric structure may be destroyed in the representation space and lead to poor performance. The second shortcoming, especially for MIC and IMG, is that there are many penalty parameters (more than three) to be set. These tunable parameters directly influence the clustering performance and limit real applications because adaptively selecting the optimal parameters for different datasets is still an open problem. The third shortcoming is that none of these methods can handle the out-of-sample problem. In this paper, we propose a novel and simple IMC method, named incomplete multi-view clustering via graph regularized matrix factorization (IMC_GRMF), to solve the above problems and improve the performance. Similar to PVC, the matrix factorization technique is exploited to learn the common latent representation, in which the representations corresponding to samples with all views are regularized to be consistent. In addition, a nearest neighbor graph is neatly imposed on the reconstruction errors of the matrix factorization to exploit the local geometric structure of the data, which enables the method to learn a more compact and discriminative representation for clustering. Compared with the other methods, our approach does not introduce any extra regularization term and corresponding penalty parameter to preserve the locality structure of the data. Extensive experimental results prove the effectiveness of the proposed method for incomplete multi-view clustering.

2 Notations and Related Work

2.1 Notations

Let \({X^{( k )}} = {[ {X_c^{\left( k \right) T};{{\bar{X}}^{\left( k \right) T}}}]^T} \in {R^{( {{n_c} + {n_k}}) \times {m_k}}}\) be the kth view of the data, where each sample in the corresponding view is represented by a row vector with \({m_k}\) features and \({n_c}\) is the number of paired samples (i.e., samples without any missing views). \(x_i^{(k)}\) denotes the features of the kth view of the ith sample. We refer to the kth view as \(Vi\left( k \right) \). \({\bar{X}^{\left( k \right) }} \in {R^{{n_k} \times {m_k}}}\) contains the \({n_k}\) samples that only have the features of \(Vi\left( k \right) \), while their features in the other views are missing. The total number of samples is \(n = {n_c} + \sum \limits _{k = 1}^v {{n_k}} \). For a matrix \(A \in {R^{m \times n}}\), its Frobenius norm (\({l_F}\) norm) and \({l_1}\) norm are defined as \({\left\| A \right\| _F} = \sqrt{\sum \limits _{j = 1}^n {\sum \limits _{i = 1}^m {a_{i,j}^2} } } \) and \({\left\| A \right\| _1} = \sum \limits _{j = 1}^n {\sum \limits _{i = 1}^m {\left| {{a_{i,j}}} \right| } } \), respectively, where \({a_{i,j}}\) denotes the element in the ith row and jth column of matrix A [14, 23]. \(Tr\left( \cdot \right) \) is the trace operator. \({A^T}\) denotes the transpose of matrix A [15]. I is the identity matrix. \(A \ge 0\) means that all elements of matrix A are non-negative.

2.2 Partial Multi-View Clustering (PVC)

For data with two incomplete views, PVC seeks to learn a common latent subspace for the two views, in which different views of the same sample have the same representation [14]. The learning model of PVC is formulated as follows:

$$\begin{aligned} \begin{array}{r} \mathop {\min }\limits _{{P_c},{{\bar{P}}^{(1)}},{{\bar{P}}^{(2)}},{U^{(1)}},{U^{(2)}}} \left\| {\left[ \begin{array}{l} X_c^{(1)}\\ {{\bar{X}}^{(1)}} \end{array} \right] - \left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(1)}} \end{array} \right] {U^{(1)}}} \right\| _F^2 + \lambda {\left\| {\left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(1)}} \end{array} \right] } \right\| _1}\\ + \left\| {\left[ \begin{array}{l} X_c^{(2)}\\ {{\bar{X}}^{(2)}} \end{array} \right] - \left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(2)}} \end{array} \right] {U^{(2)}}} \right\| _F^2 + \lambda {\left\| {\left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(2)}} \end{array} \right] } \right\| _1}\\ s.t.{} {} {U^{(1)}} \ge 0,{U^{(2)}} \ge 0,{P_c} \ge 0,{{\bar{P}}^{(1)}} \ge 0,{{\bar{P}}^{(2)}} \ge 0, \end{array} \end{aligned}$$
(1)

where \(\lambda \) is the penalty parameter. \({U^{(1)}} \in {R^{K \times {m_1}}}\) and \({U^{(2)}} \in {R^{K \times {m_2}}}\) are the latent space basis matrices for the two views, \({P_c} \in {R^{{n_c} \times K}}\), \({\bar{P}^{(1)}} \in {R^{{n_1} \times K}}\), and \({\bar{P}^{(2)}} \in {R^{{n_2} \times K}}\) are the latent representations of the original data, and K is the dimension of the latent space.

For PVC, the new representation of all samples can be expressed as \(P = \left[ \begin{array}{l}{P_c}\\ {{\bar{P}}^{(1)}}\\ {{\bar{P}}^{(2)}}\end{array} \right] \in {R^{n \times K}}\). Then the conventional k-means algorithm can be performed on it to obtain the final clustering result.
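For illustration, this final step can be sketched with numpy and scikit-learn as follows (a minimal sketch; P_c, P1_bar, and P2_bar denote the already learned blocks of P, and the function name is ours, not from the original paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def pvc_cluster(P_c, P1_bar, P2_bar, n_clusters):
    """Stack the latent representations of paired and unpaired samples
    into an n x K matrix and run k-means on it, as PVC does."""
    P = np.vstack([P_c, P1_bar, P2_bar])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
    return labels
```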

3 The Proposed Method

For multi-view data, learning a common latent representation for all views is one of the most popular approaches in the field of multi-view clustering. However, learning a compact and discriminative common representation for incomplete multi-view data is a challenging task. In this section, a novel multi-view clustering framework shown in Fig. 1 is provided to address this issue, in which the local information of each view and the complementary information across different views are jointly integrated.

Fig. 1.

The description of IMC_GRMF. In this work, we suppose that only the \(n_c\) paired samples have features of all views.

3.1 Learning Model of the Proposed Method

In the past years, exploiting the local geometric structure of data has been proven to be an effective approach for representation learning, which not only improves the discriminability and compactness of the learned representation, but also avoids overfitting [3, 13, 16, 20, 22, 26, 27]. For example, in [13, 16], a nearest neighbor graph is introduced to constrain the new representation or basis for incomplete multi-view clustering. Although the purpose is achieved, the complexity is also increased because such approaches commonly introduce at least one extra tunable penalty parameter into the model. Since some basic models already have two or more tunable parameters, introducing any extra one greatly increases the burden of parameter selection. Therefore, the conventional graph embedding approaches are not a good choice to guide the representation learning. In this section, we propose a novel and simple approach to solve this challenge, in which the local information of each view is embedded into the learning model based on the following lemma [21].

Lemma 1:

For three samples \(\left\{ {{x_1},{x_2},{x_3}} \right\} \in {R^m}\), suppose \({x_1}\) and \({x_2}\) are nearest neighbors of each other and \({x_3}\) is not a nearest neighbor of \({x_1}\) or \({x_2}\). If there is a complete dictionary \(U \in {R^{k \times m}}\) that satisfies \({x_i} = {p_i}U\) (\(i \in \left\{ {1,2,3} \right\} \)), where \({p_i} \in {R^k}\) can be viewed as the reconstruction coefficient, then the reconstructed sample \({p_2}U\) (\({p_1}U\)) is still the nearest neighbor of the original sample \({x_1}\) (\({x_2}\)) and is still not a nearest neighbor of sample \({x_3}\).

The proof of Lemma 1 is very simple and is thus omitted here. From Lemma 1, we know that the reconstruction operation does not destroy the local geometric structure of the original data. Motivated by this observation, we design the following objective function to exploit the local information of data for common representation learning:

$$\begin{aligned} \begin{array}{r} \mathop {\min }\limits _{{P^{\left( k \right) }},{U^{\left( k \right) }}} \sum \limits _{k = 1}^v {\sum \limits _{j = 1}^{{n_c} + {n_k}} {\sum \limits _{i = 1}^{{n_c} + {n_k}} {\left\| {x_i^{(k)} - p_j^{(k)}{U^{(k)}}} \right\| _2^2w_{i,j}^{(k)}} } } \mathrm{{ + }}{\lambda _2}\sum \limits _{k = 1}^v {{{\left\| {{P^{(k)}}} \right\| }_1}} \\ s.t.{U^{(k)}}{{{U^{(k)T}}}} = I, \end{array} \end{aligned}$$
(2)

where \({\lambda _2}\) is a penalty parameter, \(p_j^{\left( k \right) }\) is the new representation of the jth sample in the kth view, and \(w_{i,j}^{(k)}\) is a binary weight which is simply pre-defined as follows:

$$\begin{aligned} w_{i,j}^{(k)} = \left\{ {\begin{array}{*{20}{c}} {1,}&{}{if{} {} x_i^{\left( k \right) } \in \varPhi \left( {x_j^{\left( k \right) }} \right) {} {} {} or{} {} x_j^{\left( k \right) } \in \varPhi \left( {x_i^{\left( k \right) }} \right) }\\ {0,}&{}{otherwise}, \end{array}} \right. \end{aligned}$$
(3)

where \(\varPhi \left( {x_j^{\left( k \right) }} \right) \) denotes the set of nearest neighbors of sample \(x_j^{\left( k \right) }\).
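For illustration, the binary weight matrix of Eq. (3) can be built with scikit-learn's nearest neighbor search as follows (a sketch under the assumption that the rows of X are the samples of one view; the function name is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def binary_knn_graph(X, n_neighbors):
    """Symmetric binary weight matrix of Eq. (3): w_ij = 1 if x_i is among
    the nearest neighbors of x_j or vice versa, and 0 otherwise."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)  # +1: query itself
    _, idx = nn.kneighbors(X)
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i, 1:]] = 1          # skip the sample itself
    return np.maximum(W, W.T)         # symmetrize (the "or" in Eq. (3))
```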

By introducing the binary weights to regularize the data reconstruction, the locality structure of the original data in each view can be well preserved. Meanwhile, from (2) we can see that the proposed method does not introduce any extra regularization term or tunable parameter to preserve this locality property, which greatly reduces the complexity of penalty parameter selection in comparison with other graph regularized IMC methods, such as DCNMF [13] and GPMVC [16], which commonly introduce at least one extra tunable penalty parameter for this purpose. For the paired samples across different views, the new representations should be consistent. To this end, we further add a regularization term based on the paired information of different views as follows:

$$\begin{aligned} \begin{array}{r} \mathop {\min }\limits _{{P^{\left( k \right) }},{P^c},{U^{\left( k \right) }}} \sum \limits _{k = 1}^v {\sum \limits _{j = 1}^{{n_c} + {n_k}} {\sum \limits _{i = 1}^{{n_c} + {n_k}} {\left\| {x_i^{(k)} - p_j^{(k)}{U^{(k)}}} \right\| _2^2w_{i,j}^{(k)}} } } \\ \mathrm{{ + }}{\lambda _1}\sum \limits _{k = 1}^v {\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2} + {\lambda _2}\sum \limits _{k = 1}^v {{{\left\| {{P^{(k)}}} \right\| }_1}} \\ s.t.{U^{(k)}}{{{U^{(k)T}}}} = I, \end{array} \end{aligned}$$
(4)

where \({\lambda _1}\) is a penalty parameter. \({P^c} \in {R^{{n_c} \times K}}\) is the common latent representation for the paired samples of different views. \({G^{\left( k \right) }} \in {R^{{n_c} \times \left( {{n_c} + {n_k}} \right) }}\) can be viewed as an index matrix used to remove the unpaired representation \({\bar{P}^{\left( k \right) }}\) from \({P^{\left( k \right) }} = \left[ \begin{array}{l}P_c^{\left( k \right) }\\ {{\bar{P}}^{\left( k \right) }}\end{array} \right] \). Since the first \(n_c\) samples of each view are regarded as the paired samples, matrix \({G^{\left( k \right) }}\) can be simply defined as follows:

$$\begin{aligned} G_{i,j}^{\left( k \right) } = \left\{ {\begin{array}{*{20}{c}} {1,}&{}{if{} {} i = j}\\ {0,}&{}{otherwise}. \end{array}} \right. \end{aligned}$$
(5)
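Since the paired samples occupy the first \(n_c\) rows of each view, \(G^{(k)}\) is simply a row selector; a one-line numpy sketch (the function name is ours) is:

```python
import numpy as np

def index_matrix(n_c, n_k):
    """G^(k) in Eq. (5): an n_c x (n_c + n_k) selector that keeps the
    representations of the n_c paired samples and drops the unpaired ones."""
    return np.hstack([np.eye(n_c), np.zeros((n_c, n_k))])
```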

For model (4), \(P = [P^{cT},{\bar{P}}^{( 1)T}, \ldots ,{\bar{P}}^{( v)T}]^T\) can be viewed as the new representation of all samples. After obtaining the new representations, we use the k-means algorithm to partition the samples into their respective groups. Several good properties of the proposed model (4) are summarized as follows.

Remark 1:

The proposed method is not only a clustering algorithm but also an unsupervised classification algorithm because it can handle out-of-sample data. In essence, for any sample \(x_i^{(k)}\) in the kth view, its new representation is obtained by the matrix factorization \(x_i^{(k)}=p_i^{(k)}U^{(k)}\), which is equivalent to \(x_i^{(k)}{U^{\left( k \right) T}} = p_i^{(k)}\) since \({U^{(k)}}{{{U^{(k)T}}}} = I\). Therefore, once the basis matrix \({U^{\left( k \right) }}\) is obtained, we can first compute the discriminative representation of any newly arrived sample \({y^{\left( k \right) }}\) by projecting it onto the basis matrix as \(p_y^{(k)} = {{{y}}^{\left( k \right) }}{U^{\left( k \right) T}}\), and then use a conventional unsupervised classification method such as the k-nearest neighbor classifier to predict its label.
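For illustration, this out-of-sample extension can be sketched as follows (hypothetical helpers assuming numpy arrays; P denotes the learned training representations and labels their cluster assignments):

```python
import numpy as np

def embed_new_sample(y, U_k):
    """Project an unseen row vector y of view k onto the learned basis:
    p_y = y U^(k)T, as described in Remark 1."""
    return y @ U_k.T

def knn_label(p_y, P, labels, k=1):
    """Assign the majority label of the k nearest training representations."""
    d = np.linalg.norm(P - p_y, axis=1)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(labels[nearest], return_counts=True)
    return vals[np.argmax(counts)]
```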

Remark 2:

The proposed model (4) is a unified multi-view learning framework, which can be applied to the incomplete and complete cases by defining different index matrices \({G^{(k)}}\).

Remark 3:

The proposed method simultaneously exploits the local information of each view and the complementary information across different views, which is beneficial for learning a more compact and discriminative representation for clustering, and thus has the potential to perform better. Moreover, embedding the local information into the model can avoid overfitting when handling new samples.

Remark 4:

Most importantly, we do not introduce any extra regularization term to preserve the local geometric structure of data. In other words, compared with the conventional graph embedding methods, the proposed method does not increase the burden of parameter tuning.

Remark 5:

The proposed method has the potential to recover the missing views. Specifically, for a sample \({x^{\left( k \right) }}\) that has only the kth view, once its new representation \({p_{{x^{\left( k \right) }}}}\) is obtained via the proposed method, we can recover its missing fth view via \({x^{\left( f \right) }} = {p_{{x^{\left( k \right) }}}}{U^{\left( f \right) }}\).
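A corresponding one-line sketch of this recovery step (the names are ours; p_x is the learned latent row vector and U_f the learned basis of the missing view, both numpy arrays):

```python
def recover_view(p_x, U_f):
    """Remark 5: reconstruct the missing f-th view as x^(f) = p_x U^(f)."""
    return p_x @ U_f
```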

3.2 Solution to IMC_GRMF

We can rewrite the first term of (4) into the following equivalent form:

$$\begin{aligned}&\sum \limits _{k = 1}^v {\sum \limits _{j = 1}^{{n_c} + {n_k}} {\sum \limits _{i = 1}^{{n_c} + {n_k}} {\left\| {x_i^{(k)} - p_j^{(k)}{U^{(k)}}} \right\| _2^2w_{i,j}^{(k)}} } } \nonumber \\ =&\sum \limits _{k = 1}^v {\left( \begin{array}{l} Tr\left( {{X^{\left( k \right) T}}{D^{\left( k \right) }}{X^{\left( k \right) }}} \right) + Tr\left( {{U^{\left( k \right) T}}{P^{\left( k \right) T}}{D^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) \\ - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) \end{array} \right) }, \end{aligned}$$
(6)

where \({D^{\left( k \right) }}\) is a diagonal matrix with diagonal elements \(D_{i,i}^{\left( k \right) } = \sum \limits _{j = 1}^{{n_c} + {n_k}} {W_{i,j}^{\left( k \right) }} \). Considering that the first term of (6) is constant and that \({U^{(k)}}{{{U^{(k)T}}}} = I\), we can simplify (4) as follows according to (6):

$$\begin{aligned} {\begin{matrix} L\left( {{P^{\left( k \right) }},{P^c},{U^{\left( k \right) }}} \right) = {\lambda _1}\sum \limits _{k = 1}^v {\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2} + {\lambda _2}\sum \limits _{k = 1}^v {{{\left\| {{P^{(k)}}} \right\| }_1}} \\ + \sum \limits _{k = 1}^v {\left( {Tr\left( {{P^{\left( k \right) T}}{D^{\left( k \right) }}{P^{\left( k \right) }}} \right) - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) } \right) }. \end{matrix}} \end{aligned}$$
(7)

Then all variables can be updated alternately as follows.

Step 1: Calculate \({U^{\left( k \right) }}\). The basis matrix \({U^{\left( k \right) }}\) for each view can be calculated by optimizing the following problem:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{U^{\left( k \right) }}{U^{\left( k \right) T}} = I} - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) . \end{matrix}} \end{aligned}$$
(8)

Then we can obtain the optimal solution of \({U^{\left( k \right) }}\) as [19, 31]:

$$\begin{aligned} {U^{(k)}} = {J^{(k)}}{B^{(k)T}}, \end{aligned}$$
(9)

where \({J^{(k)}}\) and \({B^{(k)}}\) are the right and left singular matrices of \(({X^{(k)T}}{W^{(k)}}{P^{(k)}})\), i.e., \({X^{(k)T}}{W^{(k)}}{P^{(k)}} = {B^{(k)}}{\Sigma ^{(k)}}{J^{(k)T}}\).
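As an illustration, a minimal numpy sketch of this update (the function and variable names are ours, not from the paper) could be:

```python
import numpy as np

def update_U(X_k, W_k, P_k):
    """Step 1: U^(k) = J^(k) B^(k)T, where B^(k) and J^(k) are the left and
    right singular matrices of X^(k)T W^(k) P^(k) (Eq. (9))."""
    S = X_k.T @ W_k @ P_k                          # m_k x K
    B, _, Jt = np.linalg.svd(S, full_matrices=False)
    return Jt.T @ B.T                              # K x m_k, satisfies U U^T = I
```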

Step 2: Calculate \({P^{\left( k \right) }}\). Fixing the other variables, variable \({P^{\left( k \right) }}\) can be calculated by minimizing the following problem:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{P^{\left( k \right) }}} {\lambda _1}\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2 + {\lambda _2}{\left\| {{P^{(k)}}} \right\| _1}\\ + Tr\left( {{P^{\left( k \right) T}}{D^{\left( k \right) }}{P^{\left( k \right) }}} \right) - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) . \end{matrix}} \end{aligned}$$
(10)

Define \({A^{\left( k \right) }} = {U^{\left( k \right) }}{X^{\left( k \right) T}}{W^{\left( k \right) }} + {\lambda _1}{P^{cT}}{G^{\left( k \right) }}\), \({M^{\left( k \right) }} = {D^{\left( k \right) }} + {\lambda _1}{G^{\left( k \right) T}}{G^{\left( k \right) }}\). Obviously, \({M^{\left( k \right) }}\) is still a diagonal matrix with all diagonal elements \(M_{i,i}^{\left( k \right) } > 0\). Thus, (10) can be rewritten into the following equivalent problem:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{P^{\left( k \right) }}} \left\| {{{\left( {{M^{\left( k \right) }}} \right) }^{\frac{1}{2}}}{P^{\left( k \right) }} - {{\left( {{A^{\left( k \right) }}{{\left( {{M^{\left( k \right) }}} \right) }^{\mathrm{{ - }}\frac{1}{2}}}} \right) }^T}} \right\| _F^2 + {\lambda _2}{\left\| {{P^{\left( k \right) }}} \right\| _1}. \end{matrix}} \end{aligned}$$
(11)

Define \({C^{\left( k \right) }} = ({A^{\left( k \right) }}{\left( {{M^{\left( k \right) }}} \right) ^{\mathrm{{ - }}\frac{1}{2}}})^T\); then problem (11) can be rewritten as follows:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{P^{\left( k \right) }}} \sum \limits _{i = 1}^{{n_c} + {n_k}} {\left( {\left\| {{{\left( {M_{i,i}^{\left( k \right) }} \right) }^{\frac{1}{2}}}P_{i,:}^{\left( k \right) } - C_{i,:}^{\left( k \right) }} \right\| _2^2 + {\lambda _2}{{\left\| {P_{i,:}^{\left( k \right) }} \right\| }_1}} \right) }, \end{matrix}} \end{aligned}$$
(12)

where \(P_{i,:}^{\left( k \right) }\) and \(C_{i,:}^{\left( k \right) }\) denote the ith rows of matrices \({P^{\left( k \right) }}\) and \({C^{\left( k \right) }}\), respectively. The solution of problem (12) can be computed independently for each row by the conventional shrinkage operation as follows [19]:

$$\begin{aligned} P_{i,:}^{\left( k \right) } = {\varTheta _{\frac{{{\lambda _2}}}{{2M_{i,i}^{\left( k \right) }}}}}\left( {C_{i,:}^{\left( k \right) }/{{\left( {M_{i,i}^{\left( k \right) }} \right) }^{\frac{1}{2}}}} \right), \end{aligned}$$
(13)

where \({\varTheta _\tau }\) denotes the shrinkage (soft-thresholding) operator, i.e., \({\varTheta _\tau }\left( z \right) = \mathrm{sign}\left( z \right) \max \left( {\left| z \right| - \tau ,0} \right) \), applied element-wise.
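For illustration, a small numpy sketch of this row-wise soft-thresholding update (the function and variable names are ours; D_k, G_k, and P_c are assumed to be numpy arrays built as defined above) might look as follows:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Element-wise shrinkage operator: sign(z) * max(|z| - tau, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def update_P(X_k, W_k, U_k, G_k, P_c, D_k, lam1, lam2):
    """Step 2: closed-form update of P^(k) following Eqs. (10)-(13)."""
    A = U_k @ X_k.T @ W_k + lam1 * P_c.T @ G_k        # A^(k), K x (n_c + n_k)
    m = np.diag(D_k) + lam1 * np.diag(G_k.T @ G_k)    # diagonal of M^(k)
    C = (A / np.sqrt(m)).T                            # C^(k) = (A M^{-1/2})^T
    tau = lam2 / (2.0 * m)                            # row-wise thresholds
    return soft_threshold(C / np.sqrt(m)[:, None], tau[:, None])
```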

Step 3: Calculate \({P^c}\). Fixing the other variables, the common latent representation \({P^c}\) can be calculated by solving the following minimization problem:

$$\begin{aligned} \mathop {\min }\limits _{{P^c}} \sum \limits _{k = 1}^v {\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2}. \end{aligned}$$
(14)

Problem (14) has the following closed form solution:

$$\begin{aligned} {P^c} = {\sum \limits _{k = 1}^v {{G^{\left( k \right) }}{P^{\left( k \right) }}}}/v. \end{aligned}$$
(15)

Algorithm 1 summarizes the computing procedures of IMC_GRMF.

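For illustration, the overall alternating procedure could be sketched as follows (a simplified sketch, not the authors' reference implementation; it reuses the hypothetical helpers binary_knn_graph, index_matrix, update_U, and update_P sketched above, and initializes the representations randomly):

```python
import numpy as np

def imc_grmf(Xs, n_c, n_clusters, lam1, lam2, n_neighbors=10, n_iter=50):
    """Sketch of Algorithm 1: alternate Steps 1-3 for a fixed number of
    iterations. Xs[k] stacks the n_c paired samples on top of the n_k
    view-specific samples of view k."""
    v = len(Xs)
    Ws = [binary_knn_graph(X, n_neighbors) for X in Xs]        # Eq. (3)
    Ds = [np.diag(W.sum(axis=1)) for W in Ws]
    Gs = [index_matrix(n_c, X.shape[0] - n_c) for X in Xs]     # Eq. (5)
    K = n_clusters
    Ps = [np.random.rand(X.shape[0], K) for X in Xs]
    P_c = np.random.rand(n_c, K)
    for _ in range(n_iter):
        Us = [update_U(Xs[k], Ws[k], Ps[k]) for k in range(v)]             # Step 1
        Ps = [update_P(Xs[k], Ws[k], Us[k], Gs[k], P_c, Ds[k], lam1, lam2)
              for k in range(v)]                                           # Step 2
        P_c = sum(Gs[k] @ Ps[k] for k in range(v)) / v                     # Step 3, Eq. (15)
    # final representation: common part plus the unpaired parts of each view
    P = np.vstack([P_c] + [Ps[k][n_c:] for k in range(v)])
    return P  # cluster P with k-means afterwards
```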

3.3 Computational Complexity and Convergence Property

For Algorithm 1, the dominant computational cost is the singular value decomposition (SVD) in Step 1; the costs of matrix multiplications and additions are ignored since they are far smaller than that of the SVD. Thus, we only take into account the computational complexity of Step 1. Generally, the computational complexity of the SVD of an \(m \times n\) matrix is \(O({mn^2})\) [11]. Therefore, the computational complexity of Step 1 is about \(O\left( {vm{K^2}} \right) \), where v is the number of views and K is the reduced dimension (the number of clusters). Therefore, the computational complexity of the proposed method listed in Algorithm 1 is about \(O\left( {\tau vm{K^2}} \right) \), where \(\tau \) is the number of iterations.

From the above derivations, it is easy to see that the optimization problem (7) is convex with respect to each of the variables \({P^{\left( k \right) }}\), \({P^c}\), and \({U^{\left( k \right) }}\) when the others are fixed. Then we have the following Theorem 1.

Theorem 1:

The objective function value of problem (4) is monotonically decreasing during the iteration.

Proof

Let \(\varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_t^{\left( k \right) }} \right) \) denote the objective function value at the tth iteration. Since all sub-problems with respect to the variables \({P^{\left( k \right) }}\), \({P^c}\), and \({U^{\left( k \right) }}\) are convex and have closed-form solutions, the following inequalities hold:

$$\begin{aligned} \begin{array}{l} \varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_t^{\left( k \right) }} \right) \ge \varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_{t + 1}^{\left( k \right) }} \right) \\ \ge \varUpsilon \left( {P_{t + 1}^{\left( k \right) },P_t^c,U_{t + 1}^{\left( k \right) }} \right) \ge \varUpsilon \left( {P_{t + 1}^{\left( k \right) },P_{t + 1}^c,U_{t + 1}^{\left( k \right) }} \right) . \end{array} \end{aligned}$$
(16)

These inequalities show that the objective function value of problem (4) is monotonically decreasing during the iterations. Thus we complete the proof.

Meanwhile, problem (4) is lower bounded because its objective satisfies \(\varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_t^{\left( k \right) }} \right) \ge 0\); thus, Theorem 1 guarantees that the proposed method finally converges to a local optimal solution after a number of iterations.

4 Experiments and Analysis

4.1 Experimental Settings

Dataset: (1) Handwritten digit dataset [2]: The used handwritten digit dataset is composed of 2000 samples from 10 digits (0–9). Each sample is represented by two views: one is a feature vector with 240 features obtained by averaging pixels in \(2 \times 3\) windows, and the other is a Fourier coefficient vector with 76 features. (2) BUAA-visnir face dataset (BUAA) [7]: Following the experimental settings in [28], we evaluate different methods on the first 10 subjects with 90 visual images and 90 near-infrared images. Each image was resized to \(10 \times 10\) and then vectorized. (3) Cornell dataset [1, 6]: This dataset contains 195 webpages collected from Cornell University. The webpages are partitioned into five classes and each webpage is represented by two views, i.e., the content view and the citation view. (4) Caltech101 dataset [4]: The original Caltech101 dataset contains 8677 images from 101 object categories. In the experiments, a subset named Caltech7 [10], composed of 1474 images from 7 classes, is used to compare different methods. Two popular types of features, i.e., GIST and LBP, are extracted from each image as the two views. The datasets used are briefly summarized in Table 1.

Evaluation: Three well-known metrics, i.e., clustering accuracy (ACC), normalized mutual information (NMI), and purity, are chosen to evaluate the performance of different methods [2]. For the above datasets, we randomly select 10%, 30%, 50%, 70%, and 90% of the samples as paired samples with all views and treat the remaining samples as incomplete samples, in which half of the samples have only one of the two views. All methods are run 5 times and the average values (%) are reported for comparison.
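For reference, ACC is commonly computed by finding the best one-to-one mapping between predicted clusters and ground-truth classes with the Hungarian algorithm, while NMI is available in standard libraries; a minimal sketch (assuming integer labels starting from 0; not the authors' evaluation code) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between cluster labels and classes,
    found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # maximize the number of matches
    return cost[row, col].sum() / y_true.size

# NMI comes directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```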

Compared Methods: Following the experimental settings in [17, 28], we compare the proposed method with the following baselines. (1) BSV (Best Single View): BSV first fills in the missing views with the average of the samples in the corresponding view and then performs k-means on each view separately; the best clustering result of the two views is reported. (2) Concat: It first fills in all missing views with the average of the samples of the corresponding view, then concatenates all views of each sample into one feature vector, and finally performs k-means to obtain the clustering result. (3) PVC [30]: PVC uses the non-negative matrix factorization technique to learn a common latent representation for incomplete multi-view clustering. (4) IMG [28]: IMG extends PVC by embedding an adaptively learned Laplacian graph. (5) Double constrained NMF (DCNMF) [13]: DCNMF is an extension of PVC, which further introduces a Laplacian graph regularizer into PVC. (6) Graph regularized partial multi-view clustering (GPMVC) [16]: GPMVC can be viewed as an improvement of DCNMF, which exploits a scale normalization technique in the consensus representation learning term. The code of the proposed method is available at: http://www.yongxu.org/lunwen.html.

Table 1. Description of the used benchmark datasets.
Table 2. ACCs/NMIs (%) of different methods on the handwritten digit dataset.
Table 3. ACCs/NMIs (%) of different methods on the BUAA dataset.
Table 4. ACCs/NMIs (%) of different methods on the Cornell dataset.
Table 5. ACCs/NMIs (%) of different methods on the Caltech7 dataset.
Fig. 2.

Purity (%) of different methods on the above four datasets.

4.2 Experimental Results and Analyses

The clustering results of different methods on the above four datasets are reported in Tables 2, 3, 4, 5 and Fig. 2. It is obvious that the proposed method significantly improves the ACC, NMI, and purity. In particular, the proposed method achieves an ACC nearly 8% higher than those of the related methods on the BUAA dataset. This good performance strongly validates the effectiveness of the proposed method in handling IMC tasks. Besides, we can draw the following observations from the experimental results.

(1) Generally, as the ratio of missing views decreases, the clustering performance of all methods improves obviously. This proves that the complementary information of different views is very useful in multi-view learning.

(2) In most cases, BSV and Concat perform much worse than the other methods. This proves that filling in the missing views with the average of the samples of the corresponding view is not a good approach.

(3) DCNMF, GPMVC, and the proposed method perform better than PVC in most cases. Compared with PVC, these methods all exploit the local geometric structure of each view to guide the representation learning. Thus, the experimental results prove that the local information of each view is very useful and beneficial for learning a more compact and discriminative representation. Meanwhile, our method achieves better performance than DCNMF and GPMVC, which further proves the effectiveness of the proposed novel graph regularization term.

4.3 Parameter Analysis

Figure 3 shows the ACC versus the parameters \({\lambda _1}\) and \({\lambda _2}\) on the handwritten digit and BUAA datasets with 70% paired samples. It is obvious that the ACC of the proposed method is relatively stable in some local areas, which indicates that the proposed method is insensitive to the selection of parameters to some extent. Moreover, when the two parameters are selected from the candidate ranges (\([ {{{10}^0},{{10}^2}}]\), \([ {{{10}^{\mathrm{{ - 5}}}},{{10}^{\mathrm{{ - 1}}}}}]\)), the proposed method achieves satisfactory performance. This indicates that a relatively larger value of parameter \(\lambda _1\) encourages better performance. In our work, we use grid search to find the optimal combination of the two parameters from the two-dimensional grid formed by (\([ {{{10}^0},{{10}^2}}]\), \([ {{{10}^{\mathrm{{ - 5}}}},{{10}^{\mathrm{{ - 1}}}}}]\)) [24].
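A simple sketch of this grid search (reusing the hypothetical imc_grmf and clustering_accuracy functions sketched above; the candidate values below are only one possible discretization of the stated ranges) is:

```python
from sklearn.cluster import KMeans

def grid_search(Xs, n_c, n_clusters, y_true):
    """Pick (lam1, lam2) from the candidate grid by the resulting ACC."""
    best = (None, -1.0)
    for lam1 in [1e0, 1e1, 1e2]:
        for lam2 in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
            P = imc_grmf(Xs, n_c, n_clusters, lam1, lam2)
            y_pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
            acc = clustering_accuracy(y_true, y_pred)
            if acc > best[1]:
                best = ((lam1, lam2), acc)
    return best
```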

Figure 4 plots the relationship between ACC and the number of nearest neighbors of the proposed method on the handwritten digit and BUAA datasets. From the figures, we draw the following conclusions: (1) The clustering performance is insensitive to the selection of the nearest neighbor number to some extent when it lies in a proper range, such as \(\left[ {8,18} \right] \) for the handwritten digit dataset and \(\left[ {2,6} \right] \) for the BUAA dataset. (2) Generally, the number of nearest neighbors should be less than the number of samples in each class. For example, from Fig. 4(b), we can see that when the number of nearest neighbors is larger than the number of samples per class, i.e., \(N > 10\), the ACC decreases dramatically. However, in real-world applications, it is impossible to obtain the true number of samples per class. In this work, we use the following criterion to select the number of nearest neighbors. Suppose we try to partition the available multi-view data with n samples into c groups, and let \(m = n/c\) denote the average number of samples per group. If \(m \gg 10\), then we empirically select 10 as the number of nearest neighbors; otherwise we select \(min(m - 4,2)\) as the nearest neighbor number.

Fig. 3.

ACC (%) versus parameters \({\lambda _\mathrm{{1}}}\) and \({\lambda _\mathrm{{2}}}\) of the proposed method on (a) handwritten digit and (b) BUAA datasets with 70% paired samples.

Fig. 4.

ACC (%) versus the number of nearest neighbors of our method on (a) handwritten digit and (b) BUAA datasets with 50% and 70% paired samples.

4.4 Experimental Convergence Study

Figure 5 shows the objective function value and ACC at each iteration on the handwritten digit and BUAA datasets with 70% paired samples. From the figures, it is clear that the objective function value decreases dramatically in the first few iterations (within 20 iterations). The experimental results plotted in the two figures confirm the good convergence property of our method.

Fig. 5.

The objective function value and ACC (%) versus the iteration step of the proposed method on (a) handwritten digit and (b) BUAA datasets with 70% paired samples.

5 Conclusions

In this paper, we propose a novel framework for multi-view learning, which not only handles both incomplete and complete multi-view clustering, but is also able to deal with out-of-sample data. Moreover, the proposed method has the potential to complete the missing views of any sample. Besides, we provide a novel approach to exploit the local information of data without introducing any extra regularization term or penalty parameter, which does not increase the complexity or the computational burden. Extensive experimental results prove the effectiveness of the proposed method.