
1 Introduction

Multi-view clustering has achieved great development and has been successfully applied in many applications, such as image retrieval [9], webpage classification [1, 25], and speech recognition [12]. Recently, many methods have been proposed, such as multi-view k-means clustering [2], multi-view spectral clustering via bipartite graph [10], and co-regularized multi-view spectral clustering [8]. Compared with single-view clustering, multi-view clustering can exploit the complementary information among multiple views and thus has the potential to achieve better performance [29].

Conventional multi-view clustering methods commonly require that every sample has all of the views. However, in real-world applications it often happens that some views are missing for part of the samples [18]. For example, blood test results and magnetic resonance images can be regarded as two necessary views for diagnosing a disease, yet we often have only one of the two views for some individuals because they took only one of the two tests. In this case, the conventional methods fail. In this paper, we refer to the clustering task with incomplete views as incomplete multi-view clustering (IMC).

For IMC, a few methods have been proposed, which can be roughly categorized into two groups. The first group is based on completing the incomplete views. For example, Trivedi et al. proposed a kernel CCA based method, which tries to recover the kernel matrix of the incomplete view and then learns two projections for the two views, respectively [18]. However, it requires at least one complete view for reference; in other words, it is not applicable to the case where all views are incomplete. To address this issue, Gao et al. proposed a two-step approach, which first fills in the missing views with the corresponding view averages of all samples and then learns the common representation for the two views based on spectral graph theory [5]. The shortcoming of this approach is that it introduces useless or even noisy information into the data. For data with a small incomplete percentage, this approach may be effective; however, for data with a large incomplete percentage, it hampers the learning of the common representation since the useless information may dominate the representation learning [17]. The second group focuses on directly learning the common latent subspace or representation for all views, in which the most representative works are partial multi-view clustering (PVC) [30], multi-incomplete-view clustering (MIC) [17], and incomplete multi-modality grouping (IMG) [28]. Based on non-negative matrix factorization (NMF), PVC directly learns a common latent representation for two views by simply regularizing different views of the same sample to have the same representation [30]. MIC jointly learns the latent representation of each view and the consensus representation by utilizing the weighted NMF algorithm, in which the missing views are assigned a small weight or even a weight of 0 during learning [17]. IMG can be viewed as an extension of PVC, which further embeds an adaptively learned graph on the latent representation [28].

Although some methods have been proposed to address the IMC problem, several problems still exist that limit their performance. First, these methods all ignore the geometric structure of the data, which means that the intrinsic geometric structure may be destroyed in the representation space and lead to poor performance. The second shortcoming, especially for MIC and IMG, is that there are many penalty parameters (more than three) to be set. These tunable parameters directly influence the clustering performance and limit real applications because adaptively selecting the optimal parameters for different datasets is still an open problem. The third shortcoming is that none of these methods can handle the out-of-sample problem. In this paper, we propose a novel and simple IMC method, named incomplete multi-view clustering via graph regularized matrix factorization (IMC_GRMF), to solve the above problems and improve the performance. Similar to PVC, the matrix factorization technique is exploited to learn the common latent representation, in which the representations corresponding to samples with all views are regularized to be consistent. In addition, a nearest neighbor graph is neatly imposed on the reconstruction errors of the matrix factorization to exploit the local geometric structure of the data, which enables the method to learn a more compact and discriminative representation for clustering. Compared with the other methods, our approach does not introduce any extra regularization term and corresponding penalty parameter to preserve the locality structure of the data. Extensive experimental results prove the effectiveness of the proposed method for incomplete multi-view clustering.

2 Notations and Related Work

2.1 Notations

Let \({X^{( k )}} = {[ {X_c^{\left( k \right) T};{{\bar{X}}^{\left( k \right) T}}}]^T} \in {R^{( {{n_c} + {n_k}}) \times {m_k}}}\) be the kth view of the data, where each sample in the corresponding view is represented by a row vector with \({m_k}\) features and \({n_c}\) is the number of paired samples (i.e., samples without any missing views). \(x_i^{(k)}\) denotes the features of the kth view of the ith sample. We refer to the kth view as \(Vi\left( k \right) \). \({\bar{X}^{\left( k \right) }} \in {R^{{n_k} \times {m_k}}}\) contains the \({n_k}\) samples that only have the features of \(Vi\left( k \right) \), while their features in the other views are missing. The total number of samples is \(n = {n_c} + \sum \limits _{k = 1}^v {{n_k}} \). For a matrix \(A \in {R^{m \times n}}\), its Frobenius norm (\({l_F}\) norm) and \({l_1}\) norm are defined as \({\left\| A \right\| _F} = \sqrt{\sum \limits _{j = 1}^n {\sum \limits _{i = 1}^m {a_{i,j}^2} } } \) and \({\left\| A \right\| _1} = \sum \limits _{j = 1}^n {\sum \limits _{i = 1}^m {\left| {{a_{i,j}}} \right| } } \), respectively, where \({a_{i,j}}\) denotes the element in the ith row and jth column of matrix A [14, 23]. \(Tr\left( \cdot \right) \) is the trace operator. \({A^T}\) denotes the transpose of matrix A [15]. I is the identity matrix. \(A \ge 0\) means that all elements of matrix A are non-negative.

2.2 Partial Multi-View Clustering (PVC)

For data with two incomplete views, PVC seeks to learn a common latent subspace for the two views, in which different views of the same sample have the same representation [14]. The learning model of PVC is formulated as follows:

$$\begin{aligned} \begin{array}{r} \mathop {\min }\limits _{{P_c},{{\bar{P}}^{(1)}},{{\bar{P}}^{(2)}},{U^{(1)}},{U^{(2)}}} \left\| {\left[ \begin{array}{l} X_c^{(1)}\\ {{\bar{X}}^{(1)}} \end{array} \right] - \left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(1)}} \end{array} \right] {U^{(1)}}} \right\| _F^2 + \lambda {\left\| {\left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(1)}} \end{array} \right] } \right\| _1}\\ + \left\| {\left[ \begin{array}{l} X_c^{(2)}\\ {{\bar{X}}^{(2)}} \end{array} \right] - \left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(2)}} \end{array} \right] {U^{(2)}}} \right\| _F^2 + \lambda {\left\| {\left[ \begin{array}{l} {P_c}\\ {{\bar{P}}^{(2)}} \end{array} \right] } \right\| _1}\\ s.t.{} {} {U^{(1)}} \ge 0,{U^{(2)}} \ge 0,{P_c} \ge 0,{{\bar{P}}^{(1)}} \ge 0,{{\bar{P}}^{(2)}} \ge 0, \end{array} \end{aligned}$$
(1)

where \(\lambda \) is the penalty parameter. \({U^{(1)}} \in {R^{K \times {m_1}}}\) and \({U^{(2)}} \in {R^{K \times {m_2}}}\) are the latent space basis matrices for the two views, \({P_c} \in {R^{{n_c} \times K}}\), \({\bar{P}^{(1)}} \in {R^{{n_1} \times K}}\), and \({\bar{P}^{(2)}} \in {R^{{n_2} \times K}}\) are the latent representations of the original data, and K is the dimension of the latent space.

For PVC, the new representation of all samples can be expressed as \(P = \left[ \begin{array}{l}{P_c}\\ {{\bar{P}}^{(1)}}\\ {{\bar{P}}^{(2)}}\end{array} \right] \in {R^{n \times K}}\). Then the conventional k-means algorithm can be performed on it to obtain the final clustering result.
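For illustration, this final step can be sketched with numpy and scikit-learn as follows (a minimal sketch; P_c, P1_bar, and P2_bar denote the already learned blocks of P, and the function name is ours, not from the original paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def pvc_cluster(P_c, P1_bar, P2_bar, n_clusters):
    """Stack the latent representations of paired and unpaired samples
    into an n x K matrix and run k-means on it, as PVC does."""
    P = np.vstack([P_c, P1_bar, P2_bar])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
    return labels
```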

3 The Proposed Method

For multi-view data, learning a common latent representation for all views is one of the most popular approaches in the field of multi-view clustering. However, learning a compact and discriminative common representation for incomplete multi-view data is a challenging task. In this section, a novel multi-view clustering framework shown in Fig. 1 is provided to address this issue, in which the local information of each view and the complementary information across different views are jointly integrated.

Fig. 1.

The description of IMC_GRMF. In this work, we suppose that only the \(n_c\) paired samples have features of all views.

3.1 Learning Model of the Proposed Method

In the past years, exploiting the local geometric structure of data has been proven to be an effective approach for representation learning, which not only improves the discriminability and compactness of the learned representation, but also avoids overfitting [3, 13, 16, 20, 22, 26, 27]. For example, in [13, 16], a nearest neighbor graph is introduced to constrain the new representation or basis for incomplete multi-view clustering. Although the purpose is achieved, the complexity is also increased because such approaches commonly introduce at least one extra tunable penalty parameter into the model. Since some basic models already have two or more tunable parameters, introducing any extra one greatly increases the burden of parameter selection. Therefore, the conventional graph embedding approaches are not a good choice to guide the representation learning. In this section, we propose a novel and simple approach to solve this challenge, in which the local information of each view is embedded into the learning model based on the following lemma [21].

Lemma 1:

For three samples \(\left\{ {{x_1},{x_2},{x_3}} \right\} \in {R^m}\), suppose \({x_1}\) and \({x_2}\) are nearest neighbors of each other and \({x_3}\) is not a nearest neighbor of \({x_1}\) or \({x_2}\). If there is a complete dictionary \(U \in {R^{k \times m}}\) that satisfies \({x_i} = {p_i}U\) (\(i \in \left\{ {1,2,3} \right\} \)), where \({p_i} \in {R^k}\) can be viewed as the reconstruction coefficient, then the reconstructed sample \({p_2}U\) (\({p_1}U\)) is still the nearest neighbor of the original sample \({x_1}\) (\({x_2}\)) and is still not a nearest neighbor of sample \({x_3}\).

The proof of Lemma 1 is very simple and is thus omitted here. From Lemma 1, we know that the reconstruction operation does not destroy the local geometric structure of the original data. Motivated by this observation, we design the following objective function to exploit the local information of data for common representation learning:

$$\begin{aligned} \begin{array}{r} \mathop {\min }\limits _{{P^{\left( k \right) }},{U^{\left( k \right) }}} \sum \limits _{k = 1}^v {\sum \limits _{j = 1}^{{n_c} + {n_k}} {\sum \limits _{i = 1}^{{n_c} + {n_k}} {\left\| {x_i^{(k)} - p_j^{(k)}{U^{(k)}}} \right\| _2^2w_{i,j}^{(k)}} } } \mathrm{{ + }}{\lambda _2}\sum \limits _{k = 1}^v {{{\left\| {{P^{(k)}}} \right\| }_1}} \\ s.t.{U^{(k)}}{{{U^{(k)T}}}} = I, \end{array} \end{aligned}$$
(2)

where \({\lambda _2}\) is a penalty parameter, \(p_j^{\left( k \right) }\) is the new representation of the jth sample in the kth view, and \(w_{i,j}^{(k)}\) is a binary weight which is simply pre-defined as follows:

$$\begin{aligned} w_{i,j}^{(k)} = \left\{ {\begin{array}{*{20}{c}} {1,}&{}{if{} {} x_i^{\left( k \right) } \in \varPhi \left( {x_j^{\left( k \right) }} \right) {} {} {} or{} {} x_j^{\left( k \right) } \in \varPhi \left( {x_i^{\left( k \right) }} \right) }\\ {0,}&{}{otherwise}, \end{array}} \right. \end{aligned}$$
(3)

where \(\varPhi \left( {x_j^{\left( k \right) }} \right) \) denotes the set of nearest neighbors of sample \(x_j^{\left( k \right) }\).
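For illustration, the binary weight matrix of Eq. (3) can be built with scikit-learn's nearest neighbor search as follows (a sketch under the assumption that the rows of X are the samples of one view; the function name is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def binary_knn_graph(X, n_neighbors):
    """Symmetric binary weight matrix of Eq. (3): w_ij = 1 if x_i is among
    the nearest neighbors of x_j or vice versa, and 0 otherwise."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)  # +1: query itself
    _, idx = nn.kneighbors(X)
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i, 1:]] = 1          # skip the sample itself
    return np.maximum(W, W.T)         # symmetrize (the "or" in Eq. (3))
```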

By introducing the binary weights to regularize the data reconstruction, the locality structure of the original data in each view can be well preserved. Meanwhile, from (2) we can see that the proposed method does not introduce any extra regularization term or tunable parameter to preserve this locality property, which greatly reduces the complexity of penalty parameter selection in comparison with other graph regularized IMC methods, such as DCNMF [13] and GPMVC [16], which commonly introduce at least one extra tunable penalty parameter for this purpose. For the paired samples across different views, the new representations should be consistent. To this end, we further add a regularization term based on the paired information of different views as follows:

$$\begin{aligned} \begin{array}{r} \mathop {\min }\limits _{{P^{\left( k \right) }},{P^c},{U^{\left( k \right) }}} \sum \limits _{k = 1}^v {\sum \limits _{j = 1}^{{n_c} + {n_k}} {\sum \limits _{i = 1}^{{n_c} + {n_k}} {\left\| {x_i^{(k)} - p_j^{(k)}{U^{(k)}}} \right\| _2^2w_{i,j}^{(k)}} } } \\ \mathrm{{ + }}{\lambda _1}\sum \limits _{k = 1}^v {\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2} + {\lambda _2}\sum \limits _{k = 1}^v {{{\left\| {{P^{(k)}}} \right\| }_1}} \\ s.t.{U^{(k)}}{{{U^{(k)T}}}} = I, \end{array} \end{aligned}$$
(4)

where \({\lambda _1}\) is a penalty parameter. \({P^c} \in {R^{{n_c} \times K}}\) is the common latent representation for the paired samples of different views. \({G^{\left( k \right) }} \in {R^{{n_c} \times \left( {{n_c} + {n_k}} \right) }}\) can be viewed as an index matrix used to remove the unpaired representation \({\bar{P}^{\left( k \right) }}\) from \({P^{\left( k \right) }} = \left[ \begin{array}{l}P_c^{\left( k \right) }\\ {{\bar{P}}^{\left( k \right) }}\end{array} \right] \). Since the first \(n_c\) samples of each view are regarded as the paired samples, matrix \({G^{\left( k \right) }}\) can be simply defined as follows:

$$\begin{aligned} G_{i,j}^{\left( k \right) } = \left\{ {\begin{array}{*{20}{c}} {1,}&{}{if{} {} i = j}\\ {0,}&{}{otherwise}. \end{array}} \right. \end{aligned}$$
(5)
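Since the paired samples occupy the first \(n_c\) rows of each view, \(G^{(k)}\) is simply a row selector; a one-line numpy sketch (the function name is ours) is:

```python
import numpy as np

def index_matrix(n_c, n_k):
    """G^(k) in Eq. (5): an n_c x (n_c + n_k) selector that keeps the
    representations of the n_c paired samples and drops the unpaired ones."""
    return np.hstack([np.eye(n_c), np.zeros((n_c, n_k))])
```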

For model (4), \(P = [P^{cT},{\bar{P}}^{( 1)T}, \ldots ,{\bar{P}}^{( v)T}]^T\) can be viewed as the new representation of all samples. After obtaining the new representations, we use the k-means algorithm to partition the samples into their respective groups. Several good properties of the proposed model (4) are summarized as follows.

Remark 1:

The proposed method is not only a clustering algorithm but also an unsupervised classification algorithm because it can handle out-of-sample data. In essence, for any sample \(x_i^{(k)}\) in the kth view, its new representation is obtained by the matrix factorization \(x_i^{(k)}=p_i^{(k)}U^{(k)}\), which is equivalent to \(x_i^{(k)}{U^{\left( k \right) T}} = p_i^{(k)}\) since \({U^{(k)}}{{{U^{(k)T}}}} = I\). Therefore, once the basis matrix \({U^{\left( k \right) }}\) is obtained, we can first compute the discriminative representation of any newly arrived sample \({y^{\left( k \right) }}\) by projecting it onto the basis matrix as \(p_y^{(k)} = {{{y}}^{\left( k \right) }}{U^{\left( k \right) T}}\), and then use a conventional unsupervised classification method such as the k-nearest neighbor classifier to predict its label.
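For illustration, this out-of-sample extension can be sketched as follows (hypothetical helpers assuming numpy arrays; P denotes the learned training representations and labels their cluster assignments):

```python
import numpy as np

def embed_new_sample(y, U_k):
    """Project an unseen row vector y of view k onto the learned basis:
    p_y = y U^(k)T, as described in Remark 1."""
    return y @ U_k.T

def knn_label(p_y, P, labels, k=1):
    """Assign the majority label of the k nearest training representations."""
    d = np.linalg.norm(P - p_y, axis=1)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(labels[nearest], return_counts=True)
    return vals[np.argmax(counts)]
```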

Remark 2:

The proposed model (4) is a unified multi-view learning framework, which can be applied to the incomplete and complete cases by defining different index matrices \({G^{(k)}}\).

Remark 3:

The proposed method simultaneously exploits the local information of each view and the complementary information across different views, which is beneficial for learning a more compact and discriminative representation for clustering, and thus has the potential to perform better. Moreover, embedding the local information into the model can avoid overfitting when handling new samples.

Remark 4:

Most importantly, we do not introduce any extra regularization term to preserve the local geometric structure of data. In other words, compared with the conventional graph embedding methods, the proposed method does not increase the burden of parameter tuning.

Remark 5:

The proposed method has the potential to recover the missing views. Specifically, for a sample \({x^{\left( k \right) }}\) that has only the kth view, once its new representation \({p_{{x^{\left( k \right) }}}}\) is obtained via the proposed method, we can recover its missing fth view via \({x^{\left( f \right) }} = {p_{{x^{\left( k \right) }}}}{U^{\left( f \right) }}\).
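A corresponding one-line sketch of this recovery step (the names are ours; p_x is the learned latent row vector and U_f the learned basis of the missing view, both numpy arrays):

```python
def recover_view(p_x, U_f):
    """Remark 5: reconstruct the missing f-th view as x^(f) = p_x U^(f)."""
    return p_x @ U_f
```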

3.2 Solution to IMC_GRMF

We can rewrite the first term of (4) into the following equivalent form:

$$\begin{aligned}&\sum \limits _{k = 1}^v {\sum \limits _{j = 1}^{{n_c} + {n_k}} {\sum \limits _{i = 1}^{{n_c} + {n_k}} {\left\| {x_i^{(k)} - p_j^{(k)}{U^{(k)}}} \right\| _2^2w_{i,j}^{(k)}} } } \nonumber \\ =&\sum \limits _{k = 1}^v {\left( \begin{array}{l} Tr\left( {{X^{\left( k \right) T}}{D^{\left( k \right) }}{X^{\left( k \right) }}} \right) + Tr\left( {{U^{\left( k \right) T}}{P^{\left( k \right) T}}{D^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) \\ - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) \end{array} \right) }, \end{aligned}$$
(6)

where \({D^{\left( k \right) }}\) is a diagonal matrix with diagonal elements \(D_{i,i}^{\left( k \right) } = \sum \limits _{j = 1}^{{n_c} + {n_k}} {W_{i,j}^{\left( k \right) }} \). Considering that the first term of (6) is constant and that \({U^{(k)}}{{{U^{(k)T}}}} = I\), we can simplify (4) as follows according to (6):

$$\begin{aligned} {\begin{matrix} L\left( {{P^{\left( k \right) }},{P^c},{U^{\left( k \right) }}} \right) = {\lambda _1}\sum \limits _{k = 1}^v {\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2} + {\lambda _2}\sum \limits _{k = 1}^v {{{\left\| {{P^{(k)}}} \right\| }_1}} \\ + \sum \limits _{k = 1}^v {\left( {Tr\left( {{P^{\left( k \right) T}}{D^{\left( k \right) }}{P^{\left( k \right) }}} \right) - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) } \right) }. \end{matrix}} \end{aligned}$$
(7)

Then all variables can be updated alternately as follows.

Step 1: Calculate \({U^{\left( k \right) }}\). The basis matrix \({U^{\left( k \right) }}\) for each view can be calculated by optimizing the following problem:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{U^{\left( k \right) }}{U^{\left( k \right) T}} = I} - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) . \end{matrix}} \end{aligned}$$
(8)

Then we can obtain the optimal solution of \({U^{\left( k \right) }}\) as [19, 31]:

$$\begin{aligned} {U^{(k)}} = {J^{(k)}}{B^{(k)T}}, \end{aligned}$$
(9)

where \({J^{(k)}}\) and \({B^{(k)}}\) are the right and left singular matrices of \(({X^{(k)T}}{W^{(k)}}{P^{(k)}})\), i.e., \({X^{(k)T}}{W^{(k)}}{P^{(k)}} = {B^{(k)}}{\Sigma ^{(k)}}{J^{(k)T}}\).
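As an illustration, a minimal numpy sketch of this update (the function and variable names are ours, not from the paper) could be:

```python
import numpy as np

def update_U(X_k, W_k, P_k):
    """Step 1: U^(k) = J^(k) B^(k)T, where B^(k) and J^(k) are the left and
    right singular matrices of X^(k)T W^(k) P^(k) (Eq. (9))."""
    S = X_k.T @ W_k @ P_k                          # m_k x K
    B, _, Jt = np.linalg.svd(S, full_matrices=False)
    return Jt.T @ B.T                              # K x m_k, satisfies U U^T = I
```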

Step 2: Calculate \({P^{\left( k \right) }}\). Fixing the other variables, variable \({P^{\left( k \right) }}\) can be calculated by minimizing the following problem:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{P^{\left( k \right) }}} {\lambda _1}\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2 + {\lambda _2}{\left\| {{P^{(k)}}} \right\| _1}\\ + Tr\left( {{P^{\left( k \right) T}}{D^{\left( k \right) }}{P^{\left( k \right) }}} \right) - 2Tr\left( {{X^{\left( k \right) T}}{W^{\left( k \right) }}{P^{\left( k \right) }}{U^{\left( k \right) }}} \right) . \end{matrix}} \end{aligned}$$
(10)

Define \({A^{\left( k \right) }} = {U^{\left( k \right) }}{X^{\left( k \right) T}}{W^{\left( k \right) }} + {\lambda _1}{P^{cT}}{G^{\left( k \right) }}\), \({M^{\left( k \right) }} = {D^{\left( k \right) }} + {\lambda _1}{G^{\left( k \right) T}}{G^{\left( k \right) }}\). Obviously, \({M^{\left( k \right) }}\) is still a diagonal matrix with all diagonal elements \(M_{i,i}^{\left( k \right) } > 0\). Thus, (10) can be rewritten into the following equivalent problem:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{P^{\left( k \right) }}} \left\| {{{\left( {{M^{\left( k \right) }}} \right) }^{\frac{1}{2}}}{P^{\left( k \right) }} - {{\left( {{A^{\left( k \right) }}{{\left( {{M^{\left( k \right) }}} \right) }^{\mathrm{{ - }}\frac{1}{2}}}} \right) }^T}} \right\| _F^2 + {\lambda _2}{\left\| {{P^{\left( k \right) }}} \right\| _1}. \end{matrix}} \end{aligned}$$
(11)

Define \({C^{\left( k \right) }} = ({A^{\left( k \right) }}{\left( {{M^{\left( k \right) }}} \right) ^{\mathrm{{ - }}\frac{1}{2}}})^T\); then problem (11) can be rewritten as follows:

$$\begin{aligned} {\begin{matrix} \mathop {\min }\limits _{{P^{\left( k \right) }}} \sum \limits _{i = 1}^{{n_c} + {n_k}} {\left( {\left\| {{{\left( {M_{i,i}^{\left( k \right) }} \right) }^{\frac{1}{2}}}P_{i,:}^{\left( k \right) } - C_{i,:}^{\left( k \right) }} \right\| _2^2 + {\lambda _2}{{\left\| {P_{i,:}^{\left( k \right) }} \right\| }_1}} \right) }, \end{matrix}} \end{aligned}$$
(12)

where \(P_{i,:}^{\left( k \right) }\) and \(C_{i,:}^{\left( k \right) }\) denote the ith rows of matrices \({P^{\left( k \right) }}\) and \({C^{\left( k \right) }}\), respectively. The solution of problem (12) can be computed independently for each row by the conventional shrinkage operation as follows [19]:

$$\begin{aligned} P_{i,:}^{\left( k \right) } = {\varTheta _{\frac{{{\lambda _2}}}{{2M_{i,i}^{\left( k \right) }}}}}\left( {C_{i,:}^{\left( k \right) }/{{\left( {M_{i,i}^{\left( k \right) }} \right) }^{\frac{1}{2}}}} \right), \end{aligned}$$
(13)

where \({\varTheta _\tau }\) denotes the shrinkage (soft-thresholding) operator, i.e., \({\varTheta _\tau }\left( z \right) = \mathrm{sign}\left( z \right) \max \left( {\left| z \right| - \tau ,0} \right) \), applied element-wise.
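For illustration, a small numpy sketch of this row-wise soft-thresholding update (the function and variable names are ours; D_k, G_k, and P_c are assumed to be numpy arrays built as defined above) might look as follows:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Element-wise shrinkage operator: sign(z) * max(|z| - tau, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def update_P(X_k, W_k, U_k, G_k, P_c, D_k, lam1, lam2):
    """Step 2: closed-form update of P^(k) following Eqs. (10)-(13)."""
    A = U_k @ X_k.T @ W_k + lam1 * P_c.T @ G_k        # A^(k), K x (n_c + n_k)
    m = np.diag(D_k) + lam1 * np.diag(G_k.T @ G_k)    # diagonal of M^(k)
    C = (A / np.sqrt(m)).T                            # C^(k) = (A M^{-1/2})^T
    tau = lam2 / (2.0 * m)                            # row-wise thresholds
    return soft_threshold(C / np.sqrt(m)[:, None], tau[:, None])
```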

Step 3: Calculate \({P^c}\). Fixing the other variables, the common latent representation \({P^c}\) can be calculated by solving the following minimization problem:

$$\begin{aligned} \mathop {\min }\limits _{{P^c}} \sum \limits _{k = 1}^v {\left\| {{G^{(k)}}{P^{(k)}} - {P^c}} \right\| _F^2}. \end{aligned}$$
(14)

Problem (14) has the following closed form solution:

$$\begin{aligned} {P^c} = {\sum \limits _{k = 1}^v {{G^{\left( k \right) }}{P^{\left( k \right) }}}}/v. \end{aligned}$$
(15)

Algorithm 1 summarizes the computing procedures of IMC_GRMF.

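For illustration, the overall alternating procedure could be sketched as follows (a simplified sketch, not the authors' reference implementation; it reuses the hypothetical helpers binary_knn_graph, index_matrix, update_U, and update_P sketched above, and initializes the representations randomly):

```python
import numpy as np

def imc_grmf(Xs, n_c, n_clusters, lam1, lam2, n_neighbors=10, n_iter=50):
    """Sketch of Algorithm 1: alternate Steps 1-3 for a fixed number of
    iterations. Xs[k] stacks the n_c paired samples on top of the n_k
    view-specific samples of view k."""
    v = len(Xs)
    Ws = [binary_knn_graph(X, n_neighbors) for X in Xs]        # Eq. (3)
    Ds = [np.diag(W.sum(axis=1)) for W in Ws]
    Gs = [index_matrix(n_c, X.shape[0] - n_c) for X in Xs]     # Eq. (5)
    K = n_clusters
    Ps = [np.random.rand(X.shape[0], K) for X in Xs]
    P_c = np.random.rand(n_c, K)
    for _ in range(n_iter):
        Us = [update_U(Xs[k], Ws[k], Ps[k]) for k in range(v)]             # Step 1
        Ps = [update_P(Xs[k], Ws[k], Us[k], Gs[k], P_c, Ds[k], lam1, lam2)
              for k in range(v)]                                           # Step 2
        P_c = sum(Gs[k] @ Ps[k] for k in range(v)) / v                     # Step 3, Eq. (15)
    # final representation: common part plus the unpaired parts of each view
    P = np.vstack([P_c] + [Ps[k][n_c:] for k in range(v)])
    return P  # cluster P with k-means afterwards
```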

3.3 Computational Complexity and Convergence Property

For Algorithm 1, the dominant computational cost is the singular value decomposition (SVD) in Step 1; the costs of matrix multiplications and additions are ignored since they are far smaller than that of the SVD. Thus, we only take into account the computational complexity of Step 1. Generally, the computational complexity of the SVD of an \(m \times n\) matrix is \(O({mn^2})\) [11]. Therefore, the computational complexity of Step 1 is about \(O\left( {vm{K^2}} \right) \), where v is the number of views and K is the reduced dimension (the number of clusters). Therefore, the computational complexity of the proposed method listed in Algorithm 1 is about \(O\left( {\tau vm{K^2}} \right) \), where \(\tau \) is the number of iterations.

From the above derivations, it is easy to see that the optimization problem (7) is convex with respect to each of the variables \({P^{\left( k \right) }}\), \({P^c}\), and \({U^{\left( k \right) }}\) when the others are fixed. Then we have the following Theorem 1.

Theorem 1:

The objective function value of problem (4) is monotonically decreasing during the iteration.

Proof

Let \(\varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_t^{\left( k \right) }} \right) \) denote the objective function value at the tth iteration. Since all sub-problems with respect to the variables \({P^{\left( k \right) }}\), \({P^c}\), and \({U^{\left( k \right) }}\) are convex and have closed-form solutions, the following inequalities hold:

$$\begin{aligned} \begin{array}{l} \varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_t^{\left( k \right) }} \right) \ge \varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_{t + 1}^{\left( k \right) }} \right) \\ \ge \varUpsilon \left( {P_{t + 1}^{\left( k \right) },P_t^c,U_{t + 1}^{\left( k \right) }} \right) \ge \varUpsilon \left( {P_{t + 1}^{\left( k \right) },P_{t + 1}^c,U_{t + 1}^{\left( k \right) }} \right) . \end{array} \end{aligned}$$
(16)

These inequalities show that the objective function value of problem (4) is monotonically decreasing during the iterations. Thus we complete the proof.

Meanwhile, problem (4) is lower bounded because its objective satisfies \(\varUpsilon \left( {P_t^{\left( k \right) },P_t^c,U_t^{\left( k \right) }} \right) \ge 0\); thus, Theorem 1 guarantees that the proposed method finally converges to a local optimal solution after a number of iterations.

4 Experiments and Analysis

4.1 Experimental Settings

Dataset: (1) Handwritten digit dataset [2]: The used handwritten digit dataset is composed of 2000 samples from 10 digits (0–9). Each sample is represented by two views: one is a feature vector with 240 features obtained by averaging pixels in \(2 \times 3\) windows, and the other is a Fourier coefficient vector with 76 features. (2) BUAA-visnir face dataset (BUAA) [7]: Following the experimental settings in [28], we evaluate different methods on the first 10 subjects with 90 visual images and 90 near-infrared images. Each image was resized to \(10 \times 10\) and then vectorized. (3) Cornell dataset [1, 6]: This dataset contains 195 webpages collected from Cornell University. The webpages are partitioned into five classes and each webpage is represented by two views, i.e., the content view and the citation view. (4) Caltech101 dataset [4]: The original Caltech101 dataset contains 8677 images from 101 object categories. In the experiments, a subset named Caltech7 [10], composed of 1474 images from 7 classes, is used to compare different methods. Two popular types of features, i.e., GIST and LBP, are extracted from each image as the two views. The datasets used are briefly summarized in Table 1.

Evaluation: Three well-known metrics, i.e., clustering accuracy (ACC), normalized mutual information (NMI), and purity, are chosen to evaluate the performance of different methods [2]. For the above datasets, we randomly select 10%, 30%, 50%, 70%, and 90% of the samples as paired samples with all views and treat the remaining samples as incomplete samples, in which half of the samples have only one of the two views. All methods are run 5 times and the average values (%) are reported for comparison.
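For reference, ACC is commonly computed by finding the best one-to-one mapping between predicted clusters and ground-truth classes with the Hungarian algorithm, while NMI is available in standard libraries; a minimal sketch (assuming integer labels starting from 0; not the authors' evaluation code) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between cluster labels and classes,
    found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)   # maximize the number of matches
    return cost[row, col].sum() / y_true.size

# NMI comes directly from scikit-learn:
# nmi = normalized_mutual_info_score(y_true, y_pred)
```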

Compared Methods: Following the experimental settings in [17, 28], we compare the proposed method with the following baselines. (1) BSV (Best Single View): BSV first fills in the missing views with the average of the samples in the corresponding view and then performs k-means on each view separately; the best clustering result of the two views is reported. (2) Concat: It first fills in all missing views with the average of the samples of the corresponding view, then concatenates all views of each sample into one feature vector, and finally performs k-means to obtain the clustering result. (3) PVC [30]: PVC uses the non-negative matrix factorization technique to learn a common latent representation for incomplete multi-view clustering. (4) IMG [28]: IMG extends PVC by embedding an adaptively learned Laplacian graph. (5) Double constrained NMF (DCNMF) [13]: DCNMF is an extension of PVC, which further introduces a Laplacian graph regularizer into PVC. (6) Graph regularized partial multi-view clustering (GPMVC) [16]: GPMVC can be viewed as an improvement of DCNMF, which exploits a scale normalization technique in the consensus representation learning term. The code of the proposed method is available at: http://www.yongxu.org/lunwen.html.

Table 1. Description of the used benchmark datasets.
Table 2. ACCs/NMIs (%) of different methods on the handwritten digit dataset.
Table 3. ACCs/NMIs (%) of different methods on the BUAA dataset.
Table 4. ACCs/NMIs (%) of different methods on the Cornell dataset.
Table 5. ACCs/NMIs (%) of different methods on the Caltech7 dataset.
Fig. 2.

Purity (%) of different methods on the above four datasets.

4.2 Experimental Results and Analyses

The clustering results of different methods on the above four datasets are reported in Tables 2, 3, 4, 5 and Fig. 2. It is obvious that the proposed method significantly improves the ACC, NMI, and purity. In particular, the proposed method achieves an ACC nearly 8% higher than those of the related methods on the BUAA dataset. This good performance strongly validates the effectiveness of the proposed method in handling IMC tasks. Besides, we can draw the following observations from the experimental results.

(1) Generally, as the ratio of missing views decreases, the clustering performance of all methods improves obviously. This proves that the complementary information of different views is very useful in multi-view learning.

(2) In most cases, BSV and Concat perform much worse than the other methods. This proves that filling in the missing views with the average of the samples of the corresponding view is not a good approach.

(3) DCNMF, GPMVC, and the proposed method perform better than PVC in most cases. Compared with PVC, these methods all exploit the local geometric structure of each view to guide the representation learning. Thus, the experimental results prove that the local information of each view is very useful and beneficial for learning a more compact and discriminative representation. Meanwhile, our method achieves better performance than DCNMF and GPMVC, which further proves the effectiveness of the proposed novel graph regularization term.

4.3 Parameter Analysis

Figure 3 shows the ACC versus the parameters \({\lambda _1}\) and \({\lambda _2}\) on the handwritten digit and BUAA datasets with 70% paired samples. It is obvious that the ACC of the proposed method is relatively stable in some local areas, which indicates that the proposed method is insensitive to the selection of parameters to some extent. Moreover, when the two parameters are selected from the candidate ranges (\([ {{{10}^0},{{10}^2}}]\), \([ {{{10}^{\mathrm{{ - 5}}}},{{10}^{\mathrm{{ - 1}}}}}]\)), the proposed method achieves satisfactory performance. This indicates that a relatively larger value of parameter \(\lambda _1\) encourages better performance. In our work, we use grid search to find the optimal combination of the two parameters from the two-dimensional grid formed by (\([ {{{10}^0},{{10}^2}}]\), \([ {{{10}^{\mathrm{{ - 5}}}},{{10}^{\mathrm{{ - 1}}}}}]\)) [24].
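A simple sketch of this grid search (reusing the hypothetical imc_grmf and clustering_accuracy functions sketched above; the candidate values below are only one possible discretization of the stated ranges) is:

```python
from sklearn.cluster import KMeans

def grid_search(Xs, n_c, n_clusters, y_true):
    """Pick (lam1, lam2) from the candidate grid by the resulting ACC."""
    best = (None, -1.0)
    for lam1 in [1e0, 1e1, 1e2]:
        for lam2 in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
            P = imc_grmf(Xs, n_c, n_clusters, lam1, lam2)
            y_pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(P)
            acc = clustering_accuracy(y_true, y_pred)
            if acc > best[1]:
                best = ((lam1, lam2), acc)
    return best
```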

Figure 4 plots the relationship between ACC and the number of nearest neighbors of the proposed method on the handwritten digit and BUAA datasets. From the figures, we draw the following conclusions: (1) The clustering performance is insensitive to the selection of the nearest neighbor number to some extent when it lies in a proper range, such as \(\left[ {8,18} \right] \) for the handwritten digit dataset and \(\left[ {2,6} \right] \) for the BUAA dataset. (2) Generally, the number of nearest neighbors should be less than the number of samples in each class. For example, from Fig. 4(b), we can see that when the number of nearest neighbors is larger than the number of samples per class, i.e., \(N > 10\), the ACC decreases dramatically. However, in real-world applications, it is impossible to obtain the true number of samples per class. In this work, we use the following criterion to select the number of nearest neighbors. Suppose we try to partition the available multi-view data with n samples into c groups, and let \(m = n/c\) denote the average number of samples per group. If \(m \gg 10\), then we empirically select 10 as the number of nearest neighbors; otherwise we select \(min(m - 4,2)\) as the nearest neighbor number.

Fig. 3.

ACC (%) versus parameters \({\lambda _\mathrm{{1}}}\) and \({\lambda _\mathrm{{2}}}\) of the proposed method on (a) handwritten digit and (b) BUAA datasets with 70% paired samples.

Fig. 4.

ACC (%) versus the number of nearest neighbors of our method on (a) handwritten digit and (b) BUAA datasets with 50% and 70% paired samples.

4.4 Experimental Convergence Study

Figure 5 shows the objective function value and ACC at each iteration on the handwritten digit and BUAA datasets with 70% paired samples. From the figures, it is clear that the objective function value decreases dramatically in the first few iterations (within 20 iterations). The experimental results plotted in the two figures confirm the good convergence property of our method.

Fig. 5.

The objective function value and ACC (%) versus the iteration step of the proposed method on (a) handwritten digit and (b) BUAA datasets with 70% paired samples.

5 Conclusions

In this paper, we propose a novel framework for multi-view learning, which not only handles both incomplete and complete multi-view clustering, but is also able to deal with out-of-sample data. Moreover, the proposed method has the potential to complete the missing views of any sample. Besides, we provide a novel approach to exploit the local information of data without introducing any extra regularization term or penalty parameter, which does not increase the complexity or the computational burden. Extensive experimental results prove the effectiveness of the proposed method.