1 Introduction

Spectral clustering treats the clustering problem as a graph partitioning problem: the graph cut objective is optimized using the eigenvectors of the graph Laplacian matrix [1]. Compared with conventional clustering algorithms, spectral clustering can recognize more complex data structures and is especially suitable for non-convex data sets. Recently, an improved version of the normalized cut, named the Cheeger cut, has attracted much attention [2]. Research shows that the Cheeger cut produces more balanced clusters through the graph p-Laplacian matrix [3], a nonlinear generalization of the graph Laplacian.

p-Spectral clustering groups data points based on the Cheeger cut. Because it has a solid theoretical foundation and yields good clustering results, research in this area is currently very active. Dhanjal et al. present an incremental spectral clustering method that updates the eigenvectors of the Laplacian in a computationally efficient way [4]. Gao et al. construct a sparse affinity graph on a small representative data set and use local interpolation to improve the extension of the clustering results [5]. Semertzidis et al. inject pairwise constraints into a small affinity sub-matrix and use the sparse coding strategy of landmark spectral clustering to keep the complexity low [6].

Nowadays, science and technology are growing by leaps and bounds, and massive data volumes result in a “data explosion”. These data are often high-dimensional. When dealing with high-dimensional data, clustering algorithms that perform well in low-dimensional spaces often fail to produce good results, or even fail entirely [7]. Attribute reduction is an effective way to decrease the size of the data and is often used as a preprocessing step for data mining. The essence of attribute reduction is to remove irrelevant or unnecessary attributes while maintaining the classification ability of the knowledge base. Efficient attribute reduction not only improves the clarity of knowledge in intelligent information systems, but also reduces the cost of such systems to some extent. In order to handle high-dimensional data effectively, we design a novel attribute reduction method based on neighborhood granulation and combine it with p-spectral clustering. The proposed algorithm inherits the advantages of neighborhood rough sets and the graph p-Laplacian. Its effectiveness is demonstrated by comprehensive experiments on benchmark data sets.

This paper is organized as follows: Sect. 2 introduces p-spectral clustering; Sect. 3 uses information entropy to improve attribute reduction based on neighborhood rough sets; Sect. 4 improves p-spectral clustering with neighborhood attribute granulation; Sect. 5 verifies the effectiveness of the proposed algorithm on benchmark data sets; finally, we summarize the main contributions of this paper.

2 p-Spectral Clustering

The idea of spectral clustering comes from spectral graph partition theory. Given a data set, we can construct an undirected weighted graph G = (V,E), where V is the set of vertices represented by the data points and E is the set of edges weighted by the similarity between their two endpoints. Suppose A is a subset of V; the complement of A is written as \( \bar{A} = V\backslash A \). The cut of A and \( \bar{A} \) is defined as:

$$ cut(A,\bar{A}) = \sum\limits_{{i \in A,\,j \in \bar{A}}} {w_{ij} } $$
(1)

where \( w_{ij} \) is the similarity between vertex i and vertex j.

In order to obtain more balanced clusters, Cheeger et al. propose the Cheeger cut criterion, denoted Ccut [8]:

$$ Ccut(A,\bar{A}) = \frac{{cut(A,\bar{A})}}{{\hbox{min} \{ \left| A \right|,\left| {\bar{A}} \right|\} }} $$
(2)

where \( \left| A \right| \) is the number of data points in set A. The Cheeger cut seeks the partition that minimizes formula (2). An optimal partition makes the similarities within a cluster as large as possible while keeping the similarities between clusters as small as possible. However, computing the optimal Cheeger cut exactly is an NP-hard problem. Next we derive an approximate solution of the Cheeger cut by introducing the p-Laplacian into spectral clustering.
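To make the objective concrete, the following Python sketch (illustrative names, assuming a dense similarity matrix W as a NumPy array and a boolean membership vector for A; not code from the original paper) evaluates the cut in (1) and the Cheeger cut in (2):

```python
import numpy as np

def cut_value(W, in_A):
    """Sum of edge weights crossing the partition (A, V\\A), Eq. (1)."""
    in_A = np.asarray(in_A, dtype=bool)
    return W[np.ix_(in_A, ~in_A)].sum()

def cheeger_cut(W, in_A):
    """Cheeger cut of the bipartition (A, V\\A), Eq. (2)."""
    in_A = np.asarray(in_A, dtype=bool)
    size_A, size_Ac = in_A.sum(), (~in_A).sum()
    if size_A == 0 or size_Ac == 0:          # degenerate partition
        return np.inf
    return cut_value(W, in_A) / min(size_A, size_Ac)

# Toy example: two dense blocks weakly connected to each other.
W = np.array([[0.0, 1.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.1],
              [1.0, 1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.1, 0.0, 1.0, 0.0]])
print(cheeger_cut(W, [True, True, True, False, False]))  # small, balanced cut
```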

Hein et al. define the inner-product form of the graph p-Laplacian \( \Delta_{p} \) as follows [9]:

$$ \left\langle {\text{f} ,\Delta_{p} \text{f} } \right\rangle = \frac{1}{2}\sum\limits_{i,\,j = 1}^{n} {w_{ij} \left| {f_{i} - f_{j} } \right|^{p} } $$
(3)

where \( p \in (1,2] \) and f is an eigenvector of the p-Laplacian matrix.
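As a quick illustration, the following Python snippet (illustrative names, not from the original paper) evaluates the form in (3) for a given weight matrix W and vector f; for p = 2 it reduces to the usual graph Laplacian quadratic form \( f^{T}Lf \), which the sanity check below verifies.

```python
import numpy as np

def p_laplacian_form(W, f, p=1.5):
    """Evaluate <f, Delta_p f> = 0.5 * sum_ij w_ij |f_i - f_j|^p, Eq. (3)."""
    diffs = np.abs(f[:, None] - f[None, :])   # |f_i - f_j| for all pairs
    return 0.5 * np.sum(W * diffs ** p)

# Sanity check against the p = 2 case: <f, Delta_2 f> = f^T (D - W) f.
W = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.5],
              [0.2, 0.5, 0.0]])
f = np.array([1.0, -0.3, 0.7])
L = np.diag(W.sum(axis=1)) - W
assert np.isclose(p_laplacian_form(W, f, p=2), f @ L @ f)
```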

Theorem 1.

For p > 1 and every partition of V into A and \( \bar{A} \), there exists a function f such that the functional \( F_{p} \) associated with the p-Laplacian satisfies

$$ F_{p} (f,A) = \frac{{\left\langle {\text{f} ,\Delta_{p} \text{f} } \right\rangle }}{{\left\| \text{f} \right\|^{p} }} = cut(A,\bar{A})\left( {\frac{1}{{\left| A \right|^{{\tfrac{1}{p - 1}}} }} + \frac{1}{{\left| {\bar{A}} \right|^{{\tfrac{1}{p - 1}}} }}} \right)^{p - 1} $$
(4)

where \( \left\| \text{f} \right\|^{p} = \sum\limits_{i = 1}^{n} {\left| {f_{i} } \right|^{p} } \). Expression (4) can be interpreted as a balanced graph cut criterion, and we have the special case

$$ \mathop {\lim }\limits_{p \to 1} F_{p} (f,A) = Ccut(A,\bar{A}) $$
(5)

Theorem 1 shows that the Cheeger cut can be approximated in polynomial time using the p-Laplacian operator: the minimizer of \( F_{p}(f) \) is a relaxed approximate solution of the Cheeger cut, and it can be obtained from the eigen-decomposition of the p-Laplacian:

$$ \lambda_{p} = \mathop {\hbox{min} }\limits_{f} F_{p} (f) $$
(6)

where \( \lambda_{p} \) is the eigenvalue corresponding to the eigenvector f.

Specifically, the second eigenvector \( v_{p}^{(2)} \) of the p-Laplacian matrix leads to a bipartition of the graph once an appropriate threshold is set [3]. The optimal threshold is determined by minimizing the corresponding Cheeger cut: for the second eigenvector \( v_{p}^{(2)} \) of the graph p-Laplacian \( \Delta_{p} \), the threshold t should satisfy:

$$ \mathop {\arg \hbox{min} }\limits_{{A_{t} = \{ i \in V|v_{p}^{(2)} (i) > t\} }} Ccut(A_{t} ,\bar{A}_{t} ) $$
(7)
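A minimal sketch of this thresholding step is given below (Python, illustrative names; it assumes the second eigenvector v2 of the p-Laplacian has already been computed, e.g. as in [3]): every entry of v2 is tried as a threshold, and the one yielding the smallest Cheeger cut is kept.

```python
import numpy as np

def cheeger_cut(W, in_A):
    """Cheeger cut of the bipartition induced by the boolean mask in_A, Eq. (2)."""
    in_A = np.asarray(in_A, dtype=bool)
    if in_A.all() or (~in_A).all():
        return np.inf                          # degenerate partition
    cut = W[np.ix_(in_A, ~in_A)].sum()
    return cut / min(in_A.sum(), (~in_A).sum())

def best_threshold(W, v2):
    """Search the entries of the second eigenvector for the threshold minimizing Ccut, Eq. (7)."""
    best_t, best_val = None, np.inf
    for t in np.unique(v2):
        mask = v2 > t                          # A_t = {i in V : v2(i) > t}
        val = cheeger_cut(W, mask)
        if val < best_val:
            best_t, best_val = t, val
    return best_t, best_val
```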

3 Neighborhood Attribute Granulation

Rough set theory was proposed by Pawlak in 1982 [10]. Attribute reduction is one of the core topics of knowledge discovery with rough sets. However, the Pawlak rough set model is only suitable for discrete data. To address this limitation, Hu et al. propose the neighborhood rough set model [11], which can directly analyze attributes with continuous values and therefore has great advantages in feature selection and classification accuracy.

Definition 1.

The domain \( U = \{ x_{1} ,x_{2} , \cdots ,x_{n} \} \) is a non-empty finite set in real space. For \( x_{i} \in U \), the δ-neighborhood of \( x_{i} \) is defined as:

$$ \delta (x_{i} ) = \{ x|x \in U,\Delta (x,x_{i} ) \le \delta \} $$
(8)

where \( \delta \ge 0 \); \( \delta (x_{i} ) \) is called the neighborhood granule of \( x_{i} \), and Δ is a distance function (e.g. the Euclidean distance).
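For concreteness, a small Python sketch (illustrative, assuming Euclidean distance on a NumPy data matrix X) that collects the δ-neighborhood of every point as in (8):

```python
import numpy as np

def delta_neighborhoods(X, delta, cols=None):
    """Return, for each sample, the indices of its delta-neighborhood, Eq. (8).

    X     : (n, m) data matrix
    delta : neighborhood radius
    cols  : optional attribute subset B used for the distance
    """
    Xb = X if cols is None else X[:, cols]
    # Pairwise Euclidean distances Delta(x_i, x_j)
    dists = np.linalg.norm(Xb[:, None, :] - Xb[None, :, :], axis=2)
    return [np.flatnonzero(dists[i] <= delta) for i in range(len(X))]
```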

Definition 2.

Let \( U = \{ x_{1} ,x_{2} , \cdots ,x_{n} \} \) be a domain located in real space, let A be the condition attribute set of U, and let D be the decision attribute. If A generates a family of neighborhood relations on the domain U, then \( NDT = \left\langle {U,A,D} \right\rangle \) is called a neighborhood decision system.

For a neighborhood decision system \( NDT = \left\langle {U,A,D} \right\rangle \), the domain U is divided by the decision attribute D into N equivalence classes \( X_{1} ,X_{2} , \cdots ,X_{N} \). For any \( B \subseteq A \), the lower approximation of D with respect to B is \( \underline{{N_{B} }} D = \bigcup\limits_{i = 1}^{N} {\underline{{N_{B} }} X_{i} } \), where \( \underline{{N_{B} }} X_{i} = \{ x_{j} |\delta_{B} (x_{j} ) \subseteq X_{i} ,x_{j} \in U\} \).

According to the properties of the lower approximation, we can define the dependence of the decision attribute D on the condition attribute set B:

$$ \gamma_{B} (D) = \frac{{Card(\underline{{N_{B} }} D)}}{Card(U)} $$
(9)

where \( 0 \le \gamma_{B} (D) \le 1 \). Obviously, the greater the positive region \( \underline{{N_{B} }} D \), the stronger the dependence of decision D on condition B.
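The sketch below (Python, illustrative names, reusing the δ-neighborhood idea above and assuming the decision labels are stored in y) computes the lower approximation implicitly and returns the dependency degree of (9):

```python
import numpy as np

def dependency(X, y, cols, delta):
    """Dependency gamma_B(D) of the decision on the attribute subset B = cols, Eq. (9)."""
    Xb = X[:, cols]
    dists = np.linalg.norm(Xb[:, None, :] - Xb[None, :, :], axis=2)
    positive = 0
    for i in range(len(X)):
        neigh = np.flatnonzero(dists[i] <= delta)    # delta_B(x_i)
        # x_i belongs to the lower approximation if its whole neighborhood
        # shares one decision class (necessarily the class of x_i).
        if np.all(y[neigh] == y[i]):
            positive += 1
    return positive / len(X)
```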

Definition 3.

Given a neighborhood decision system \( NDT = \left\langle {U,A,D} \right\rangle \) and \( B \subseteq A \), for any \( a \in A - B \) the significance of attribute a relative to B is defined as:

$$ SIG(a,B,D) = \gamma_{B \cup a} (D) - \gamma_{B} (D) $$
(10)

However, sometimes several attributes share the same, greatest significance. Traditional reduction algorithms then pick one of them at random, which is obviously arbitrary: it ignores the impact of other factors on attribute selection and may lead to poor reduction results. Analyzing attribute reduction from the viewpoint of information theory can improve the reduction accuracy [12]. Here, we use information entropy as an additional criterion to evaluate attributes. The definition of entropy is given below.

Definition 4.

Given knowledge P and the partition \( U/P = \{ X_{1} ,X_{2} , \cdots ,X_{n} \} \) it induces on the domain U, the information entropy of knowledge P is defined as:

$$ H(P) = - \sum\limits_{i = 1}^{n} {p(X_{i} )\log p(X_{i} )} $$
(11)

where \( p(X_{i} ) = \left| {X_{i} } \right|/\left| U \right| \) represents the probability of the equivalence class \( X_{i} \) on the domain U.

If multiple attributes have the same greatest significance, we compare their information entropy and select the attribute with the minimum entropy, because it carries the least uncertainty. The selected attribute is added to the reduction set, and this process is repeated until the reduction set no longer changes. This improved attribute reduction algorithm is shown as Algorithm 1.
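Since Algorithm 1 itself is not reproduced here, the following Python sketch outlines one plausible reading of the greedy reduction with entropy tie-breaking under the assumptions of the previous sections (numeric data matrix X, decision labels y, neighborhood radius delta); the equal-width discretization used for the entropy of a continuous attribute is our own illustrative choice, not the authors' exact procedure.

```python
import numpy as np

def dependency(X, y, cols, delta):
    """Dependency gamma_B(D) of the decision on attribute subset B = cols, Eq. (9)."""
    Xb = X[:, cols]
    dists = np.linalg.norm(Xb[:, None, :] - Xb[None, :, :], axis=2)
    in_pos = [np.all(y[dists[i] <= delta] == y[i]) for i in range(len(X))]
    return float(np.mean(in_pos))

def entropy_of_attribute(x, bins=10):
    """Information entropy H(P) of the partition induced by one attribute, Eq. (11).
    Continuous values are discretized into equal-width bins for illustration."""
    _, counts = np.unique(np.digitize(x, np.histogram_bin_edges(x, bins)), return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log(probs))

def neighborhood_reduction(X, y, delta, bins=10):
    """Greedy reduction: add the most significant attribute (Eq. (10)),
    breaking ties by minimum information entropy, until nothing improves."""
    reduct, gamma_red = [], 0.0
    while True:
        candidates = [a for a in range(X.shape[1]) if a not in reduct]
        if not candidates:
            break
        # Significance SIG(a, B, D) = gamma_{B u {a}}(D) - gamma_B(D)
        sig = {a: dependency(X, y, reduct + [a], delta) - gamma_red for a in candidates}
        best = max(sig.values())
        if best <= 0:                       # no attribute improves the dependency
            break
        tied = [a for a in candidates if np.isclose(sig[a], best)]
        chosen = min(tied, key=lambda a: entropy_of_attribute(X[:, a], bins))
        reduct.append(chosen)
        gamma_red += sig[chosen]            # dependency of the enlarged reduct
    return reduct
```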

4 p-Spectral Clustering Based on Neighborhood Attribute Granulation

Processing massive high-dimensional data has long been a challenging problem in data mining. High-dimensional data is often accompanied by the “curse of dimensionality”, so traditional p-spectral clustering algorithms cannot play to their strengths. Moreover, real data sets often contain noise and irrelevant features and are prone to a “dimension trap”, which interferes with the clustering process and affects the accuracy of the results [13]. To solve this problem, we propose a novel p-spectral clustering algorithm based on neighborhood attribute granulation (NAG-pSC). The detailed steps of the NAG-pSC algorithm are given in Algorithm 2.
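Algorithm 2 is likewise not reproduced here. The sketch below shows one plausible reading of the overall NAG-pSC pipeline for a single bipartition, composing the helper functions from the earlier sketches (`neighborhood_reduction` from the Algorithm 1 sketch and `best_threshold` from the Eq. (7) sketch). For runnability, the second eigenvector is taken from the standard (p = 2) graph Laplacian as a stand-in; the actual algorithm uses the p-Laplacian minimization of [3]. The decision labels y are assumed available for the granulation step, as the neighborhood decision system of Sect. 3 requires.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Gaussian similarity graph on the (attribute-reduced) data."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def second_eigenvector(W):
    """Second eigenvector of the unnormalized graph Laplacian
    (p = 2 stand-in for the p-Laplacian solver of [3])."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1]

def nag_psc_bipartition(X, y, delta=0.3, sigma=1.0):
    """Sketch of the NAG-pSC pipeline: attribute reduction -> similarity graph
    -> spectral bipartition thresholded by the Cheeger cut."""
    reduct = neighborhood_reduction(X, y, delta)   # Algorithm 1 sketch (Sect. 3)
    W = rbf_similarity(X[:, reduct], sigma)
    v2 = second_eigenvector(W)
    t, _ = best_threshold(W, v2)                   # threshold search sketch, Eq. (7)
    return v2 > t                                  # boolean cluster assignment
```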

5 Experimental Analysis

To test the effectiveness of the proposed NAG-pSC algorithm, we conduct experiments on six benchmark data sets. The characteristics of these data sets are shown in Table 1.

Table 1. Data sets used in the experiments

In this paper, we use F-measure to evaluate the merits of clustering results [14]. The F-score of each class i and the total F index of the clustering results are defined as:

$$ F(i) = \frac{2 \times P(i) \times R(i)}{P(i) + R(i)} $$
(12)
$$ F = \frac{1}{n}\sum\limits_{i = 1}^{k} {[N_{i} \times F(i)]} $$
(13)

where \( P(i) = N_{ii*} /N_{i*} \) is the precision and \( R(i) = N_{ii*} /N_{i} \) is the recall; \( N_{ii*} \) is the size of the intersection of class i and cluster i*; \( N_{i} \) is the size of class i; \( N_{i*} \) is the size of cluster i*; n is the number of data points; k is the number of classes. \( F \in [0,1] \); the greater the F index, the closer the clustering result is to the true class structure of the data.
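A small Python sketch of this evaluation is given below (illustrative; labels_true holds the classes, labels_pred the clusters). The original text does not spell out how cluster i* is matched to class i; here we use the common convention of taking, for each class, the cluster that maximizes its F-score.

```python
import numpy as np

def f_measure(labels_true, labels_pred):
    """Overall F index, Eqs. (12)-(13): each class i is matched to the cluster i*
    maximizing its F-score, and the per-class scores are weighted by class size."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    total = 0.0
    for c in np.unique(labels_true):
        in_class = labels_true == c
        best_f = 0.0
        for k in np.unique(labels_pred):
            in_cluster = labels_pred == k
            n_ii = np.sum(in_class & in_cluster)
            if n_ii == 0:
                continue
            precision = n_ii / in_cluster.sum()   # P(i) = N_ii* / N_i*
            recall = n_ii / in_class.sum()        # R(i) = N_ii* / N_i
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += in_class.sum() * best_f          # N_i * F(i)
    return total / n
```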

In the experiments, the NAG-pSC algorithm is compared with traditional spectral clustering (SC), density-sensitive spectral clustering (D-SC) [1], and p-spectral clustering (pSC) [3]. The threshold δ is important in neighborhood rough sets. Hu et al. recommend the range [0.2, 0.4] for δ based on experimental analysis [11], so we set the neighborhood size δ via a cross-validated search over [0.2, 0.4] (with step size 0.05) for each data set. The clustering results of the four algorithms are shown in Fig. 1; the horizontal axis is the cluster label, and the vertical axis is the F-score of each cluster.

Fig. 1. Clustering results on different data sets

From Fig. 1 we can see that the performance of the SC algorithm is close to that of D-SC. This is mainly because both are based on graph theory and turn the clustering problem into a graph partitioning problem. Using the p-Laplacian transform, pSC may find a globally better partition. SC works well on the Sonar data set, D-SC deals well with the Colon Cancer data set, and pSC generates balanced clusters on the WDBC data set. But for high-dimensional clustering problems, their F-scores are lower than those of the proposed NAG-pSC algorithm. The information carried by each attribute differs, and so does its contribution to the clustering; improper feature selection therefore has a great impact on the clustering results. Traditional spectral clustering does not take this into account and is susceptible to interference from noise and irrelevant attributes. For further comparison, Table 2 lists the overall F index of each algorithm and the number of condition attributes of the different data sets.

Table 2. Total F index of different algorithms

Table 2 shows that the NAG-pSC algorithm deals well with high-dimensional data. NAG-pSC uses neighborhood rough sets to preprocess the data: the neighborhood attribute reduction based on information entropy diminishes the negative impact of noisy data and redundant attributes on the clustering. So in most cases NAG-pSC achieves higher clustering accuracy. By combining the advantages of p-spectral clustering and neighborhood attribute granulation, it exhibits good robustness and strong generalization ability.

6 Conclusions

To improve the performance of p-spectral clustering on high-dimensional data, we modify the attribute reduction method based on neighborhood rough sets: attribute significance is combined with information entropy to select the appropriate attributes. We then propose the NAG-pSC algorithm, which runs p-spectral clustering on the reduced attribute set. Experiments show that NAG-pSC is superior to traditional spectral clustering, density-sensitive spectral clustering, and p-spectral clustering. In the future, we will study how to apply NAG-pSC to web data mining, image retrieval, and other realistic scenarios.