1 Introduction

Clustering is an important approach for analyzing the intrinsic structure of data. Affinity propagation (AP) clustering, proposed by Frey and Dueck [1], is a popular clustering method. AP clustering aims to find the optimal representative point, called the 'exemplar', for each data point. In many application domains, finding representative points is more useful than merely separating data points into several classes [2,3,4,5]. For example, the representative points extracted from a document can be used to summarize and refine it. Unlike k-means, the AP algorithm does not require specifying the initial cluster centers in advance [6, 7]. Instead, it regards every data point as a potential cluster center, thereby avoiding the arbitrariness of selecting initial cluster centers.

However, the AP clustering algorithm cannot directly specify the final number of clusters; the number of resulting clusters is controlled only indirectly by a user-defined parameter. To generate exactly K clusters, Zhang et al. [8] proposed the K-AP clustering algorithm. Like AP, K-AP first needs to construct a similarity matrix, so it is crucial to select an appropriate distance measure that describes the real structure of the dataset. Data points belonging to the same cluster should have high similarity and maintain spatial coherence [9]. The K-AP algorithm performs well on linearly separable data but is not suited to clustering manifold data, because it measures the similarity between data points with the Euclidean distance, which cannot correctly reflect the distribution of complex manifold datasets [10]. This significantly degrades the performance of K-AP and leads to poor clustering results. Based on the assumption that clusters should be locally and globally coherent, this paper designs a manifold similarity measure. We use a density-adjustable segment length to compute the distance between data points, so the measure describes the manifold data distribution much better. The manifold similarity measure is then used to improve the performance of the K-AP algorithm.

To overcome the difficulties that K-AP faces on manifold data, we propose a K-AP clustering algorithm based on a manifold similarity measure (MKAP). The rest of the paper is organized as follows: Sect. 2 introduces the basic theory of the K-AP clustering algorithm; Sect. 3 describes the manifold similarity measure; Sect. 4 presents the MKAP algorithm and details its steps; Sect. 5 verifies the effectiveness of the MKAP algorithm on synthetic and real-world datasets; the last section concludes the paper.

2 Basic K-AP Clustering

In the AP clustering algorithm, the number of clusters is controlled by the preference parameter, and it is not easy to set a preference value that yields the desired number of clusters [11]. The K-AP clustering algorithm solves this problem well: it takes the specified cluster number k as an input parameter and directly partitions the data points into k groups. K-AP searches for the optimal set of cluster representative points, maximizing an energy function by passing messages between data points. Equation (1) is the energy function of K-AP:

$$ E(\varepsilon ) = \sum\limits_{j = 1}^{K} \sum\limits_{x_{i} : c(x_{i} ) = e_{j}} s(x_{i} , e_{j} ) $$
(1)

where K is the number of clusters and hence of representative points; \( \varepsilon = \{ e_{1} , \cdots ,e_{K} \} \) is the set of representative points; c(xi) is the mapping from xi to its closest representative point; \( s(x_{i} ,e_{j} ) \) is the similarity between xi and the cluster representative point ej.

To find K representative points, we introduce binary variables \( \left\{ {b_{ij} \in \{ 0,1\} ,\;i,j = 1, \cdots ,N} \right\} \) to encode the assignment of representative points: \( b_{ij} = 1 \), \( i \ne j \), means that xi chooses xj as its representative point, and \( b_{ii} = 1 \) means that xi is itself a representative point. Then Eq. (1) is equivalent to Eq. (2):

$$ E(\left\{ {b_{ij} } \right\}) = \sum\limits_{i = 1}^{N} \sum\limits_{j = 1}^{N} b_{ij} \, s(x_{i} ,x_{j} ) $$
(2)

Equation (2) is maximized subject to three conditions: \( \sum\limits_{j = 1}^{N} {b_{ij} } = 1 \); \( b_{ii} = 1 \) if \( \exists \, b_{ji} = 1 \); \( \sum\limits_{i = 1}^{N} {b_{ii} } = K \). These conditions mean that: (a) every xi has exactly one representative point; (b) if any point xj selects xi as its representative point, then xi must itself be a representative point; (c) the number of representative points must be exactly K. The constraints can be encoded in a factor graph model, so the problem of finding K representative points turns into searching for the optimal values of bij in the factor graph. Equation (3) is the objective function of K-AP:

$$ F(b;s;K) = \prod\limits_{i = 1}^{N} \left( e^{b_{ii}} \prod\limits_{j = 1,\,j \ne i}^{N} e^{b_{ij} s(i,j)} \right) h(b_{11} , \cdots ,b_{NN} \,|\,K) \prod\limits_{j = 1}^{N} f_{j} (b_{1j} , \cdots ,b_{Nj} ) \prod\limits_{i = 1}^{N} g_{i} (b_{i1} , \cdots ,b_{iN} ) $$
(3)

where {gi}, {fj} and h are three constraint functions. The resulting optimization problem can be solved by the Belief Propagation (BP) method [8].
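To make Eq. (2) and its conditions concrete, here is a minimal Python sketch (the names, e.g. `exemplar_of`, are ours and not from [8]); it evaluates the energy of a candidate assignment, with the variables b encoded as a vector storing each point's chosen representative:

```python
import numpy as np

def kap_energy(S, exemplar_of):
    """Eq. (2): sum of similarities from every point to its chosen exemplar.

    S is the N x N similarity matrix; exemplar_of[i] = j encodes b_ij = 1.
    """
    idx = np.arange(len(exemplar_of))
    return S[idx, exemplar_of].sum()

def satisfies_constraints(exemplar_of, K):
    """Check conditions (a)-(c) below Eq. (2) for an assignment vector."""
    # (a) holds by construction: each point stores exactly one exemplar.
    exemplars = set(exemplar_of)
    # (b) every point chosen as an exemplar must also choose itself.
    chooses_itself = all(exemplar_of[j] == j for j in exemplars)
    # (c) there must be exactly K exemplars.
    return chooses_itself and len(exemplars) == K
```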

3 Manifold Similarity Measure

The standard K-AP clustering algorithm measures the similarity between data points with a Gaussian kernel function. The Gaussian kernel is based on the Euclidean distance, but the Euclidean distance is not a proper distance measure for manifold data. Figure 1 illustrates its shortcomings.

Fig. 1. Euclidean distance for manifold data

As Fig. 1 shows, points b and c lie on the same manifold, while points a and b lie on different manifolds. We would like the similarity between b and c to be greater than the similarity between a and b, so that b and c can be grouped into the same cluster. However, the Euclidean distance between a and b is significantly smaller than that between b and c. We assume that data pairs on the same manifold should have high similarity and data pairs on different manifolds should have low similarity [12], so this paper presents a manifold similarity function that satisfies this clustering assumption. First we define a segment length on manifold data.

Definition 1.

The length of line segment on manifold:

$$ L(x,y) = e^{\rho d(x,y)} - 1 $$
(4)

where \( d(x,y) = \left\| x - y \right\| \) is the Euclidean distance between the data points x and y, and ρ is called the scaling factor.
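As a minimal Python illustration of Eq. (4) (the default ρ = 2 below matches the experimental setting in Sect. 5 but is otherwise arbitrary):

```python
import numpy as np

def segment_length(x, y, rho=2.0):
    """Eq. (4): density-adjustable length of the segment between x and y.

    Short Euclidean gaps stay nearly linear (e^(rho*d) - 1 ~ rho*d), while
    long gaps are stretched exponentially, penalizing "shortcut" edges that
    jump between manifolds.
    """
    d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return np.exp(rho * d) - 1.0
```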

If two points lie on the same manifold, we assume there is a path inside the manifold connecting them, and we can use the length of this path as the manifold distance between the two points [13]. Based on the segment length on a manifold, a new distance measure, the manifold distance measure, is given in Definition 2.

Definition 2.

Manifold distance measure: Given an undirected weighted graph \( G = \left( {V,E} \right) \), let \( p = \left\{ {v_{1} ,v_{2} , \ldots ,v_{|p|} } \right\} \in V^{|p|} \) denote a path between vertices v1 and v|p|, where \( \left| p \right| \) is the number of vertices on path p and every edge \( \left( {v_{k} ,v_{k + 1} } \right) \in E \), \( 1 \le k < \left| p \right| \). Let Pij denote the set of all paths connecting the point pair \( \left\{ {x_{i} ,x_{j} } \right\} \) \( (1 \le i,j \le N) \). Then the manifold distance between xi and xj is

$$ D_{i,\,j}^{\rho } = \frac{1}{{\rho^{2} }}\left[ \ln \left( 1 + d_{sp} (x_{i} ,x_{j} ) \right) \right]^{2} $$
(5)

where \( d_{sp} (x_{i} ,x_{j} ) = \mathop {\min }\limits_{{p \in P_{ij} }} \sum\limits_{k = 1}^{|p| - 1} {L(v_{k} ,v_{k + 1} )} \) is the length of the shortest path between xi and xj on graph G, and \( L\left( {v_{k} ,v_{k + 1} } \right) \) is the manifold segment length between adjacent points on that shortest path.
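Below is a minimal Python sketch of Eq. (5). It assumes a fully connected graph over the data, since the paper does not specify the graph construction (a k-nearest-neighbor graph would also fit Definition 2), and it reads the square in Eq. (5) as applying to the logarithm:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def manifold_distance(X, rho=2.0):
    """Eq. (5): pairwise manifold distances for the rows of X."""
    d = cdist(X, X)                        # Euclidean distances d(x_i, x_j)
    W = np.exp(rho * d) - 1.0              # Eq. (4) as edge weights
    d_sp = shortest_path(W, method="D")    # Dijkstra: d_sp(x_i, x_j)
    return np.log1p(d_sp) ** 2 / rho ** 2  # Eq. (5)
```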

Definition 3.

According to the above manifold distance measure, the manifold similarity between data points xi and xj is defined as

$$ s(i,j) = \exp \left( { - \frac{{D_{i,\,j}^{\rho } }}{{2\sigma_{i} \sigma_{j} }}} \right) $$
(6)

where the scale parameter \( \sigma_{i} = d(x_{i} ,x_{il} ) = \left\| {x_{i} - x_{il} } \right\| \) and xil is the l-th nearest neighbor of xi, so σi adapts to the neighborhood distribution of each data point. The manifold similarity enlarges the distance between two points on different manifolds and reduces the distance between two points on the same manifold.
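Continuing the sketch above (it reuses `manifold_distance` and the imports from the previous block; the neighbor index l is not fixed by the paper, so l = 7, a common choice in self-tuning spectral clustering, is our assumption):

```python
def manifold_similarity(X, rho=2.0, l=7):
    """Eq. (6): manifold similarity with locally adaptive scales sigma_i."""
    D = manifold_distance(X, rho)          # Eq. (5), from the sketch above
    d = cdist(X, X)
    sigma = np.sort(d, axis=1)[:, l]       # distance to the l-th nearest neighbor
    return np.exp(-D / (2.0 * np.outer(sigma, sigma)))
```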

4 K-AP Clustering Based on Manifold Similarity Measure

We use the manifold similarity measure to improve the K-AP clustering algorithm and propose the MKAP algorithm. MKAP constructs the similarity matrix with the manifold similarity measure and then iteratively optimizes the clustering objective function by passing messages. The detailed steps of the MKAP algorithm are given below.

Algorithm 1.

K-AP clustering algorithm based on manifold similarity measure

Input: data set \( X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{N} } \right\} \), cluster number k.

Output: k final clusters.

Step 1. Calculate the manifold distance \( D_{i,\,j}^{\rho } \) between each data pair (xi, xj) according to Eq. (5).

Step 2. Use the manifold distance \( D_{i,\,j}^{\rho } \) to calculate the similarity s(i, j) between each pair of points \( \left( {x_{i} ,x_{j} } \right) \) by Eq. (6), and construct the similarity matrix S.

Step 3. Initialize the 'availability' \( a\left( {i,j} \right) = 0 \) and the 'confidence' \( \eta^{out} (i) = \min (S) \).

Step 4. Iteratively update the ‘responsibility’, ‘availability’ and ‘confidence’ according to the following equations:

(1) Update the 'responsibility', \( \forall i,j \):

$$ r(i,j) = s(i,j) - \max \left\{ {\eta^{out} (i) + a(i,i),\;\mathop {\max }\limits_{{j^{\prime}:j^{\prime} \notin \{ i,\,j\} }} \left\{ {s(i,j^{\prime}) + a(i,j^{\prime})} \right\}} \right\} $$
(7)
$$ r(i,i) = \eta^{out} (i) - \mathop {\max }\limits_{{j^{\prime}:j^{\prime} \ne i}} \left\{ {s(i,j^{\prime}) + a(i,j^{\prime})} \right\} $$
(8)
(2) Update the 'availability', \( \forall i,j \):

$$ a(i,j) = \min \left\{ {0,\;r(j,j) + \sum\limits_{{i^{\prime}:i^{\prime} \notin \{ i,\,j\} }} {\max \left\{ {0,r(i^{\prime},j)} \right\}} } \right\} $$
(9)
$$ a(j,j) = \sum\limits_{{i^{\prime}:i^{\prime} \ne j}} {\max \left\{ {0,r(i^{\prime},j)} \right\}} $$
(10)
(3) Update the 'confidence', \( \forall i \):

$$ \eta^{in} (i) = a(i,i) - \mathop {\max }\limits_{{j^{\prime}:j^{\prime} \ne i}} \left\{ {s(i,j^{\prime}) + a(i,j^{\prime})} \right\} $$
(11)
$$ \eta^{out} (i) = - f^{k} \left( {\left\{ {\eta^{in} (j),\;j \ne i} \right\}} \right) $$
(12)

where fk(·) denotes the k-th largest value of its argument set \( \{ \eta^{in} (j),\; j \ne i \} \), \( i,j = 1,2, \ldots ,N \).

Step 5. Determine the best cluster center for each data point according to Eq. (13), repeating Step 4 until the algorithm converges.

$$ c_{i} = \mathop {\arg \max }\limits_{j} \left\{ {a(i,j) + r(i,j)} \right\} $$
(13)

Like K-AP, the message-passing stage of MKAP has time complexity \( O(N^{2}) \) per iteration. Because MKAP constructs the similarity matrix with the manifold similarity measure, it can describe the manifold relationships between data points well.
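For concreteness, here is a minimal Python sketch of Steps 3-5, written with explicit loops for readability rather than in vectorized form; the damping factor is a standard AP-style stabilizer that the paper does not specify:

```python
import numpy as np

def mkap_messages(S, K, maxits=1000, convits=100, damping=0.9):
    """Sketch of Steps 3-5: K-AP message passing on the manifold similarity S.

    S is the N x N matrix from Eq. (6). Returns each point's cluster center
    index per Eq. (13). The damping factor is our addition for stability.
    """
    N = S.shape[0]
    A = np.zeros((N, N))                  # availabilities a(i, j)
    R = np.zeros((N, N))                  # responsibilities r(i, j)
    eta_out = np.full(N, S.min())         # Step 3: confidences eta_out(i)
    labels = np.full(N, -1)
    stable = 0
    for _ in range(maxits):
        # Step 4(1): responsibilities, Eqs. (7)-(8).
        Rn = np.empty_like(R)
        for i in range(N):
            cand = S[i] + A[i]            # s(i, j') + a(i, j')
            for j in range(N):
                m = np.delete(cand, list({i, j})).max()
                if i == j:
                    Rn[i, i] = eta_out[i] - m
                else:
                    Rn[i, j] = S[i, j] - max(eta_out[i] + A[i, i], m)
        R = damping * R + (1.0 - damping) * Rn
        # Step 4(2): availabilities, Eqs. (9)-(10).
        An = np.empty_like(A)
        Rp = np.maximum(R, 0.0)
        for j in range(N):
            col = np.delete(Rp[:, j], j).sum()   # sum over i' != j
            An[j, j] = col                       # Eq. (10)
            for i in range(N):
                if i != j:                       # Eq. (9): also drop i' = i
                    An[i, j] = min(0.0, R[j, j] + col - Rp[i, j])
        A = damping * A + (1.0 - damping) * An
        # Step 4(3): confidences, Eqs. (11)-(12).
        eta_in = np.array([A[i, i] - np.delete(S[i] + A[i], i).max()
                           for i in range(N)])
        for i in range(N):
            eta_out[i] = -np.sort(np.delete(eta_in, i))[-K]  # K-th largest
        # Step 5: assign cluster centers by Eq. (13); stop once stable.
        new_labels = np.argmax(A + R, axis=1)
        stable = stable + 1 if np.array_equal(new_labels, labels) else 0
        labels = new_labels
        if stable >= convits:
            break
    return labels
```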

5 Experimental Analysis

5.1 Clustering on Synthetic Datasets

In the experiments, the clustering performance of the AP, K-AP and MKAP algorithms is compared on three challenging synthetic manifold datasets: 'two circles', 'two moons' and 'two spirals'. These datasets are illustrated in Fig. 2.

Fig. 2. Original synthetic datasets

In the experiments, the preference parameter p of the AP algorithm is set to the median of the similarity matrix, the maximum number of iterations is maxits = 1000, and the convergence window is convits = 100. The scaling factor of the MKAP algorithm is set to ρ = 2. The clustering results of the AP, K-AP and MKAP algorithms on the three synthetic datasets are presented in Fig. 3.
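Assuming the sketches from Sects. 3 and 4, these MKAP settings correspond to a call like the following (loading a synthetic dataset into X is left to the reader):

```python
# Usage with the settings above (rho = 2, K = 2 clusters, maxits = 1000,
# convits = 100); X is an n x 2 array holding one of the synthetic datasets.
S = manifold_similarity(X, rho=2.0)
labels = mkap_messages(S, K=2, maxits=1000, convits=100)
```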

Fig. 3. Clustering results of different algorithms on synthetic datasets

From Fig. 3 we can see that the AP algorithm tends to generate many small clusters: it is hard to control the number of clusters in its results, and it easily falls into local optima. In the K-AP algorithm the cluster number K is one of the clustering constraints, so K-AP returns the correct number of clusters on every dataset. But like AP, K-AP measures the similarity between points with the Euclidean distance and cannot recognize the complex manifold structure of these datasets. In contrast, the performance of the proposed MKAP algorithm is much better. With the help of the manifold similarity measure, MKAP is well suited to clustering manifold datasets: data points on the same manifold have high similarity, while data points on different manifolds are dissimilar to each other.

5.2 Clustering on Real World Datasets

To further test the effectiveness of the proposed MKAP algorithm, we compare it with other popular clustering algorithms on several real-world benchmark datasets [14]. The information on these datasets is shown in Table 1.

Table 1. Information of real world datasets

In the experiments, the adjusted Rand index (ARI) is used to evaluate clustering performance [15]. ARI is based on the relationships between pairs of data points and is calculated as:

$$ {\text{ARI}} = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)} $$
(14)

where a, b, c and d count the four kinds of data pairs: a counts pairs placed in the same class by both the ground truth and the clustering, b pairs in the same class but different clusters, c pairs in different classes but the same cluster, and d pairs separated in both partitions. ARI is at most 1 and is close to 0 for random labelings; the higher the ARI, the better the clustering quality.
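A minimal sketch of Eq. (14); the mapping of a, b, c and d to pair types below is our reading of the standard pair-counting form of ARI:

```python
from itertools import combinations

def adjusted_rand_index(y_true, y_pred):
    """Eq. (14), computed by enumerating all pairs of points."""
    a = b = c = d = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2 and p1 == p2:
            a += 1      # same class and same cluster
        elif t1 == t2:
            b += 1      # same class, split across clusters
        elif p1 == p2:
            c += 1      # different classes, merged into one cluster
        else:
            d += 1      # separated in both partitions
    return 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
```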

The clustering performance of the proposed MKAP algorithm is compared with that of the AP, K-AP and F-AP [16] algorithms. All experiments are conducted on a computer with a 3.20 GHz AMD Ryzen 5 1600 six-core processor and 8 GB of RAM, using MATLAB 2015b. The clustering results of the different algorithms are given in Table 2.

Table 2. Clustering results of different algorithms on real world datasets

According to Table 2, the F-AP algorithm runs much faster than the other algorithms, because it computes upper and lower estimates to limit the messages that must be updated in each iteration and dynamically detects converged messages to skip unneeded updates. However, it is not easy for the AP and F-AP algorithms to control the final cluster number, and their clustering performance is unsatisfactory on some datasets. Both K-AP and MKAP can make good use of prior knowledge and divide a dataset into a given number of clusters. K-AP, however, constructs the similarity matrix from the Euclidean distance between data points, which cannot properly describe the complex structure of many real-world datasets. As a result, the ARI scores of K-AP are worse than those of the proposed MKAP on most datasets. MKAP clusters with the manifold similarity measure and produces better clustering results.

6 Conclusions

In this paper, we propose a K-AP clustering algorithm based on a manifold similarity measure (MKAP). The K-AP algorithm does not work well on manifold data and easily falls into local optima. To improve its clustering performance, we design a manifold similarity measure that correctly describes the complex relationships between data points and reveals the internal structure of the dataset. With this measure, the MKAP algorithm maintains the global and local consistency of the clustering when assigning data points to groups. In the experiments, the proposed MKAP algorithm is compared with other popular affinity propagation clustering algorithms on both synthetic and real-world datasets, and the results demonstrate its effectiveness. In future work, we will consider improving the clustering efficiency of MKAP and applying it to practical problems such as character recognition, image segmentation and speech separation.