1 Introduction

Clustering is an important approach for analyzing the intrinsic structure of data. Affinity propagation (AP) clustering, proposed by Frey and Dueck [1], is a popular clustering method. AP clustering aims to find the optimal representative point, called the 'exemplar', for each data point. In many application domains, finding representative points is more useful than merely separating data points into several classes [2,3,4,5]. For example, the representative points extracted from a document can be used to summarize and refine it. Unlike k-means, the AP algorithm does not require specifying the initial cluster centers in advance [6, 7]. Instead, it regards every data point as a potential cluster center, thereby avoiding the arbitrariness of selecting initial cluster centers.

However, the AP clustering algorithm cannot directly specify the final number of clusters; the number of resulting clusters is controlled only indirectly by a user-defined parameter. To generate exactly K clusters, Zhang et al. [8] proposed the K-AP clustering algorithm. Like AP, K-AP first needs to construct a similarity matrix, so it is crucial to select an appropriate distance measure that describes the real structure of the dataset. Data points belonging to the same cluster should have high similarity and maintain spatial coherence [9]. The K-AP algorithm performs well on linearly separable data but is not suited to clustering manifold data, because it measures the similarity between data points with the Euclidean distance, which cannot correctly reflect the distribution of complex manifold datasets [10]. This significantly degrades the performance of K-AP and leads to poor clustering results. Based on the assumption that clusters should be locally and globally coherent, this paper designs a manifold similarity measure. We use a density-adjustable segment length to compute the distance between data points, so the measure describes the manifold data distribution much better. The manifold similarity measure is then used to improve the performance of the K-AP algorithm.

To overcome the difficulties that K-AP faces on manifold data, we propose a K-AP clustering algorithm based on a manifold similarity measure (MKAP). The rest of the paper is organized as follows: Sect. 2 introduces the basic theory of the K-AP clustering algorithm; Sect. 3 describes the manifold similarity measure; Sect. 4 presents the MKAP algorithm and details its steps; Sect. 5 verifies the effectiveness of the MKAP algorithm on synthetic and real-world datasets; the last section concludes the paper.

2 Basic K-AP Clustering

In the AP clustering algorithm, the number of clusters is controlled by the preference parameter, and it is not easy to set a preference value that yields the desired number of clusters [11]. The K-AP clustering algorithm solves this problem well: it takes the specified cluster number k as an input parameter and directly partitions the data points into k groups. K-AP searches for the optimal set of cluster representative points, maximizing an energy function by passing messages between data points. Equation (1) is the energy function of K-AP:

$$ E(\varepsilon ) = \sum\limits_{j = 1}^{K} \sum\limits_{x_{i} : c(x_{i} ) = e_{j}} s(x_{i} , e_{j} ) $$
(1)

where K is the number of clusters and hence of representative points; \( \varepsilon = \{ e_{1} , \cdots ,e_{K} \} \) is the set of representative points; c(xi) is the mapping from xi to its closest representative point; \( s(x_{i} ,e_{j} ) \) is the similarity between xi and the cluster representative point ej.

To find K representative points, we introduce binary variables \( \left\{ {b_{ij} \in \{ 0,1\} ,\;i,j = 1, \cdots ,N} \right\} \) to encode the assignment of representative points: \( b_{ij} = 1 \), \( i \ne j \), means that xi chooses xj as its representative point, and \( b_{ii} = 1 \) means that xi is itself a representative point. Then Eq. (1) is equivalent to Eq. (2):

$$ E(\left\{ {b_{ij} } \right\}) = \sum\limits_{i = 1}^{N} \sum\limits_{j = 1}^{N} b_{ij} \, s(x_{i} ,x_{j} ) $$
(2)

Equation (2) is maximized subject to three conditions: \( \sum\limits_{j = 1}^{N} {b_{ij} } = 1 \); \( b_{ii} = 1 \) if \( \exists \, b_{ji} = 1 \); \( \sum\limits_{i = 1}^{N} {b_{ii} } = K \). These conditions mean that: (a) every xi has exactly one representative point; (b) if any point xj selects xi as its representative point, then xi must itself be a representative point; (c) the number of representative points must be exactly K. The constraints can be encoded in a factor graph model, so the problem of finding K representative points turns into searching for the optimal values of bij in the factor graph. Equation (3) is the objective function of K-AP:

$$ F(b;s;K) = \prod\limits_{i = 1}^{N} \left( e^{b_{ii}} \prod\limits_{j = 1,\,j \ne i}^{N} e^{b_{ij} s(i,j)} \right) h(b_{11} , \cdots ,b_{NN} \,|\,K) \prod\limits_{j = 1}^{N} f_{j} (b_{1j} , \cdots ,b_{Nj} ) \prod\limits_{i = 1}^{N} g_{i} (b_{i1} , \cdots ,b_{iN} ) $$
(3)

where {gi}, {fj} and h are three constraint functions. The resulting optimization problem can be solved by the Belief Propagation (BP) method [8].
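To make Eq. (2) and its conditions concrete, here is a minimal Python sketch (the names, e.g. `exemplar_of`, are ours and not from [8]); it evaluates the energy of a candidate assignment, with the variables b encoded as a vector storing each point's chosen representative:

```python
import numpy as np

def kap_energy(S, exemplar_of):
    """Eq. (2): sum of similarities from every point to its chosen exemplar.

    S is the N x N similarity matrix; exemplar_of[i] = j encodes b_ij = 1.
    """
    idx = np.arange(len(exemplar_of))
    return S[idx, exemplar_of].sum()

def satisfies_constraints(exemplar_of, K):
    """Check conditions (a)-(c) below Eq. (2) for an assignment vector."""
    # (a) holds by construction: each point stores exactly one exemplar.
    exemplars = set(exemplar_of)
    # (b) every point chosen as an exemplar must also choose itself.
    chooses_itself = all(exemplar_of[j] == j for j in exemplars)
    # (c) there must be exactly K exemplars.
    return chooses_itself and len(exemplars) == K
```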

3 Manifold Similarity Measure

The standard K-AP clustering algorithm measures the similarity between data points with a Gaussian kernel function. The Gaussian kernel is based on the Euclidean distance, but the Euclidean distance is not a proper distance measure for manifold data. Figure 1 illustrates its shortcomings.

Fig. 1. Euclidean distance for manifold data

As Fig. 1 shows, points b and c lie on the same manifold, while points a and b lie on different manifolds. We would like the similarity between b and c to be greater than the similarity between a and b, so that b and c can be grouped into the same cluster. However, the Euclidean distance between a and b is significantly smaller than that between b and c. We assume that data pairs on the same manifold should have high similarity and data pairs on different manifolds should have low similarity [12], so this paper presents a manifold similarity function that satisfies this clustering assumption. First we define a segment length on manifold data.

Definition 1.

The length of line segment on manifold:

$$ L(x,y) = e^{\rho d(x,y)} - 1 $$
(4)

where \( d(x,y) = \left\| x - y \right\| \) is the Euclidean distance between the data points x and y, and ρ is called the scaling factor.
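As a minimal Python illustration of Eq. (4) (the default ρ = 2 below matches the experimental setting in Sect. 5 but is otherwise arbitrary):

```python
import numpy as np

def segment_length(x, y, rho=2.0):
    """Eq. (4): density-adjustable length of the segment between x and y.

    Short Euclidean gaps stay nearly linear (e^(rho*d) - 1 ~ rho*d), while
    long gaps are stretched exponentially, penalizing "shortcut" edges that
    jump between manifolds.
    """
    d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return np.exp(rho * d) - 1.0
```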

If two points lie on the same manifold, we assume there is a path inside the manifold connecting them, and we can use the length of this path as the manifold distance between the two points [13]. Based on the segment length on a manifold, a new distance measure, the manifold distance measure, is given in Definition 2.

Definition 2.

Manifold distance measure: Given an undirected weighted graph \( G = \left( {V,E} \right) \), let \( p = \left\{ {v_{1} ,v_{2} , \ldots ,v_{|p|} } \right\} \in V^{|p|} \) denote a path between vertices v1 and v|p|, where \( \left| p \right| \) is the number of vertices on path p and every edge \( \left( {v_{k} ,v_{k + 1} } \right) \in E \), \( 1 \le k < \left| p \right| \). Let Pij denote the set of all paths connecting the point pair \( \left\{ {x_{i} ,x_{j} } \right\} \) \( (1 \le i,j \le N) \). Then the manifold distance between xi and xj is

$$ D_{i,\,j}^{\rho } = \frac{1}{{\rho^{2} }}\left[ \ln \left( 1 + d_{sp} (x_{i} ,x_{j} ) \right) \right]^{2} $$
(5)

where \( d_{sp} (x_{i} ,x_{j} ) = \mathop {\min }\limits_{{p \in P_{ij} }} \sum\limits_{k = 1}^{|p| - 1} {L(v_{k} ,v_{k + 1} )} \) is the length of the shortest path between xi and xj on graph G, and \( L\left( {v_{k} ,v_{k + 1} } \right) \) is the manifold segment length between adjacent points on that shortest path.
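Below is a minimal Python sketch of Eq. (5). It assumes a fully connected graph over the data, since the paper does not specify the graph construction (a k-nearest-neighbor graph would also fit Definition 2), and it reads the square in Eq. (5) as applying to the logarithm:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def manifold_distance(X, rho=2.0):
    """Eq. (5): pairwise manifold distances for the rows of X."""
    d = cdist(X, X)                        # Euclidean distances d(x_i, x_j)
    W = np.exp(rho * d) - 1.0              # Eq. (4) as edge weights
    d_sp = shortest_path(W, method="D")    # Dijkstra: d_sp(x_i, x_j)
    return np.log1p(d_sp) ** 2 / rho ** 2  # Eq. (5)
```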

Definition 3.

According to the above manifold distance measure, the manifold similarity between data points xi and xj is defined as

$$ s(i,j) = \exp \left( { - \frac{{D_{i,\,j}^{\rho } }}{{2\sigma_{i} \sigma_{j} }}} \right) $$
(6)

where the scale parameter \( \sigma_{i} = d(x_{i} ,x_{il} ) = \left\| {x_{i} - x_{il} } \right\| \) and xil is the l-th nearest neighbor of xi, so σi adapts to the neighborhood distribution of each data point. The manifold similarity enlarges the distance between two points on different manifolds and reduces the distance between two points on the same manifold.
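Continuing the sketch above (it reuses `manifold_distance` and the imports from the previous block; the neighbor index l is not fixed by the paper, so l = 7, a common choice in self-tuning spectral clustering, is our assumption):

```python
def manifold_similarity(X, rho=2.0, l=7):
    """Eq. (6): manifold similarity with locally adaptive scales sigma_i."""
    D = manifold_distance(X, rho)          # Eq. (5), from the sketch above
    d = cdist(X, X)
    sigma = np.sort(d, axis=1)[:, l]       # distance to the l-th nearest neighbor
    return np.exp(-D / (2.0 * np.outer(sigma, sigma)))
```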

4 K-AP Clustering Based on Manifold Similarity Measure

We use the manifold similarity measure to improve the K-AP clustering algorithm and propose the MKAP algorithm. MKAP constructs the similarity matrix with the manifold similarity measure and then iteratively optimizes the clustering objective function by passing messages. The detailed steps of the MKAP algorithm are given below.

Algorithm 1.

K-AP clustering algorithm based on manifold similarity measure

Input: data set \( X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{N} } \right\} \), cluster number k.

Output: k final clusters.

Step 1. Calculate the manifold distance \( D_{i,\,j}^{\rho } \) between each data pair (xi, xj) according to Eq. (5).

Step 2. Use the manifold distance \( D_{i,\,j}^{\rho } \) to calculate the similarity s(i, j) between each pair of points \( \left( {x_{i} ,x_{j} } \right) \) by Eq. (6), and construct the similarity matrix S.

Step 3. Initialize the 'availability' \( a\left( {i,j} \right) = 0 \) and the 'confidence' \( \eta^{out} (i) = \min (S) \).

Step 4. Iteratively update the ‘responsibility’, ‘availability’ and ‘confidence’ according to the following equations:

(1) Update the 'responsibility', \( \forall i,j \):

$$ r(i,j) = s(i,j) - \max \left\{ {\eta^{out} (i) + a(i,i),\;\mathop {\max }\limits_{{j^{\prime}:j^{\prime} \notin \{ i,\,j\} }} \left\{ {s(i,j^{\prime}) + a(i,j^{\prime})} \right\}} \right\} $$
(7)
$$ r(i,i) = \eta^{out} (i) - \mathop {\max }\limits_{{j^{\prime}:j^{\prime} \ne i}} \left\{ {s(i,j^{\prime}) + a(i,j^{\prime})} \right\} $$
(8)
(2) Update the 'availability', \( \forall i,j \):

$$ a(i,j) = \min \left\{ {0,\;r(j,j) + \sum\limits_{{i^{\prime}:i^{\prime} \notin \{ i,\,j\} }} {\max \left\{ {0,r(i^{\prime},j)} \right\}} } \right\} $$
(9)
$$ a(j,j) = \sum\limits_{{i^{\prime}:i^{\prime} \ne j}} {\max \left\{ {0,r(i^{\prime},j)} \right\}} $$
(10)
(3) Update the 'confidence', \( \forall i \):

$$ \eta^{in} (i) = a(i,i) - \mathop {\max }\limits_{{j^{\prime}:j^{\prime} \ne i}} \left\{ {s(i,j^{\prime}) + a(i,j^{\prime})} \right\} $$
(11)
$$ \eta^{out} (i) = - f^{k} \left( {\left\{ {\eta^{in} (j),\;j \ne i} \right\}} \right) $$
(12)

where fk(·) denotes the k-th largest value of its argument set \( \{ \eta^{in} (j),\; j \ne i \} \), \( i,j = 1,2, \ldots ,N \).

Step 5. Determine the best cluster center for each data point according to Eq. (13), repeating Step 4 until the algorithm converges.

$$ c_{i} = \mathop {\arg \max }\limits_{j} \left\{ {a(i,j) + r(i,j)} \right\} $$
(13)

Like K-AP, the message-passing stage of MKAP has time complexity \( O(N^{2}) \) per iteration. Because MKAP constructs the similarity matrix with the manifold similarity measure, it can describe the manifold relationships between data points well.
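For concreteness, here is a minimal Python sketch of Steps 3-5, written with explicit loops for readability rather than in vectorized form; the damping factor is a standard AP-style stabilizer that the paper does not specify:

```python
import numpy as np

def mkap_messages(S, K, maxits=1000, convits=100, damping=0.9):
    """Sketch of Steps 3-5: K-AP message passing on the manifold similarity S.

    S is the N x N matrix from Eq. (6). Returns each point's cluster center
    index per Eq. (13). The damping factor is our addition for stability.
    """
    N = S.shape[0]
    A = np.zeros((N, N))                  # availabilities a(i, j)
    R = np.zeros((N, N))                  # responsibilities r(i, j)
    eta_out = np.full(N, S.min())         # Step 3: confidences eta_out(i)
    labels = np.full(N, -1)
    stable = 0
    for _ in range(maxits):
        # Step 4(1): responsibilities, Eqs. (7)-(8).
        Rn = np.empty_like(R)
        for i in range(N):
            cand = S[i] + A[i]            # s(i, j') + a(i, j')
            for j in range(N):
                m = np.delete(cand, list({i, j})).max()
                if i == j:
                    Rn[i, i] = eta_out[i] - m
                else:
                    Rn[i, j] = S[i, j] - max(eta_out[i] + A[i, i], m)
        R = damping * R + (1.0 - damping) * Rn
        # Step 4(2): availabilities, Eqs. (9)-(10).
        An = np.empty_like(A)
        Rp = np.maximum(R, 0.0)
        for j in range(N):
            col = np.delete(Rp[:, j], j).sum()   # sum over i' != j
            An[j, j] = col                       # Eq. (10)
            for i in range(N):
                if i != j:                       # Eq. (9): also drop i' = i
                    An[i, j] = min(0.0, R[j, j] + col - Rp[i, j])
        A = damping * A + (1.0 - damping) * An
        # Step 4(3): confidences, Eqs. (11)-(12).
        eta_in = np.array([A[i, i] - np.delete(S[i] + A[i], i).max()
                           for i in range(N)])
        for i in range(N):
            eta_out[i] = -np.sort(np.delete(eta_in, i))[-K]  # K-th largest
        # Step 5: assign cluster centers by Eq. (13); stop once stable.
        new_labels = np.argmax(A + R, axis=1)
        stable = stable + 1 if np.array_equal(new_labels, labels) else 0
        labels = new_labels
        if stable >= convits:
            break
    return labels
```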

5 Experimental Analysis

5.1 Clustering on Synthetic Datasets

In the experiments, the clustering performance of the AP, K-AP and MKAP algorithms is compared on three challenging synthetic manifold datasets: 'two circles', 'two moons' and 'two spirals'. These datasets are illustrated in Fig. 2.

Fig. 2. Original synthetic datasets

In the experiments, the preference parameter p of the AP algorithm is set to the median of the similarity matrix, the maximum number of iterations is maxits = 1000, and the convergence window is convits = 100. The scaling factor of the MKAP algorithm is set to ρ = 2. The clustering results of the AP, K-AP and MKAP algorithms on the three synthetic datasets are presented in Fig. 3.
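Assuming the sketches from Sects. 3 and 4, these MKAP settings correspond to a call like the following (loading a synthetic dataset into X is left to the reader):

```python
# Usage with the settings above (rho = 2, K = 2 clusters, maxits = 1000,
# convits = 100); X is an n x 2 array holding one of the synthetic datasets.
S = manifold_similarity(X, rho=2.0)
labels = mkap_messages(S, K=2, maxits=1000, convits=100)
```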

Fig. 3. Clustering results of different algorithms on synthetic datasets

From Fig. 3 we can see that the AP algorithm tends to generate many small clusters: it is hard to control the number of clusters in its results, and it easily falls into local optima. In the K-AP algorithm the cluster number K is one of the clustering constraints, so K-AP returns the correct number of clusters on every dataset. But like AP, K-AP measures the similarity between points with the Euclidean distance and cannot recognize the complex manifold structure of these datasets. In contrast, the performance of the proposed MKAP algorithm is much better. With the help of the manifold similarity measure, MKAP is well suited to clustering manifold datasets: data points on the same manifold have high similarity, while data points on different manifolds are dissimilar to each other.

5.2 Clustering on Real World Datasets

To further test the effectiveness of the proposed MKAP algorithm, we compare it with other popular clustering algorithms on several real-world benchmark datasets [14]. The information on these datasets is shown in Table 1.

Table 1. Information of real world datasets

In the experiments, the adjusted Rand index (ARI) is used to evaluate clustering performance [15]. ARI is based on the relationships between pairs of data points and is calculated as:

$$ {\text{ARI}} = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)} $$
(14)

where a, b, c and d count the four kinds of data pairs: a counts pairs placed in the same class by both the ground truth and the clustering, b pairs in the same class but different clusters, c pairs in different classes but the same cluster, and d pairs separated in both partitions. ARI is at most 1 and is close to 0 for random labelings; the higher the ARI, the better the clustering quality.
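A minimal sketch of Eq. (14); the mapping of a, b, c and d to pair types below is our reading of the standard pair-counting form of ARI:

```python
from itertools import combinations

def adjusted_rand_index(y_true, y_pred):
    """Eq. (14), computed by enumerating all pairs of points."""
    a = b = c = d = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2 and p1 == p2:
            a += 1      # same class and same cluster
        elif t1 == t2:
            b += 1      # same class, split across clusters
        elif p1 == p2:
            c += 1      # different classes, merged into one cluster
        else:
            d += 1      # separated in both partitions
    return 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
```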

The clustering performance of the proposed MKAP algorithm is compared with that of the AP, K-AP and F-AP [16] algorithms. All experiments are conducted on a computer with a 3.20 GHz AMD Ryzen 5 1600 six-core processor and 8 GB of RAM, using MATLAB 2015b. The clustering results of the different algorithms are given in Table 2.

Table 2. Clustering results of different algorithms on real world datasets

According to Table 2, the F-AP algorithm runs much faster than the other algorithms, because it computes upper and lower estimates to limit the messages that must be updated in each iteration and dynamically detects converged messages to skip unneeded updates. However, it is not easy for the AP and F-AP algorithms to control the final cluster number, and their clustering performance is unsatisfactory on some datasets. Both K-AP and MKAP can make good use of prior knowledge and divide a dataset into a given number of clusters. K-AP, however, constructs the similarity matrix from the Euclidean distance between data points, which cannot properly describe the complex structure of many real-world datasets. As a result, the ARI scores of K-AP are worse than those of the proposed MKAP on most datasets. MKAP clusters with the manifold similarity measure and produces better clustering results.

6 Conclusions

In this paper, we propose a K-AP clustering algorithm based on a manifold similarity measure (MKAP). The K-AP algorithm does not work well on manifold data and easily falls into local optima. To improve its clustering performance, we design a manifold similarity measure that correctly describes the complex relationships between data points and reveals the internal structure of the dataset. With this measure, the MKAP algorithm maintains the global and local consistency of the clustering when assigning data points to groups. In the experiments, the proposed MKAP algorithm is compared with other popular affinity propagation clustering algorithms on both synthetic and real-world datasets, and the results demonstrate its effectiveness. In future work, we will consider improving the clustering efficiency of MKAP and applying it to practical problems such as character recognition, image segmentation and speech separation.