
1 Introduction

Unsupervised data analysis using clustering algorithms is a useful tool. The aim of clustering analysis is to discover the hidden structure of a dataset according to a similarity criterion, assigning the data points to a number of distinct clusters such that points in the same cluster are similar to each other while points from different clusters are dissimilar [1]. Clustering has been applied in a variety of scientific fields such as web search, social network analysis, image retrieval, medical imaging, gene expression analysis, recommendation systems and market analysis.

Kernel clustering methods can handle data sets that are not linearly separable in the input space [2] and therefore usually perform better than Euclidean distance based clustering algorithms [3]. Owing to its simplicity and efficiency, kernel k-means has become an active research topic. A kernel function maps the input data into a high-dimensional feature space, in which clusters that are not linearly separable in the input space become separable. However, a single kernel is sometimes insufficient to represent the data. Recently, multiple kernel clustering has therefore gained increasing attention in machine learning. Huang et al. propose a multiple kernel fuzzy c-means [4]; by incorporating multiple kernels and automatically adjusting the kernel weights, it becomes robust to ineffective kernels and irrelevant features. Zhou et al. use the maximum entropy method to regularize the kernel weights and identify the important kernels [5]. Gao applies multiple kernel fuzzy c-means to optimize clustering and presents a composite kernel built from a combination of Gaussian kernel functions assigned different weights [6]. Lu et al. apply a multiple kernel k-means clustering algorithm to SAR image change detection [7]; they fuse various features through a weighted summation kernel by automatically and optimally computing the kernel weights, which leads to a considerable computational burden. Zhang et al. propose a locally multiple kernel clustering which assigns to each cluster a weight vector for feature selection and combines it with a Gaussian kernel to form a unique kernel for the corresponding cluster [8]; they search for the scale parameter of the Gaussian kernel by running their clustering algorithm repeatedly over a range of values and selecting the best one. Tzortzis et al. overcome the kernel selection problem of maximum margin clustering by employing multiple kernel learning to jointly learn the kernel and a partitioning of the instances [9]. Yu et al. propose an optimized kernel k-means clustering which optimizes the cluster membership and kernel coefficients based on the same Rayleigh quotient objective [10]. Lu et al. improve a kernel evaluation measure based on centered kernel alignment, but their algorithm must be given the initial kernel fusion coefficients [11]. Although the above methods extend different clustering algorithms, they all employ the alternating optimization technique to solve their extended problems: cluster labels and kernel combination coefficients are optimized alternately until convergence.

Our algorithm is proposed from the perspective of the similarity measure: a local scale parameter is calculated for each data point, which reflects the local distribution of the dataset. In addition, another parameter, named the density factor, is introduced into the Gaussian kernel function; it describes the global structure of the data set and keeps kernel k-means from falling into a local optimum. Based on this improved similarity measure, our algorithm has several advantages. First, as a kernel method, it is well suited to datasets with multiple scales. Second, it fuses the local and global structures of datasets automatically and optimally. Furthermore, our algorithm does not need a large number of iterations to compute kernel weights until convergence.

The remainder of this paper is organized as follows: in Sect. 2 we introduce the related work. In Sect. 3 we give a detailed description of our algorithm. Section 4 presents the experimental results and evaluation of our algorithm. Finally, we conclude the paper in Sect. 5.

2 Related Work

2.1 Kernel K-Means

Girolami first proposed the kernel k-means clustering method. It maps the data points from the input space to a higher dimensional feature space through a nonlinear transformation \( \phi ( \cdot ) \) and then minimizes the clustering error in that feature space [12].

Let \( {\text{D}} = \{ x_{1} ,x_{2} , \ldots ,x_{n} \} \) be the data set of size n, k be the number of clusters required. The final partition of the entire data set is \( \Pi _{D} = \{ C_{1} ,C_{2} , \ldots ,C_{k} \} \). The objective function is to minimize the criterion function:

$$ {\text{J}} = \sum\nolimits_{j = 1}^{k} {\sum\nolimits_{{x_{i} \in C_{j} }} {\parallel \phi \left( {x_{i} } \right) - m_{j} \parallel^{2} } } $$
(1)

where \( m_{j} \) is the mean of cluster \( C_{j} \), that is,

$$ m_{j} = \sum\nolimits_{{x_{i} \in C_{j} }} {\frac{{\phi \left( {x_{i} } \right)}}{{|C_{j} |}}} $$
(2)

in the induced space.

$$ \begin{array}{*{20}c} {\parallel \phi \left( {x_{i} } \right) - m_{j} \parallel^{2} = \parallel \phi \left( {x_{i} } \right) - \sum\nolimits_{{x_{l} \in C_{j} }} {\frac{{\phi \left( {x_{l} } \right)}}{{|C_{j} |}}} \parallel^{2} } \\ { = \phi \left( {x_{i} } \right) \cdot \phi \left( {x_{i} } \right) - \frac{2}{{|C_{j} |}}\sum\nolimits_{{x_{l} \in C_{j} }} {\phi \left( {x_{l} } \right) \cdot \phi \left( {x_{i} } \right)} + \frac{1}{{|C_{j} |^{2} }}\sum\nolimits_{{x_{l} \in C_{j} }} {\sum\nolimits_{{x_{s} \in C_{j} }} {\phi \left( {x_{l} } \right) \cdot \phi \left( {x_{s} } \right)} } } \\ { = \kappa \left( {x_{i} ,x_{i} } \right) - \frac{2}{{|C_{j} |}}\sum\nolimits_{{x_{l} \in C_{j} }} {\kappa \left( {x_{i} ,x_{l} } \right)} + \frac{1}{{|C_{j} |^{2} }}\sum\nolimits_{{x_{l} \in C_{j} }} {\sum\nolimits_{{x_{s} \in C_{j} }} {\kappa \left( {x_{l} ,x_{s} } \right)} } } \\ \end{array} $$
(3)

Further, \( \parallel \phi \left( {x_{i} } \right) - m_{j} \parallel^{2} \) can be calculated without knowing the transformation \( \phi ( \cdot ) \) explicitly, as shown in formula (3).

Thus, only inner products are used in the computation of the Euclidean distance between a point and a centroid. These inner products can be stored in a kernel matrix \( \kappa \), where \( \kappa_{ij} = \phi \left( {x_{i} } \right) \cdot \phi \left( {x_{j} } \right) \); a kernel function is commonly used to map pairs of original points directly to such inner products. Given a data set, kernel k-means clustering proceeds through the steps sketched below.
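The original algorithm box is not reproduced here; as a rough illustration only, the following minimal Python sketch (hypothetical function name, assuming a precomputed kernel matrix `K`) implements the standard kernel k-means iteration using the distance expansion of formula (3).

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Standard kernel k-means on a precomputed n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)               # random initial assignment
    for _ in range(max_iter):
        dist = np.full((n, k), np.inf)
        for j in range(k):
            idx = np.where(labels == j)[0]
            if idx.size == 0:                         # skip empty clusters
                continue
            # ||phi(x_i) - m_j||^2 = K_ii - 2/|C_j| sum_l K_il + 1/|C_j|^2 sum_{l,s} K_ls
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, idx].sum(axis=1) / idx.size
                          + K[np.ix_(idx, idx)].sum() / idx.size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # stop when assignments no longer change
            break
        labels = new_labels
    return labels
```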

2.2 Multiple Kernel K-Means

A weighted summation kernel is a common tool in multiple kernel learning. Huang et al. incorporate a weighted summation kernel into kernel k-means, which results in the multiple kernel k-means (MKKM) algorithm [4]. MKKM is solved by iteratively updating the kernel weights. Its objective is to minimize

$$ J_{M} = \sum\nolimits_{j = 1}^{k} {\sum\nolimits_{{x_{i} \in C_{j} }} {\sum\nolimits_{m = 1}^{M} {w_{m}^{2} \parallel \phi_{m} \left( {x_{i} } \right) - m_{j} \parallel^{2} } } } $$
(4)
$$ w_{m} = \frac{{1/\beta_{m} }}{{1/\beta_{1} + 1/\beta_{2} + \cdots + 1/\beta_{M} }},\quad \beta_{m} = \sum\nolimits_{j = 1}^{k} {\sum\nolimits_{{x_{i} \in C_{j} }} {\parallel \phi_{m} \left( {x_{i} } \right) - m_{j} \parallel^{2} } } $$

where \( \{ \phi_{m} \}_{m = 1}^{M} \) are the mapping functions corresponding to the multiple kernel functions and \( w_{m} \,(m = 1,2, \ldots ,M) \) are the kernel weights.
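For concreteness, a minimal sketch of this closed-form weight update (hypothetical helper name; the per-kernel distortions \( \beta_{m} \) are assumed to be given) is shown below; each weight is proportional to the inverse of its kernel's distortion and the weights sum to one.

```python
import numpy as np

def mkkm_weights(betas):
    """Closed-form MKKM kernel weights w_m from per-kernel distortions beta_m."""
    inv = 1.0 / np.asarray(betas, dtype=float)   # w_m is proportional to 1/beta_m
    return inv / inv.sum()                       # normalize so the weights sum to one

# e.g. three base kernels with distortions 2.0, 4.0 and 8.0
print(mkkm_weights([2.0, 4.0, 8.0]))             # -> [0.571..., 0.285..., 0.142...]
```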

3 Locally Multiple Kernel K-Means

3.1 Similarity Measure

Selecting a suitable similarity measure is crucial in cluster analysis, as it serves as the basis for the partition [13]. To handle datasets with multiple scales, we calculate a local scaling parameter \( \sigma_{i} \) for each data point \( s_{i} \). The local scale \( \sigma_{i} \) is selected by studying the local statistics of the neighborhood of point \( s_{i} \): with \( s_{K} \) denoting the K'th nearest neighbor of point \( s_{i} \),

$$ \sigma_{i} = d(s_{i} ,s_{K} ) $$
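A minimal sketch of this local scale computation (hypothetical helper name, assuming \( d \) is the Euclidean distance and using brute-force pairwise distances) is given below.

```python
import numpy as np

def local_scales(X, K=7):
    """sigma_i = d(s_i, s_K), the distance from point i to its K-th nearest neighbour."""
    # brute-force pairwise Euclidean distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # after sorting each row, column 0 is the point itself (distance 0),
    # so column K holds the distance to the K-th nearest neighbour
    return np.sort(D, axis=1)[:, K]
```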

According to the clustering hypothesis, points within a cluster should lie in a high-density region, while points from different clusters should be separated by regions of low density [14]. In order to better describe the global structure of the data set and keep kernel k-means from falling into a local optimum, a density factor ρ is introduced to discover clusters of arbitrary shape. Combining ρ with the local scale parameter \( \sigma_{i} \), we propose a new similarity measure as follows:

$$ S_{ij} = { \exp }(\frac{{ - d^{2} (s_{i} ,s_{j} )}}{{\sigma_{i} \sigma_{j} \rho_{ij} }}) $$
(5)

The density factor is obtained in a simple and effective way. First, the k nearest neighbors of each point are found, and the k-nearest-neighbor graph is used to depict the local neighborhood relations between data points. The neighborhood of a point p is denoted by \( N(p) \). For a sample point q, if \( q \in N(p) \), we say that q is directly density-reachable from p. Given a sample set \( {\text{D}} = \{ p_{1} ,p_{2} , \ldots ,p_{n} \} \), if \( p_{i} \) is directly density-reachable from \( p_{i + 1} \) for every i, then \( p_{1} \) is density-reachable from \( p_{n} \). If there is a point o such that both p and q are density-reachable from o, we say that p is density-connected to q. Finally, from all directly density-reachable pairs, the density-connected sets are found with the union-find method, a simple and practical data structure mainly used for merging disjoint sets [15]. Let \( \rho_{ij} \) denote the density factor between points \( s_{i} \) and \( s_{j} \), as follows:

$$ \rho_{ij} = \left\{ {\begin{array}{*{20}l} {1,\;{\text{if}}\;s_{i}\;{\text{and}}\;s_{j}\;{\text{are in the same density-connected set}}} \hfill \\ {0,\;{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(6)
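As one possible realisation of this construction (hypothetical helper name; the density-connected sets are taken as the connected components of the k-nearest-neighbor graph, merged with a simple union-find structure), a sketch follows.

```python
import numpy as np

def density_factor(X, k=7):
    """rho_ij = 1 if points i and j fall in the same density-connected set, else 0."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    parent = np.arange(n)

    def find(i):                                  # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for p in range(n):
        for q in np.argsort(D[p])[1:k + 1]:       # k nearest neighbours of p (skip p itself)
            parent[find(p)] = find(q)             # merge the two density-connected sets

    roots = np.array([find(i) for i in range(n)])
    return (roots[:, None] == roots[None, :]).astype(float)
```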

3.2 Algorithm

From the perspective of the similarity measure, we propose a novel locally multiple kernel k-means algorithm (LMKKM). Its basic idea is as follows: first, calculate the local scale parameter σ and the density factor ρ; then, construct the kernel matrix based on the proposed similarity measure; finally, cluster the dataset with kernel k-means using this kernel matrix. The detailed steps of our algorithm are sketched below.
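Re-using the hypothetical helpers sketched earlier (`local_scales`, `density_factor`, `kernel_kmeans`), a rough end-to-end sketch of this procedure could look as follows; where \( \rho_{ij} = 0 \) the similarity is taken to be 0, in line with the limit of formula (5).

```python
import numpy as np

def lmkkm(X, n_clusters, K=7):
    """Sketch of LMKKM: build the similarity matrix of formula (5), then cluster."""
    sigma = local_scales(X, K)                    # step 1: local scale per point
    rho = density_factor(X, K)                    # step 1: density factor per pair
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    with np.errstate(divide="ignore", invalid="ignore"):
        S = np.exp(-D ** 2 / (np.outer(sigma, sigma) * rho))
    S[rho == 0] = 0.0                             # step 2: kernel matrix from formula (5)
    return kernel_kmeans(S, n_clusters)           # step 3: kernel k-means on S
```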

Suppose n is the total number of points in the data set. Our locally multiple kernel k-means contains three main components: calculating the parameters, constructing the similarity matrix, and clustering. In the parameter calculation phase, the complexities of the k-nearest-neighbor step and the union-find method are both \( O(n) \). The complexity of calculating the similarity matrix is \( O(n^{2} ) \). In the final clustering phase, the complexity of k-means is \( O(n) \). Thus our algorithm does not increase the complexity of kernel k-means, while it improves the clustering performance.

4 Experiments

4.1 Artificial Data Clustering

In order to verify the effectiveness of the improved algorithm, we choose three artificial data sets, "smile face", "four lines" and "blobs and circle", and compare against the kernel k-means (KKM) algorithm.

Figure 1 shows the KKM algorithm's clustering results on the artificial data sets. The scale parameter of the Gaussian kernel function is set to 1 empirically. KKM measures the similarity between points based on the Euclidean distance, which cannot reflect the intrinsic structure of the dataset. Thus, KKM can only gather similar points in a local region into a cluster; it neither satisfies the global coherence hypothesis of clustering nor recognizes the complex manifold structure of the dataset.

Fig. 1. KKM algorithm's clustering results on artificial data sets

Figure 2 shows the LMKKM algorithm's clustering results on the artificial datasets. It calculates the kernel matrix by formula (5). Once the similarity measure incorporates the density factor, it meets both the local and the global coherence hypotheses of clustering. In this way, the intra-class data points become more compact and the inter-class data points more separated.

Fig. 2. LMKKM algorithm's clustering results on artificial datasets

4.2 Clustering Results

In this subsection, our method is compared with three baseline methods: kernel k-means (KKM), self-tuning spectral clustering (SSC) [16], and locally adaptive multiple kernel clustering (LAMKC) [8]. SSC is a locally adaptive spectral clustering algorithm, and LAMKC is a recently proposed multiple kernel clustering algorithm extending from kernel k-means. We carry out experiments on seven UCI datasets, which are frequently used to test the performance of machine learning algorithms. The characteristics of these data sets are shown in Table 1.

Table 1. Data characteristics of real data sets

We use accuracy (ACC) to evaluate the clustering performance. Because the random initialization of the cluster centers of kernel k-means causes the results to fluctuate, each clustering experiment is repeated 20 times. The results in the first row are the means of the 20 trials, and the results in the second row (in parentheses) are the corresponding standard deviations. The neighborhood size K is set to 7. For all experiments, the number of clusters is set to the true cluster number of each dataset. For LAMKC, the stopping threshold of the gradient descent method is set to 0.0001 [8]. To compare the results fairly, we do not use the Kaufman approach to select a set of initial centroids for LAMKC. All experiments are conducted on an Intel Pentium G2030 CPU with a 3.00 GHz processor and 4 GB RAM running 64-bit Windows 7. The clustering results are shown in Table 2.

Table 2. Clustering results on real-world data sets

The results on the seven UCI datasets are shown in Table 2, with the best result for each dataset marked in boldface. Measured by ACC, the experimental results are encouraging: our algorithm obtains the best result on five of the seven datasets. Compared with kernel k-means and self-tuning spectral clustering, the performance of our algorithm is significantly better on all seven datasets. For the WDBC and Dermatology datasets, our algorithm is roughly comparable to LAMKC. In most cases, our algorithm captures the structure of the dataset and calculates appropriate parameters adaptively, whereas LAMKC has to search for the parameter of the Gaussian kernel over a range of values. This indicates that the improved similarity measure can capture the local and global structures of complex datasets, so that our algorithm can complete the clustering tasks effectively.

5 Conclusions

Conventional multiple kernel clustering algorithms aim to construct a global combination of multiple kernels in the input space and have to compute the kernel combination coefficients iteratively. In this paper, we proposed a locally multiple kernel clustering method based on an improved similarity measure. Our method is dedicated to datasets with varying local distributions. Instead of using a uniform combination of multiple kernels over the whole input space, our method associates a localized kernel with each data point and combines it with the density factor simultaneously. Taking both local and global structures into consideration, the similarity measure can depict the distribution of the dataset. The results of clustering experiments on artificial datasets and UCI datasets demonstrate that our locally multiple kernel clustering method can deal with datasets with multiple scales and does not fall into local optima.

There are three points remaining for further research. First, the time complexity of our algorithm is the same as that of kernel k-means, so it will still take considerable time on large data sets; further study is needed on how to reduce the time complexity and improve the efficiency of clustering. Second, kernel k-means is sensitive to the initial cluster centers, and we can improve kernel k-means from this perspective. Third, following the idea of this paper, better multiple kernel k-means methods can be constructed based on other kernel evaluation measures.