
1 Introduction

Clustering high-dimensional data is an area that has attracted considerable research interest over the past two decades [1, 3, 6]. The existence of irrelevant features and of correlations between subsets of features, which are commonly encountered in such datasets, renders the task of identifying clusters much harder, as distances between observations become less informative about the cluster structure. Dimensionality reduction and feature embedding are widely used to improve clustering performance and to enable the visualization of the resulting cluster structure in such data. Although well-established methods like Principal Component Analysis (PCA) and metric Multi-Dimensional Scaling (MDS) [2] have been successfully applied to a plethora of high-dimensional applications, there is no guarantee that the cluster structure of the high-dimensional space will be preserved in the low-dimensional subspace, since in many cases clusters are defined by highly nonlinear structures. For this purpose, nonlinear dimensionality reduction techniques have been explicitly designed to identify a lower-dimensional manifold along which the data lie, and are therefore appropriate for distinguishing nonlinearly separable clusters.

Kernel-based clustering is amongst the most popular approaches to nonlinear clustering; it projects the input data points into a high-dimensional kernel space in order to make nonlinear clusters linearly separable [5]. In particular, kernel k-means combines the k-means method with the kernel trick in an attempt to deal with nonlinearly separable data; however, specifying a suitable kernel function and appropriate parameters is often a hard task. Another widely used manifold learning method is isometric mapping (Isomap). Instead of using the Euclidean distance, Isomap is based on approximating geodesic distances along the manifold [7]. However, Isomap operates on neighboring data points defined by a Euclidean distance threshold, which accelerates the algorithm but causes problems in the case of outlier points. In [8] the authors applied the k-means clustering algorithm after Isomap and proposed a modified definition of the geodesic distances, but concluded that even their modified method was unsatisfactory in real-data cases where the data are noisy or the clusters are highly nonlinear.

In this work we propose a new topology-preserving distance that follows the geodesics of the underlying manifold. Instead of imposing a threshold on the distance between points, we construct a graph over all available points and impose a penalty function that penalizes distant points. The definition of distant points relies on a characteristic distance parameter whose value is automatically estimated from the available dataset. Furthermore, we propose a modification of the k-means algorithm that incorporates the benefits of the newly introduced distance metric, which preserves the topology of the data-point manifold. A critical advantage of the proposed approach is its robust and largely automatic parameterization. Extensive experiments on both simulated and real datasets, employing the purity and V-measure metrics for comparison as described in [4], provide further evidence of the wide applicability of the proposed method.

2 Methodology

2.1 The Proposed Topology-Preserving Distance Metric

Let \( {\mathbf{P}} \) be a data matrix of dimensions N × K, each row of which is a feature vector (equivalently a data point) \( {\mathbf{p}} = \left( {x_{1} ,x_{2} , \ldots ,x_{K} } \right) \) of dimensionality K. Any given set of such points may be arranged on an unknown manifold in the \( \Re^{K} \) space. Thus, the Euclidean distance metric between any two points may not represent their actual distance.

Let us define an auxiliary distance metric between any pair of data points as:

$$ D_{ij} = \left\{ {\begin{array}{*{20}c} {\lambda d_{ij} ,d_{ij} > d_{0} } \\ {d_{ij} ,d_{ij} \le d_{0} } \\ \end{array} } \right. $$
(1)

where \( d_{ij} = \left\| {{\mathbf{p}}_{{\mathbf{i}}} - {\mathbf{p}}_{{\mathbf{j}}} } \right\| \) is the Euclidean distance between the two points, λ is a sufficiently large value and d0 is a characteristic distance whose value is estimated from the current dataset, as will be described later. Let us stress that Dij is not the distance metric proposed in this work, but rather an auxiliary definition.

For a given set of data points \( P = \left\{ {{\mathbf{p}}_{{\mathbf{i}}} } \right\},i = 1,2, \ldots ,N \) in \( \Re^{K} \), a fully connected graph \( G = \left( {P,E} \right) \) is defined, with vertices \( P \) the set of all data points and edges E the set of all possible connections between vertices, \( E = \left\{ {\left( {i,j} \right)} \right\},i,j = 1,2, \ldots ,N \). Thus, each point is connected to all other points in the dataset. The cost of the connection (edge) between any pair of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is set equal to their auxiliary distance Dij, as defined in Eq. (1). The proposed topology-preserving distance between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is defined as the cost of the minimum-cost path \( \pi_{\text{ij}} \) between the two points, computed with the well-known Dijkstra's algorithm.

$$ A\left( {{\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} } \right) = A_{ij} = \text{cost}\left( {\pi_{\text{ij}} } \right),i,j = 1,2, \ldots ,N $$
(2)

Since any generated path \( \pi_{\text{i,j}} \) between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) consists of an ordered series of data points with indices \( \left( {i_{1} ,i_{2} , \ldots ,i_{M} } \right) \), where \( i_{1} = i,\,i_{M} = j \), the proposed topology-preserving distance between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is calculated as

$$ A_{i,j} = {\text{cost}}(\pi_{\text{i,j}} ) = \sum\limits_{m = 1}^{M - 1} {D_{m,m + 1} ,i = i_{1} ,j = i_{M} } $$
(3)

The parameter d0 is a characteristic length in \( \Re^{K} \) that defines the scale of local linearity in a given set of data points. It is self-evident that any two points with Euclidean distance less than or equal to d0 will be connected directly, without any intermediate points. On the other hand, for any two points with Euclidean distance greater than d0, the proposed algorithm will generate a connecting path through intermediate points, provided that a chain of intermediate points exists in which consecutive points are no more than d0 apart.
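
To make the construction concrete, the following Python sketch (assuming NumPy and SciPy; the function name and the default λ = 10^8 are our choices, the latter matching the value used for Fig. 1) computes the full matrix of topology-preserving distances of Eqs. (2)-(3).

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

def topology_preserving_distances(P, d0, lam=1e8):
    """All-pairs distances A_ij of Eqs. (2)-(3): shortest-path costs over the
    fully connected graph whose edge weights are the auxiliary D_ij of Eq. (1)."""
    d = squareform(pdist(P))           # Euclidean distances d_ij, shape (N, N)
    D = np.where(d > d0, lam * d, d)   # penalize edges longer than d0
    # Shortest-path costs over the dense (fully connected) graph.
    return dijkstra(D, directed=False)
```

The penalty λ effectively removes long edges from consideration while keeping the graph formally fully connected, so no hard distance threshold has to be imposed on the graph construction.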

Figure 1 shows the paths generated by Dijkstra's algorithm using the auxiliary distance Dij, in the case of data points generated from the Swiss roll dataset [11], for λ = 10^8 and d0 = 2, using one randomly selected point i0 as the source for Dijkstra's algorithm (the point from which all other distances are calculated). The dataset is constructed to contain 2 classes (Nc = 2) of 400 points each, denoted by different colors. The points lie on a manifold that is defined by a parametric equation. The paths \( \pi_{{{\text{i}}_{ 0} , {\text{j}}}} \) are also plotted as blue lines for all points j = 1, 2, …, 800, j ≠ i0. As can be observed, the selected value of d0 generates shortest paths that lie on the manifold, rather than crossing the gap, as would be dictated by the Euclidean distance. The use of the proposed distance metric in any clustering or classification process on this dataset is therefore expected to significantly increase the achieved accuracy.

Fig. 1. The paths generated by the proposed distance definition in the case of the Swiss Roll dataset, for d0 = 2. The points lie on a manifold and consist of two classes shown in green and red color. (Color figure online)

2.2 Determining the Value of d0 Parameter

The parameter d0 is very important for the efficient operation of the algorithm. A low value of d0 would cause \( D_{ij} = \lambda d_{ij} \) for almost all pairs of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \). Consequently, Dijkstra's algorithm would return the single edge from \( {\mathbf{p}}_{{\mathbf{i}}} \) to \( {\mathbf{p}}_{{\mathbf{j}}} \), rather than generating a path with intermediate points. Equation (3) would then reduce to \( A_{ij} = D_{ij} \), so the proposed metric would be equivalent (up to the constant λ) to the Euclidean one. On the other hand, a high value of d0 would cause \( D_{ij} = d_{ij} \) for almost all pairs of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \), also resulting in a single-edge path, just as described previously.

Let i0 be the index of a randomly selected point. The proposed algorithm is executed N − 1 times, connecting i0 with each of the remaining N − 1 points \( {\mathbf{p}}_{{\mathbf{j}}} \), calculating the distance \( A_{{i_{0} j}} \) and generating the corresponding path \( \pi_{{i_{0} j}} \). Let us denote the sequence of points that constitute the path from i0 to j as \( \left( {i_{1} ,i_{2} , \ldots ,i_{M} } \right) \), with \( i_{1} = i_{0} ,\,i_{M} = j \), and the series of Euclidean distances along its steps as \( \left\{ {d_{{i_{m} ,i_{m + 1} }} } \right\},m = 1,2, \ldots ,M - 1 \). To simplify notation, let us write \( d_{{i_{m} ,i_{m + 1} }} = d_{m,m + 1} \). Let \( d_{{i_{0} j}}^{\hbox{max} } = \hbox{max} \left\{ {d_{m,m + 1} } \right\},m = 1,2, \ldots ,M - 1 \), be the maximum Euclidean length of the steps in the path from i0 to j. Then the average of \( d_{{i_{0} j}}^{\hbox{max} } \) can be calculated for the selected point i0 over all other points j in the dataset:

$$ \left\langle {d_{i} } \right\rangle = \frac{1}{N - 1}\sum\limits_{\begin{subarray}{l} j = 1, \\ j \ne i \end{subarray} }^{N} {d_{i,j}^{\hbox{max} } } $$
(4)

By its definition, \( \left\langle {d_{i} } \right\rangle \) is calculated for a selected point i0 and is a function of d0. It is easily shown that when d0 takes very low values, below the minimum Euclidean distance dmin in the dataset, \( \left\langle {d_{i} } \right\rangle \) is equal to the mean Euclidean distance between i0 and all data points. In the case of data points that are evenly distributed (e.g. on a regular grid), the quantity \( \left\langle {d_{i} } \right\rangle \) is expected to increase monotonically as d0 increases in [dmin, dmax]. When d0 approaches dmax, \( \left\langle {d_{i} } \right\rangle \) becomes equal to the mean Euclidean distance between i0 and all data points and remains constant for larger values of d0. In the case of an anisotropic data-point distribution, however, \( \left\langle {d_{i} } \right\rangle \) drops sharply when d0 takes an appropriate intermediate value, since the proposed algorithm then generates connecting paths between data points that consist of steps with smaller distances. Finally, when d0 approaches the maximum Euclidean distance in the dataset dmax, \( \left\langle {d_{i} } \right\rangle \) again becomes equal to the mean Euclidean distance between i0 and all data points. Thus, for any point i in the dataset, the quantity \( \left\langle {d_{i} } \right\rangle \) is calculated for different values of \( d_{0} \), and the value of \( d_{0} = d_{\hbox{min} }^{i} \) that produces the minimal value is determined:

$$ d_{\hbox{min} }^{i} = \mathop {\arg \hbox{min} }\limits_{{d_{0} }} \left\langle {d_{i} } \right\rangle ,\,d_{0} \in \left[ {d_{\hbox{min} } ,d_{\hbox{max} } } \right] $$
(5)

In the special case of data points lying on a manifold with large-scale concavities, \( d_{\hbox{min} }^{i} \) takes a value that indicates the characteristic length of the concavities. Thus, setting d0 to a value less than \( d_{\hbox{min} }^{i} \) will cause the proposed algorithm to produce connecting paths between data points that do not cross the concavities but lie on the manifold, thus behaving like geodesic curves. In order to obtain a good estimate of \( d_{\hbox{min} }^{i} \), the calculation is repeated for many randomly selected points in the dataset and the average \( \bar{d}_{\hbox{min} }^{i} \) is obtained.
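
A possible implementation of this estimation procedure is sketched below (again assuming NumPy/SciPy; the function names, the grid resolution and the default λ are our choices and are not specified in the text).

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

def mean_max_step(d, i0, d0, lam=1e8):
    """<d_i> of Eq. (4) for source point i0: the maximum Euclidean step of each
    shortest path from i0, averaged over all destinations j != i0."""
    D = np.where(d > d0, lam * d, d)   # auxiliary distances of Eq. (1)
    _, pred = dijkstra(D, directed=False, indices=i0, return_predecessors=True)
    max_steps = []
    for j in range(d.shape[0]):
        if j == i0:
            continue
        step_max, node = 0.0, j
        while pred[node] >= 0:         # walk the predecessor chain back to i0
            step_max = max(step_max, d[node, pred[node]])
            node = pred[node]
        max_steps.append(step_max)
    return float(np.mean(max_steps))

def estimate_d0(P, i0, n_grid=50):
    """Sweep d0 over [d_min, d_max] and return d_min^i, the minimizer of <d_i> (Eq. (5))."""
    d = squareform(pdist(P))
    grid = np.linspace(d[d > 0].min(), d.max(), n_grid)
    scores = [mean_max_step(d, i0, d0) for d0 in grid]
    return grid[int(np.argmin(scores))]

# As described above, the estimate is averaged over several randomly selected
# source points i0 to obtain the final characteristic distance.
```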

Figure 2 shows \( \left\langle {d_{i} } \right\rangle \) calculated for 50 randomly selected points of a two-dimensional dataset, for d0 in [dmin, dmax]. Two different datasets are used to demonstrate the aforementioned behavior: (a) 2601 points placed on a regular grid on the plane and (b) 2601 points placed randomly on the plane, following a uniform distribution. In both cases the [0, 10] × [0, 10] part of the plane is used. \( \left\langle {d_{i} } \right\rangle \) is plotted against the values of d0. The average \( \bar{d}_{\hbox{min} }^{i} \) (over all 50 points) is plotted as a green circle, while the average curve is overlaid as a thick line. The aforementioned behavior is apparent in both datasets, as is the consistency across the different points.

Fig. 2. The \( \left\langle {d_{i} } \right\rangle \) as a function of d0, for a number of data points, and the determined \( \bar{d}_{\hbox{min} }^{i} \) plotted as a green circle, for data points (a) on a 2D regular grid and (b) randomly distributed. (Color figure online)

The same process is applied to the Swiss Roll dataset [11] using 800 points in 2 classes. The quantity \( \left\langle {d_{i} } \right\rangle \) is plotted for d0 in [dmin, dmax] = [0.0072, 48.4663]. The average \( \bar{d}_{\hbox{min} }^{i} \) (over all 40 selected points) was found to be 3.6431 (Fig. 3).

Fig. 3. The shape of \( \left\langle {d_{i} } \right\rangle \) as a function of d0, for the Swiss roll dataset. The estimated \( \bar{d}_{\hbox{min} }^{i} \) is plotted as a green circle. (Color figure online)

Fig. 4. The minimum-cost paths generated by the proposed algorithm applied to the Swiss roll dataset, for different values of d0. Intermediate values of d0, as suggested by the estimation of the average \( \bar{d}_{\hbox{min} }^{i} \), produce paths that follow the underlying manifold.

The effect of d0 is demonstrated in Fig. 4. One point is randomly selected from the dataset and the minimum cost connections with all other points are shown. The cost of the connecting path is calculated according to Eq. (3), as a function of d0. It can be observed that for very low or very high values of d0, the connecting paths cross the topological gap between points, whereas for intermediate values, the paths follow the manifold, as geodesic curves.

2.3 A k-Means Variant for the Proposed Topology-Preserving Distance Metric

In this subsection a variant of the k-means clustering algorithm is proposed that utilizes the proposed topology-preserving distance metric. The main differences from the classic k-means algorithm can be summarized as follows:

The N × N distance matrix is calculated using the proposed distance metric in Eq. (2) (it requires the characteristic length d0): \( {\mathbf{A}} = \left\{ {A\left( {{\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} } \right)} \right\} = \left\{ {A_{ij} } \right\},i,j = 1,2, \ldots ,N \).

The class centers in each iteration are selected from the data points so that each center minimizes the average distance (in the proposed metric) from the members of its class. The algorithm is terminated when all class centers remain unchanged in two consecutive iterations. The details of the proposed algorithm are given below.

Input: the data matrix \( P \), the number of classes \( N_{c} \), the N × N distance matrix A using the proposed metric.

(Algorithm listing given in figure a.)
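
Since the listing itself is provided as a figure, the following Python sketch reflects one possible reading of the described procedure (a medoid-style update on the precomputed distance matrix A); the initialization, tie-breaking and handling of empty clusters are our assumptions and are not specified in the text.

```python
import numpy as np

def kmeans_topology(A, n_clusters, max_iter=100, seed=None):
    """k-means variant operating on the precomputed N x N matrix A of proposed
    distances. Centers are always data points: each center is the member of its
    cluster with the minimum average distance to the other members; the loop
    stops when the set of centers remains unchanged between iterations."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(A.shape[0], size=n_clusters, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(A[:, centers], axis=1)   # assign each point to its nearest center
        new_centers = centers.copy()
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if members.size > 0:                    # keep the old center for an empty cluster
                within = A[np.ix_(members, members)].mean(axis=1)
                new_centers[c] = members[np.argmin(within)]
        if set(new_centers) == set(centers):        # centers unchanged in two consecutive iterations
            break
        centers = new_centers
    return labels, centers
```

The matrix produced by the earlier topology_preserving_distances sketch can be passed directly as A.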

Experimentation (see the Results section) shows that the proposed variant of the k-means method consistently produces its best results for values of d0 slightly smaller than the estimated \( \bar{d}_{\hbox{min} }^{i} \).

3 Results

The proposed k-means variant that uses the proposed distance metric is evaluated against the classic k-means, the kernel k-means (implemented as in [9]) and spectral clustering (as implemented in [10] and [12]) in terms of clustering purity, as well as V-measure. The proposed method has been executed 20 times on the Swiss roll dataset with random initializations, and the resulting average purity, as well as the standard deviation, are plotted for different values of d0 in Fig. 5. The same quantities achieved by the classic k-means clustering, the kernel k-means and the spectral clustering are also shown. It can be observed that the proposed algorithm clearly outperforms the classic and the kernel k-means. The behavior of the proposed algorithm with respect to parameter d0 is consistent with the estimated value of \( \bar{d}_{\hbox{min} }^{i} \): for values of d0 less than \( \bar{d}_{\hbox{min} }^{i} \), the achieved clustering purity is consistently high. For values of \( d_{0} > \bar{d}_{\hbox{min} }^{i} \), the proposed algorithm behaves very similarly to the classic k-means. This is expected since, as described above, the proposed distance definition then becomes similar to the Euclidean one.
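
For reference, the two evaluation metrics can be computed as in the following sketch (assuming scikit-learn; this is not necessarily the implementation used for the reported figures).

```python
from sklearn.metrics import v_measure_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Purity: assign each cluster to its majority true class and report the
    fraction of points that fall in the class assigned to their cluster."""
    C = contingency_matrix(labels_true, labels_pred)   # rows: true classes, columns: clusters
    return C.max(axis=0).sum() / C.sum()

# The V-measure is available directly as v_measure_score(labels_true, labels_pred).
```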

Fig. 5. The achieved purity and V-measure using the proposed method, applied to the Swiss roll dataset, for different values of d0, compared against the classic k-means, the kernel k-means [9] and the spectral clustering [10].

Figure 6 shows the same results for the COIL dataset. The determination of \( \bar{d}_{\hbox{min} }^{i} \), as shown in Fig. 6(a), is unambiguous. The behavior of the proposed k-means variant with respect to d0 is also very consistent, with the best performance occurring at values of d0 slightly smaller than the estimated \( \bar{d}_{\hbox{min} }^{i} \). The proposed k-means variant with the suggested distance metric outperforms the other methods in the comparison.

Fig. 6. The achieved purity and V-measure using the proposed method applied to the COIL dataset, for different values of d0, compared against the classic k-means and the kernel k-means [9]. The spectral clustering [10] produced worse results and was therefore not included in the graph.

Table 1 shows the clustering purity and V-measure achieved by the proposed method, k-means, kernel k-means and spectral clustering. The values for the proposed method were calculated using a value of d0 slightly less than the estimated \( \bar{d}_{\hbox{min} }^{i} \). The standard Matlab implementation was used for the k-means method. Kernel k-means was used as provided in [9]. Spectral clustering was used as provided in [10] and in [12], which implements the algorithm described in [13].

Table 1. Dataset description, with the achieved clustering purity and V-measure by the proposed method, classic and kernel k-means and spectral clustering.

4 Conclusions

A new distance metric for high-dimensional data has been presented that preserves the topology of the underlying manifold. A variant of the k-means clustering algorithm has been proposed that utilizes this metric. The value of the main parameter of the proposed distance metric can be obtained through a simple and efficient procedure. The performance of the proposed method has been analyzed theoretically and validated experimentally on a number of benchmark datasets. Comparative results with well-established clustering algorithms show that the proposed method is a competent alternative with consistent behavior that systematically performs equally well as or better than the other techniques under comparison. Future work includes the algorithmic fine-tuning of the proposed k-means variant and the extension of the distance metric to visualization and dimensionality-reduction techniques. The comparative results will also be expanded to include more optimized implementations of other state-of-the-art methods.