Abstract
In high-dimensional data analysis, data points often lie on manifolds with complex geometry. Thus, the usual Euclidean distance may lead to suboptimal results when used for clustering or visualization. In this work, we introduce a new distance definition for multi-dimensional spaces that preserves the topology of the manifold on which the data points lie. The parameters of the proposed distance are discussed and their physical meaning is explored through two- and three-dimensional synthetic datasets. A robust method for parameterizing the algorithm is suggested. Finally, a modification of the well-known k-means clustering algorithm is introduced to exploit the benefits of the proposed distance metric for data clustering. Comparative results against other established clustering algorithms are presented in terms of cluster purity and V-measure, for a number of well-known datasets.
1 Introduction
Clustering high-dimensional data is an area that has attracted considerable research interest over the past two decades [1, 3, 6]. The existence of irrelevant features and of correlations between subsets of features, both commonly encountered in such datasets, renders the task of identifying clusters much harder, as distances between observations become less informative about the cluster structure. Dimensionality reduction and feature embedding are widely used to improve clustering performance and to enable visualization of the resulting cluster structure. Although well-established methods such as Principal Component Analysis (PCA) and metric Multi-Dimensional Scaling (MDS) [2] have been successfully applied to a plethora of high-dimensional problems, there is no guarantee that the cluster structure of the high-dimensional space will be preserved in the low-dimensional subspace, since in many cases clusters are defined by highly non-linear structures. For this reason, nonlinear dimensionality reduction techniques have been explicitly designed to identify a lower-dimensional manifold along which the data lie, and are therefore appropriate for distinguishing nonlinearly separable clusters.
Kernel-based clustering is among the most popular methods for nonlinear clustering; it projects the input data points into a high-dimensional kernel space in order to make nonlinear clusters linearly separable [5]. In particular, kernel k-means combines the k-means method with the kernel trick in an attempt to deal with nonlinearly separable data; however, specifying a suitable kernel function and appropriate parameter values is often a hard task. Another widely used manifold learning method is isometric mapping (Isomap). Instead of using the Euclidean distance, Isomap is based on approximating geodesic distances along the manifold [7]. However, Isomap operates on neighboring data points defined by a Euclidean distance threshold, which accelerates the algorithm but causes problems in the presence of outliers. In [8] the authors applied k-means clustering after Isomap and proposed a modified definition of the geodesic distances, but concluded that even their modified method was unsatisfactory on real data where the data is noisy or the clusters are highly nonlinear.
In this work we propose a new topology-preserving distance that follows the geodesics of the underlying manifold. Instead of imposing a threshold on the distance between points, we construct a graph over all available points and impose a penalty function that penalizes distant point pairs. The definition of a distant pair relies on a characteristic distance parameter whose value is automatically estimated from the available dataset. Furthermore, we propose a modification of the k-means algorithm that incorporates the benefits of the introduced topology-preserving metric. A critical advantage of the proposed approach is its feasible and robust parameterization. Extensive experiments on both simulated and real datasets, employing the purity and V-measure metrics for comparison as described in [4], provide further evidence of the wide applicability of the proposed method.
2 Methodology
2.1 The Proposed Topology-Preserving Distance Metric
Let \( {\mathbf{P}} \) be a data matrix of dimensions N × K, each row of which is a feature vector (equivalently, a data point) \( {\mathbf{p}} = \left( {x_{1} ,x_{2} , \ldots ,x_{K} } \right) \) of dimensionality K. Any given set of such points may lie on an unknown manifold in \( \Re^{K} \). Thus, the Euclidean distance between any two points may not represent their actual distance along the manifold.
Let us define an auxiliary distance metric between any pair of data points as:

\( D_{ij} = \begin{cases} d_{ij} , & d_{ij} \le d_{0} \\ \lambda \,d_{ij} , & d_{ij} > d_{0} \end{cases} \quad (1) \)
where \( d_{ij} = \left\| {{\mathbf{p}}_{{\mathbf{i}}} - {\mathbf{p}}_{{\mathbf{j}}} } \right\| \) is the Euclidean distance between the two points, λ is a sufficiently large value and d0 is a characteristic distance, whose value is estimated from the current dataset, as described later. Let us stress that Dij is not the distance metric proposed in this work, but rather an auxiliary definition.
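Equation (1) amounts to a thresholded penalty on the pairwise Euclidean distances, which can be sketched in a few lines of numpy (an illustrative implementation, not the authors' code; the function name and the default value of λ are assumptions):

```python
import numpy as np

def auxiliary_distance_matrix(P, d0, lam=1e8):
    """Pairwise auxiliary distances of Eq. (1): the Euclidean distance d_ij
    when d_ij <= d0, and the penalized value lam * d_ij otherwise."""
    # Pairwise Euclidean distances between the rows of the N x K matrix P.
    diff = P[:, None, :] - P[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Edges longer than the characteristic length d0 receive the penalty lam.
    return np.where(d <= d0, d, lam * d)
```

Because λ is large, any edge longer than d0 becomes prohibitively expensive, so shortest paths computed on these costs strongly prefer chains of short, local steps.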
For a given set of data points \( P = \left\{ {{\mathbf{p}}_{{\mathbf{i}}} } \right\},i = 1,2, \ldots ,N \) in \( \Re^{K} \), a fully connected graph \( G = \left( {P,E} \right) \) is defined with vertices \( P \) as the set of all data points and edges E as the set of all possible connections between vertices, \( E = \left\{ {\left( {i,j} \right)} \right\},i,j = 1,2, \ldots ,N \). Thus, each point is connected to all other points in the dataset. The cost of the connection (edge) between any pair of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is set equal to their auxiliary distance Dij, as defined in Eq. (1). The proposed topology-preserving distance between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is defined as the cost of the minimum-cost path \( \pi_{\text{ij}} \) between the two points, according to the well-known Dijkstra's algorithm.
Since any generated path \( \pi_{\text{i,j}} \) between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) consists of an ordered series of data points with indices \( \left( {i_{1} ,i_{2} , \ldots ,i_{M} } \right) \), where \( i_{1} = i,\,i_{M} = j \), the proposed topology-preserving distance between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is calculated as

\( A_{ij} = \sum\limits_{m = 1}^{M - 1} {d_{i_{m} ,i_{m + 1} } } \quad (2) \)
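Combining Eqs. (1) and (2), the full N × N matrix of topology-preserving distances can be computed with a standard shortest-path routine. The sketch below uses `scipy.sparse.csgraph.dijkstra`; the paper does not specify its implementation, so this is an illustrative reconstruction, not the authors' code:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def topology_preserving_distances(P, d0, lam=1e8):
    """A_ij of Eq. (2): the Euclidean length of the minimum-cost path on the
    complete graph weighted with the auxiliary distances D_ij of Eq. (1)."""
    diff = P[:, None, :] - P[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    D = np.where(d <= d0, d, lam * d)
    # Minimum-cost paths under the penalized costs; predecessors recover them.
    _, pred = dijkstra(D, directed=False, return_predecessors=True)
    N = len(P)
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # Walk the predecessor chain from j back to i, summing the
            # Euclidean (unpenalized) lengths of the steps.
            k, total = j, 0.0
            while k != i:
                prev = pred[i, k]
                total += d[prev, k]
                k = prev
            A[i, j] = total
    return A
```

Since the graph is complete, each Dijkstra run costs O(N²) and the full matrix O(N³), which is affordable at the dataset sizes used here; larger datasets would call for edge pruning or sparse graphs.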
The parameter d0 is a characteristic length in \( \Re^{K} \) that defines the scale of local linearity in a given set of data points. It is self-evident that any two points with Euclidean distance less than or equal to d0 will be connected without any intermediate points. On the other hand, for any two points with Euclidean distance greater than d0, the proposed algorithm will generate a connecting path with intermediate points, provided that consecutive points along the path are no farther apart than d0.
Figure 1 shows the paths generated by Dijkstra's algorithm using the auxiliary distance Dij, in the case of data points generated using the Swiss roll dataset [11], for λ = 108, d0 = 2, using one randomly selected point i0 as the source for Dijkstra's algorithm (the point from which all other distances are calculated). The dataset is constructed to contain 2 classes (Nc = 2) of 400 points each, denoted by different colors. The points lie on a manifold that is defined by a parametric equation. The paths \( \pi_{{{\text{i}}_{ 0} , {\text{j}}}} \) are also plotted as blue lines for all points j = 1,2,…, 800, j ≠ i0. As can be observed, the selected value of d0 generates shortest paths that lie on the manifold, rather than crossing the gap as would be dictated by the Euclidean distance. The use of the proposed distance metric in any clustering or classification process on this dataset is therefore expected to significantly increase the achieved accuracy.
2.2 Determining the Value of d0 Parameter
The parameter d0 is very important for the efficient operation of the algorithm. A low value of d0 would cause \( D_{ij} = \lambda d_{ij} \) for almost all pairs of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \). Consequently, Dijkstra's algorithm would return the single edge from \( {\mathbf{p}}_{{\mathbf{i}}} \) to \( {\mathbf{p}}_{{\mathbf{j}}} \), rather than generating a path with intermediate points; Eq. (2) would then reduce to \( A_{ij} = D_{ij} \), so the proposed metric would be equivalent to the Euclidean one (up to the constant factor λ). On the other hand, a high value of d0 would cause \( D_{ij} = d_{ij} \) for almost all pairs of points, again resulting in single-edge paths, just as described previously.
Let i0 be the index of a randomly selected point. The proposed algorithm is executed N−1 times, connecting i0 with all the remaining N−1 points \( {\mathbf{p}}_{{\mathbf{j}}} \), calculating the distance \( A_{{i_{0} j}} \) and generating the corresponding paths \( \pi_{{i_{0} j}} \). Let us denote the sequence of points that constitute the path from i0 to j as \( \left( {i_{1} ,i_{2} , \ldots ,i_{M} } \right) \) with \( i_{1} = i_{0} ,\,i_{M} = j \), and the series of Euclidean distances \( \left\{ {d_{{i_{m} ,i_{m + 1} }} } \right\},m = 1,2, \ldots ,M - 1 \). To simplify notation, let us write \( d_{{i_{m} ,i_{m + 1} }} = d_{m,m + 1} \). Let \( d_{{i_{0} j}}^{\hbox{max} } = \hbox{max} \left\{ {d_{m,m + 1} } \right\},m = 1,2, \ldots ,M - 1 \) be the maximum Euclidean length of the steps in the path from i0 to j. Then the average of \( d_{{i_{0} j}}^{\hbox{max} } \) can be calculated for the selected point i0 over all other points j in the dataset:

\( \left\langle {d_{i} } \right\rangle = \frac{1}{N - 1}\sum\limits_{j \ne i_{0} } {d_{{i_{0} j}}^{\hbox{max} } } \quad (3) \)
By definition, \( \left\langle {d_{i} } \right\rangle \) is calculated for a selected point i0 and is a function of d0. It is easily shown that when d0 takes very low values, below the minimum Euclidean distance dmin in the dataset, \( \left\langle {d_{i} } \right\rangle \) is equal to the mean Euclidean distance between i0 and all other data points. In the case of data points being equally distributed (e.g. on a regular grid), \( \left\langle {d_{i} } \right\rangle \) is expected to increase monotonically for d0 in [dmin, dmax]; when d0 approaches dmax, it again becomes equal to the mean Euclidean distance between i0 and all data points, and remains constant for larger values of d0. In the case of an anisotropic data point distribution, however, \( \left\langle {d_{i} } \right\rangle \) drops sharply when d0 takes an appropriate intermediate value, since the proposed algorithm then generates connecting paths consisting of steps with smaller Euclidean lengths; as d0 approaches dmax, \( \left\langle {d_{i} } \right\rangle \) again becomes equal to the mean Euclidean distance. Thus, for any point i in the dataset, the quantity \( \left\langle {d_{i} } \right\rangle \) is calculated for different values of \( d_{0} \), and the value \( d_{0} = d_{\hbox{min} }^{i} \) that minimizes it is determined.
In the special case of data points lying on a manifold with large-scale concavities, \( d_{\hbox{min} }^{i} \) indicates the characteristic length of the concavities. Thus, setting d0 to a value less than \( d_{\hbox{min} }^{i} \) will cause the proposed algorithm to produce connecting paths between data points that do not cross the concavities but lie on the manifold, thus behaving like geodesic curves. In order to obtain a good estimate of \( d_{\hbox{min} }^{i} \), the calculation is repeated for several randomly selected points in the dataset and the average \( \bar{d}_{\hbox{min} }^{i} \) is obtained.
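The estimation procedure described above — sweep candidate values of d0, average the maximum Euclidean step \( d_{{i_{0} j}}^{\hbox{max} } \) over shortest paths from a few random source points, and keep the d0 that minimizes the average — can be sketched as follows. This is a reconstruction under stated assumptions: the candidate grid, the number of source points and λ are user choices, and `scipy` is used for the shortest paths:

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def estimate_d0(P, d0_grid, lam=1e8, n_sources=10, seed=0):
    """For each candidate d0, compute <d_i>: the mean, over shortest paths
    from a few random source points, of the maximum Euclidean step along
    each path (Eq. 3). Return the candidate minimizing <d_i> and the curve."""
    rng = np.random.default_rng(seed)
    diff = P[:, None, :] - P[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    N = len(P)
    sources = rng.choice(N, size=min(n_sources, N), replace=False)
    curve = []
    for d0 in d0_grid:
        # Auxiliary costs of Eq. (1): penalize edges longer than d0.
        D = np.where(d <= d0, d, lam * d)
        _, pred = dijkstra(D, directed=False, indices=sources,
                           return_predecessors=True)
        vals = []
        for s, i0 in enumerate(sources):
            for j in range(N):
                if j == i0:
                    continue
                # Maximum Euclidean step along the min-cost path from i0 to j.
                k, dmax = j, 0.0
                while k != i0:
                    prev = pred[s, k]
                    dmax = max(dmax, d[prev, k])
                    k = prev
                vals.append(dmax)
        curve.append(np.mean(vals))
    curve = np.asarray(curve)
    return float(d0_grid[int(np.argmin(curve))]), curve
```

Following the text, the grid bounds can be taken as the minimum and maximum pairwise Euclidean distances in the dataset; for anisotropic data the curve exhibits the sharp intermediate minimum described above.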
Figure 2 shows \( \left\langle {d_{i} } \right\rangle \) calculated for 50 randomly selected points in a two-dimensional dataset, for d0 in [dmin, dmax]. Two different datasets are used to demonstrate the aforementioned behavior: in the first, 2601 points are placed on a regular grid on the plane (a), and in the second, 2601 points are placed randomly on the plane (following a uniform distribution). In both cases the [0, 10] × [0, 10] region of the plane is used. \( \left\langle {d_{i} } \right\rangle \) is plotted against the values of d0; the average \( \bar{d}_{\hbox{min} }^{i} \) (over all 50 points) is marked with a green circle, and the average curve is overlaid as a thick line. The aforementioned behavior is obvious in both datasets, and the consistency across the different points is evident.
The same process is applied to the Swiss Roll dataset [11] using 800 points in 2 classes. The quantity \( \left\langle {d_{i} } \right\rangle \) is plotted for d0 in [dmin, dmax] = [0.0072, 48.4663]. The average \( \bar{d}_{\hbox{min} }^{i} \) (over all 40 points) was found equal to 3.6431 (Fig. 3).
The effect of d0 is demonstrated in Fig. 4. One point is randomly selected from the dataset and the minimum-cost connections to all other points are shown. The cost of each connecting path is calculated according to Eq. (2), as a function of d0. It can be observed that for very low or very high values of d0 the connecting paths cross the topological gap between points, whereas for intermediate values the paths follow the manifold, like geodesic curves.
2.3 A k-Means Variant for the Proposed Topology-Preserving Distance Metric
In this subsection, a variant of the k-means clustering algorithm is proposed that utilizes the proposed topology-preserving distance metric. The main differences from the classic k-means algorithm can be summarized as follows:
The N × N distance matrix is calculated using the proposed distance metric in Eq. (2) (it requires the characteristic length d0): \( {\mathbf{A}} = \left\{ {A\left( {{\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} } \right)} \right\} = \left\{ {A_{ij} } \right\},i,j = 1,2, \ldots ,N \).
The class centers in each iteration are selected from the data points, so that they minimize the average (proposed metric) distance from the members of the specific class. The algorithm is terminated when all class centers remain unchanged in two consecutive iterations. The details of the proposed algorithm are given below.
Input: the data matrix \( P \), the number of classes \( N_{c} \), and the N × N distance matrix A computed using the proposed metric.
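Based on the description above, the variant can be sketched as a k-medoids-style loop over the precomputed matrix A (a reconstruction from the text, not the authors' Matlab code; initialization and tie-breaking are illustrative choices):

```python
import numpy as np

def kmeans_topology(A, n_classes, seed=0, max_iter=100):
    """k-means variant on a precomputed N x N distance matrix A (the proposed
    metric): each center is a data point (a medoid) minimizing the mean
    distance to its class members; stops when the centers stop changing."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    centers = rng.choice(N, size=n_classes, replace=False)
    labels = np.zeros(N, dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center under the metric A.
        labels = np.argmin(A[:, centers], axis=1)
        new_centers = centers.copy()
        for c in range(n_classes):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old center for an empty class
            # Update step: the member with minimal mean distance to its class.
            sub = A[np.ix_(members, members)]
            new_centers[c] = members[np.argmin(sub.mean(axis=1))]
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The matrix A is computed once with the proposed metric (Eq. 2), so each iteration only indexes into it; constraining the centers to be data points is the main departure from the classic k-means, whose centroids need not coincide with observations.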
Experimentation shows (see the Results section) that the proposed variant of the k-means method consistently produces its best results for values of d0 slightly smaller than the estimated \( \bar{d}_{\hbox{min} }^{i} \).
3 Results
The proposed k-means variant that uses the proposed distance metric is evaluated against the classic k-means, the kernel k-means (implemented as in [9]) and spectral clustering (implemented in [10] and [12]), in terms of clustering purity as well as V-measure. The proposed method was executed 20 times with random initialization, and the resulting average purity, as well as its standard deviation, are plotted for different values of d0 in Fig. 5. The same quantities achieved by the classic k-means, the kernel k-means and spectral clustering are also shown. It can be observed that the proposed algorithm clearly outperforms the classic and the kernel k-means. The behavior of the proposed algorithm with respect to the parameter d0 is consistent with the estimated value of \( \bar{d}_{\hbox{min} }^{i} \): for values of d0 less than \( \bar{d}_{\hbox{min} }^{i} \), the achieved clustering purity is consistently high, whereas for values of \( d_{0} > \bar{d}_{\hbox{min} }^{i} \) the proposed algorithm behaves very similarly to the classic k-means. This is expected since, as described above, the proposed distance definition then becomes similar to the Euclidean one.
Figure 6 shows the same results for the COIL dataset. The determination of \( \bar{d}_{\hbox{min} }^{i} \) as shown in Fig. 6(a) is unambiguous. The behavior of the proposed k-means variant with respect to d0, is also very consistent, with best performance occurring at values of d0 slightly smaller than the estimated \( \bar{d}_{\hbox{min} }^{i} \). The proposed k-means variant with the suggested distance metric outperforms the other methods in comparison.
Table 1 shows the clustering purity and V-measure achieved by the proposed method, k-means, kernel k-means and spectral clustering. The values for the proposed method were calculated using a value of d0 slightly less than the estimated \( \bar{d}_{\hbox{min} }^{i} \). The standard Matlab implementation was used for k-means. Kernel k-means was used as provided in [9]. Spectral clustering was used as provided in [10] and in [12], which implements the algorithm described in [13].
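For completeness, the two evaluation metrics used throughout this section can be computed from their standard definitions with numpy (illustrative code; the reported experiments may have used different implementations):

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Fraction of points belonging to the majority true class of their cluster."""
    lt = np.asarray(labels_true)
    lp = np.asarray(labels_pred)
    total = 0
    for c in np.unique(lp):
        members = lt[lp == c]
        # Count of the most frequent true class inside cluster c.
        total += np.unique(members, return_counts=True)[1].max()
    return total / len(lt)

def v_measure(labels_true, labels_pred):
    """V-measure: harmonic mean of homogeneity and completeness
    (entropy-based, following the standard definitions)."""
    lt = np.asarray(labels_true)
    lp = np.asarray(labels_pred)
    n = len(lt)
    # Contingency table: rows are true classes, columns are predicted clusters.
    _, ci = np.unique(lt, return_inverse=True)
    _, ki = np.unique(lp, return_inverse=True)
    cont = np.zeros((ci.max() + 1, ki.max() + 1))
    np.add.at(cont, (ci, ki), 1.0)
    p = cont / n
    pc, pk = p.sum(axis=1), p.sum(axis=0)  # class / cluster marginals

    def entropy(q):
        q = q[q > 0]
        return float(-(q * np.log(q)).sum())

    mask = p > 0
    # Conditional entropies H(C|K) and H(K|C) from the joint distribution.
    h_ck = float(-(p[mask] * np.log((p / pk[None, :])[mask])).sum())
    h_kc = float(-(p[mask] * np.log((p / pc[:, None])[mask])).sum())
    h_c, h_k = entropy(pc), entropy(pk)
    hom = 1.0 if h_c == 0 else 1.0 - h_ck / h_c
    com = 1.0 if h_k == 0 else 1.0 - h_kc / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
```

Purity rewards homogeneous clusters but is trivially maximized by many tiny clusters, which is why V-measure, which also penalizes fragmentation through its completeness term, is reported alongside it.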
4 Conclusions
A new distance metric for high-dimensional data has been presented that preserves the topology of the underlying manifold. A variant of the k-means clustering algorithm has been suggested that utilizes this metric. The value of the main parameter of the proposed distance metric can be obtained with a standard and efficient process. The performance of the proposed method has been analyzed theoretically and validated experimentally on a number of benchmark datasets. Comparative results with well-established clustering algorithms show that the proposed method is a competent alternative with consistent behavior that systematically performs as well as, or better than, the other techniques under comparison. Future work includes algorithmic fine-tuning of the proposed k-means variant and extending the application of the distance metric to visualization and dimensionality reduction techniques. The comparative results will also be expanded to include more optimized implementations of other state-of-the-art methods.
References
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Disc. 11(1), 5–33 (2005)
Cox, M.A., Cox, T.F.: Multidimensional scaling. In: Chen, C.-H., Härdle, W., Unwin, A. (eds.) Handbook of Data Visualization. Springer Handbooks Comp.Statistics, pp. 315–347. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-33037-0_14
Kriegel, H.-P., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 3(1), 1–58 (2009)
Pavlidis, N.G., Hofmeyr, D.P., Tasoulis, S.K.: Minimum density hyperplanes. J. Mach. Learn. Res. 17(156), 1–33 (2016)
Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5), 1299–1319 (1998)
Tasoulis, S.K., Tasoulis, D.K., Plagianakos, V.P.: Enhancing principal direction divisive clustering. Pattern Recogn. 43(10), 3391–3411 (2010)
Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
Yu, H., Zhang, X., Yang, Y., Zhao, X., Cai, L.: An extended ISOMAP by enhancing similarity for clustering. In: Jiang, H., Ding, W., Ali, M., Wu, X. (eds.) IEA/AIE 2012. LNCS (LNAI), vol. 7345, pp. 808–815. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31087-4_81
Gonen, M., Margolin, A.A.: Localized data fusion for kernel k-means clustering with application to cancer biology. In: Advances in Neural Information Processing Systems 27 (NIPS 2014), Montréal, Québec, Canada (2014)
Mathworks. https://www.mathworks.com/matlabcentral/fileexchange/46733-spectral-clustering. Accessed 2 Mar 2019
Swiss roll dataset. http://people.cs.uchicago.edu/~dinoj/manifold/swissroll.html
Mathworks. https://www.mathworks.com/matlabcentral/fileexchange/34412-fast-and-efficient-spectral-clustering
von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Acknowledgement
The author wishes to acknowledge the partial support by the Interdepartmental Postgraduate Program “Computer Science and Computational Biomedicine” of the School of Science of the University of Thessaly.
© 2019 IFIP International Federation for Information Processing
Delibasis, K.K. (2019). A New Topology-Preserving Distance Metric with Applications to Multi-dimensional Data Clustering. In: MacIntyre, J., Maglogiannis, I., Iliadis, L., Pimenidis, E. (eds) Artificial Intelligence Applications and Innovations. AIAI 2019. IFIP Advances in Information and Communication Technology, vol 559. Springer, Cham. https://doi.org/10.1007/978-3-030-19823-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19822-0
Online ISBN: 978-3-030-19823-7