
1 Introduction

Clustering high-dimensional data is an area that has attracted considerable research interest over the past two decades [1, 3, 6]. The existence of irrelevant features and of correlations between subsets of features, which are commonly encountered in such datasets, renders the task of identifying clusters much harder, as distances between observations become less informative about the cluster structure. Dimensionality reduction and feature embedding are widely used to improve clustering performance and to enable the visualization of the resulting cluster structure in such data. Although well-established methods like Principal Component Analysis (PCA) and metric Multi-Dimensional Scaling (MDS) [2] have been successfully applied to a plethora of high-dimensional applications, there is no guarantee that the cluster structure of the high-dimensional space will be preserved in the low-dimensional subspace, since in many cases clusters are defined by highly nonlinear structures. For this purpose, nonlinear dimensionality reduction techniques have been explicitly designed to identify a lower-dimensional manifold along which the data lie, and are therefore appropriate for distinguishing nonlinearly separable clusters.

Kernel-based clustering is amongst the most popular approaches to nonlinear clustering; it projects the input data points into a high-dimensional kernel space in order to make nonlinear clusters linearly separable [5]. In particular, kernel k-means combines the k-means method with the kernel trick in an attempt to deal with nonlinearly separable data; however, specifying a suitable kernel function and appropriate parameters is often a hard task. Another widely used manifold learning method is isometric mapping (Isomap). Instead of using the Euclidean distance, Isomap is based on approximating geodesic distances along the manifold [7]. However, Isomap operates on neighboring data points defined by a Euclidean distance threshold, which accelerates the algorithm but causes problems in the case of outlier points. In [8] the authors applied the k-means clustering algorithm after Isomap and proposed a modified definition of the geodesic distances, but concluded that even their modified method was unsatisfactory in real-data cases where the data are noisy or the clusters are highly nonlinear.

In this work we propose a new topology-preserving distance that follows the geodesics of the underlying manifold. Instead of imposing a threshold on the distance between points, we construct a graph over all available points and impose a penalty function that penalizes distant points. The definition of distant points relies on a characteristic distance parameter whose value is automatically estimated from the available dataset. Furthermore, we propose a modification of the k-means algorithm that incorporates the benefits of the newly introduced distance metric, which preserves the topology of the data-point manifold. A critical advantage of the proposed approach is its robust and largely automatic parameterization. Extensive experiments on both simulated and real datasets, employing the purity and V-measure metrics for comparison as described in [4], provide further evidence of the wide applicability of the proposed method.

2 Methodology

2.1 The Proposed Topology-Preserving Distance Metric

Let \( {\mathbf{P}} \) be a data matrix of dimensions N × K, each row of which is a feature vector (equivalently a data point) \( {\mathbf{p}} = \left( {x_{1} ,x_{2} , \ldots ,x_{K} } \right) \) of dimensionality K. Any given set of such points may be arranged on an unknown manifold in the \( \Re^{K} \) space. Thus, the Euclidean distance metric between any two points may not represent their actual distance.

Let us define an auxiliary distance metric between any pair of data points as:

$$ D_{ij} = \left\{ {\begin{array}{*{20}c} {\lambda d_{ij} ,d_{ij} > d_{0} } \\ {d_{ij} ,d_{ij} \le d_{0} } \\ \end{array} } \right. $$
(1)

where \( d_{ij} = \left\| {{\mathbf{p}}_{{\mathbf{i}}} - {\mathbf{p}}_{{\mathbf{j}}} } \right\| \) is the Euclidean distance between the two points, λ is a sufficiently large value and d0 is a characteristic distance whose value is estimated from the current dataset, as will be described later. Let us stress that Dij is not the distance metric proposed in this work, but rather an auxiliary definition.

For a given set of data points \( P = \left\{ {{\mathbf{p}}_{{\mathbf{i}}} } \right\},i = 1,2, \ldots ,N \) in \( \Re^{K} \), a fully connected graph \( G = \left( {P,E} \right) \) is defined, with vertices \( P \) the set of all data points and edges E the set of all possible connections between vertices, \( E = \left\{ {\left( {i,j} \right)} \right\},i,j = 1,2, \ldots ,N \). Thus, each point is connected to all other points in the dataset. The cost of the connection (edge) between any pair of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is set equal to their auxiliary distance Dij, as defined in Eq. (1). The proposed topology-preserving distance between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is defined as the cost of the minimum-cost path \( \pi_{\text{ij}} \) between the two points, computed with the well-known Dijkstra's algorithm.

$$ A\left( {{\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} } \right) = A_{ij} = \text{cost}\left( {\pi_{\text{ij}} } \right),i,j = 1,2, \ldots ,N $$
(2)

Since any generated path \( \pi_{\text{i,j}} \) between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) consists of an ordered series of data points with indices \( \left( {i_{1} ,i_{2} , \ldots ,i_{M} } \right) \), where \( i_{1} = i,\,i_{M} = j \), the proposed topology-preserving distance between \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \) is calculated as

$$ A_{i,j} = {\text{cost}}(\pi_{\text{i,j}} ) = \sum\limits_{m = 1}^{M - 1} {D_{m,m + 1} ,i = i_{1} ,j = i_{M} } $$
(3)

The parameter d0 is a characteristic length in \( \Re^{K} \) that defines the scale of local linearity in a given set of data points. It is self-evident that any two points with Euclidean distance less than or equal to d0 will be connected directly, without any intermediate points. On the other hand, for any two points with Euclidean distance greater than d0, the proposed algorithm will generate a connecting path through intermediate points, provided that a chain of intermediate points exists in which consecutive points are no more than d0 apart.
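
To make the construction concrete, the following Python sketch (assuming NumPy and SciPy; the function name and the default λ = 10^8 are our choices, the latter matching the value used for Fig. 1) computes the full matrix of topology-preserving distances of Eqs. (2)-(3).

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

def topology_preserving_distances(P, d0, lam=1e8):
    """All-pairs distances A_ij of Eqs. (2)-(3): shortest-path costs over the
    fully connected graph whose edge weights are the auxiliary D_ij of Eq. (1)."""
    d = squareform(pdist(P))           # Euclidean distances d_ij, shape (N, N)
    D = np.where(d > d0, lam * d, d)   # penalize edges longer than d0
    # Shortest-path costs over the dense (fully connected) graph.
    return dijkstra(D, directed=False)
```

The penalty λ effectively removes long edges from consideration while keeping the graph formally fully connected, so no hard distance threshold has to be imposed on the graph construction.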

Figure 1 shows the paths generated by Dijkstra's algorithm using the auxiliary distance Dij, in the case of data points generated from the Swiss roll dataset [11], for λ = 10^8 and d0 = 2, using one randomly selected point i0 as the source for Dijkstra's algorithm (the point from which all other distances are calculated). The dataset is constructed to contain 2 classes (Nc = 2) of 400 points each, denoted by different colors. The points lie on a manifold that is defined by a parametric equation. The paths \( \pi_{{{\text{i}}_{ 0} , {\text{j}}}} \) are also plotted as blue lines for all points j = 1, 2, …, 800, j ≠ i0. As can be observed, the selected value of d0 generates shortest paths that lie on the manifold, rather than crossing the gap, as would be dictated by the Euclidean distance. The use of the proposed distance metric in any clustering or classification process on this dataset is therefore expected to significantly increase the achieved accuracy.

Fig. 1. The paths generated by the proposed distance definition in the case of the Swiss Roll dataset, for d0 = 2. The points lie on a manifold and consist of two classes shown in green and red color. (Color figure online)

2.2 Determining the Value of d0 Parameter

The parameter d0 is very important for the efficient operation of the algorithm. A low value of d0 would cause \( D_{ij} = \lambda d_{ij} \) for almost all pairs of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \). Consequently, Dijkstra's algorithm would return the single edge from \( {\mathbf{p}}_{{\mathbf{i}}} \) to \( {\mathbf{p}}_{{\mathbf{j}}} \), rather than generating a path with intermediate points. Equation (3) would then reduce to \( A_{ij} = D_{ij} \), so the proposed metric would be equivalent (up to the constant λ) to the Euclidean one. On the other hand, a high value of d0 would cause \( D_{ij} = d_{ij} \) for almost all pairs of points \( {\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} \), also resulting in a single-edge path, just as described previously.

Let i0 be the index of a randomly selected point. The proposed algorithm is executed N − 1 times, connecting i0 with each of the remaining N − 1 points \( {\mathbf{p}}_{{\mathbf{j}}} \), calculating the distance \( A_{{i_{0} j}} \) and generating the corresponding path \( \pi_{{i_{0} j}} \). Let us denote the sequence of points that constitute the path from i0 to j as \( \left( {i_{1} ,i_{2} , \ldots ,i_{M} } \right) \), with \( i_{1} = i_{0} ,\,i_{M} = j \), and the series of Euclidean distances along its steps as \( \left\{ {d_{{i_{m} ,i_{m + 1} }} } \right\},m = 1,2, \ldots ,M - 1 \). To simplify notation, let us write \( d_{{i_{m} ,i_{m + 1} }} = d_{m,m + 1} \). Let \( d_{{i_{0} j}}^{\hbox{max} } = \hbox{max} \left\{ {d_{m,m + 1} } \right\},m = 1,2, \ldots ,M - 1 \), be the maximum Euclidean length of the steps in the path from i0 to j. Then the average of \( d_{{i_{0} j}}^{\hbox{max} } \) can be calculated for the selected point i0 over all other points j in the dataset:

$$ \left\langle {d_{i} } \right\rangle = \frac{1}{N - 1}\sum\limits_{\begin{subarray}{l} j = 1, \\ j \ne i \end{subarray} }^{N} {d_{i,j}^{\hbox{max} } } $$
(4)

By its definition, \( \left\langle {d_{i} } \right\rangle \) is calculated for a selected point i0 and is a function of d0. It is easily shown that when d0 takes very low values, below the minimum Euclidean distance dmin in the dataset, \( \left\langle {d_{i} } \right\rangle \) is equal to the mean Euclidean distance between i0 and all data points. In the case of data points that are evenly distributed (e.g. on a regular grid), the quantity \( \left\langle {d_{i} } \right\rangle \) is expected to increase monotonically as d0 increases in [dmin, dmax]. When d0 approaches dmax, \( \left\langle {d_{i} } \right\rangle \) becomes equal to the mean Euclidean distance between i0 and all data points and remains constant for larger values of d0. In the case of an anisotropic data-point distribution, however, \( \left\langle {d_{i} } \right\rangle \) drops sharply when d0 takes an appropriate intermediate value, since the proposed algorithm then generates connecting paths between data points that consist of steps with smaller distances. Finally, when d0 approaches the maximum Euclidean distance in the dataset dmax, \( \left\langle {d_{i} } \right\rangle \) again becomes equal to the mean Euclidean distance between i0 and all data points. Thus, for any point i in the dataset, the quantity \( \left\langle {d_{i} } \right\rangle \) is calculated for different values of \( d_{0} \), and the value of \( d_{0} = d_{\hbox{min} }^{i} \) that produces the minimal value is determined:

$$ d_{\hbox{min} }^{i} = \mathop {\arg \hbox{min} }\limits_{{d_{0} }} \left\langle {d_{i} } \right\rangle ,\,d_{0} \in \left[ {d_{\hbox{min} } ,d_{\hbox{max} } } \right] $$
(5)

In the special case of data points lying on a manifold with large-scale concavities, \( d_{\hbox{min} }^{i} \) takes a value that indicates the characteristic length of the concavities. Thus, setting d0 to a value less than \( d_{\hbox{min} }^{i} \) will cause the proposed algorithm to produce connecting paths between data points that do not cross the concavities but lie on the manifold, thus behaving like geodesic curves. In order to obtain a good estimate of \( d_{\hbox{min} }^{i} \), the calculation is repeated for many randomly selected points in the dataset and the average \( \bar{d}_{\hbox{min} }^{i} \) is obtained.
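
A possible implementation of this estimation procedure is sketched below (again assuming NumPy/SciPy; the function names, the grid resolution and the default λ are our choices and are not specified in the text).

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import pdist, squareform

def mean_max_step(d, i0, d0, lam=1e8):
    """<d_i> of Eq. (4) for source point i0: the maximum Euclidean step of each
    shortest path from i0, averaged over all destinations j != i0."""
    D = np.where(d > d0, lam * d, d)   # auxiliary distances of Eq. (1)
    _, pred = dijkstra(D, directed=False, indices=i0, return_predecessors=True)
    max_steps = []
    for j in range(d.shape[0]):
        if j == i0:
            continue
        step_max, node = 0.0, j
        while pred[node] >= 0:         # walk the predecessor chain back to i0
            step_max = max(step_max, d[node, pred[node]])
            node = pred[node]
        max_steps.append(step_max)
    return float(np.mean(max_steps))

def estimate_d0(P, i0, n_grid=50):
    """Sweep d0 over [d_min, d_max] and return d_min^i, the minimizer of <d_i> (Eq. (5))."""
    d = squareform(pdist(P))
    grid = np.linspace(d[d > 0].min(), d.max(), n_grid)
    scores = [mean_max_step(d, i0, d0) for d0 in grid]
    return grid[int(np.argmin(scores))]

# As described above, the estimate is averaged over several randomly selected
# source points i0 to obtain the final characteristic distance.
```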

Figure 2 shows \( \left\langle {d_{i} } \right\rangle \) calculated for 50 randomly selected points of a two-dimensional dataset, for d0 in [dmin, dmax]. Two different datasets are used to demonstrate the aforementioned behavior: (a) 2601 points placed on a regular grid on the plane and (b) 2601 points placed randomly on the plane, following a uniform distribution. In both cases the [0, 10] × [0, 10] part of the plane is used. \( \left\langle {d_{i} } \right\rangle \) is plotted against the values of d0. The average \( \bar{d}_{\hbox{min} }^{i} \) (over all 50 points) is plotted as a green circle, while the average curve is overlaid as a thick line. The aforementioned behavior is apparent in both datasets, as is the consistency across the different points.

Fig. 2. The \( \left\langle {d_{i} } \right\rangle \) as a function of d0, for a number of data points, and the determined \( \bar{d}_{\hbox{min} }^{i} \) plotted as a green circle, for data points (a) on a 2D regular grid and (b) randomly distributed. (Color figure online)

The same process is applied to the Swiss Roll dataset [11] using 800 points in 2 classes. The quantity \( \left\langle {d_{i} } \right\rangle \) is plotted for d0 in [dmin, dmax] = [0.0072, 48.4663]. The average \( \bar{d}_{\hbox{min} }^{i} \) (over all 40 selected points) was found to be 3.6431 (Fig. 3).

Fig. 3. The shape of \( \left\langle {d_{i} } \right\rangle \) as a function of d0, for the Swiss roll dataset. The estimated \( \bar{d}_{\hbox{min} }^{i} \) is plotted as a green circle. (Color figure online)

Fig. 4. The minimum-cost paths generated by the proposed algorithm applied to the Swiss roll dataset, for different values of d0. Intermediate values of d0, as suggested by the estimation of the average \( \bar{d}_{\hbox{min} }^{i} \), produce paths that follow the underlying manifold.

The effect of d0 is demonstrated in Fig. 4. One point is randomly selected from the dataset and the minimum cost connections with all other points are shown. The cost of the connecting path is calculated according to Eq. (3), as a function of d0. It can be observed that for very low or very high values of d0, the connecting paths cross the topological gap between points, whereas for intermediate values, the paths follow the manifold, as geodesic curves.

2.3 A k-Means Variant for the Proposed Topology-Preserving Distance Metric

In this subsection a variant of the k-means clustering algorithm is proposed that utilizes the proposed topology-preserving distance metric. The main differences from the classic k-means algorithm can be summarized as follows:

The N × N distance matrix is calculated using the proposed distance metric in Eq. (2) (it requires the characteristic length d0): \( {\mathbf{A}} = \left\{ {A\left( {{\mathbf{p}}_{{\mathbf{i}}} ,{\mathbf{p}}_{{\mathbf{j}}} } \right)} \right\} = \left\{ {A_{ij} } \right\},i,j = 1,2, \ldots ,N \).

The class centers in each iteration are selected from the data points so that each center minimizes the average distance (in the proposed metric) from the members of its class. The algorithm is terminated when all class centers remain unchanged in two consecutive iterations. The details of the proposed algorithm are given below.

Input: the data matrix \( P \), the number of classes \( N_{c} \), the N × N distance matrix A using the proposed metric.

(Algorithm listing given in figure a.)
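
Since the listing itself is provided as a figure, the following Python sketch reflects one possible reading of the described procedure (a medoid-style update on the precomputed distance matrix A); the initialization, tie-breaking and handling of empty clusters are our assumptions and are not specified in the text.

```python
import numpy as np

def kmeans_topology(A, n_clusters, max_iter=100, seed=None):
    """k-means variant operating on the precomputed N x N matrix A of proposed
    distances. Centers are always data points: each center is the member of its
    cluster with the minimum average distance to the other members; the loop
    stops when the set of centers remains unchanged between iterations."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(A.shape[0], size=n_clusters, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(A[:, centers], axis=1)   # assign each point to its nearest center
        new_centers = centers.copy()
        for c in range(n_clusters):
            members = np.flatnonzero(labels == c)
            if members.size > 0:                    # keep the old center for an empty cluster
                within = A[np.ix_(members, members)].mean(axis=1)
                new_centers[c] = members[np.argmin(within)]
        if set(new_centers) == set(centers):        # centers unchanged in two consecutive iterations
            break
        centers = new_centers
    return labels, centers
```

The matrix produced by the earlier topology_preserving_distances sketch can be passed directly as A.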

Experimentation (see the Results section) shows that the proposed variant of the k-means method consistently produces its best results for values of d0 slightly smaller than the estimated \( \bar{d}_{\hbox{min} }^{i} \).

3 Results

The proposed k-means variant that uses the proposed distance metric is evaluated against the classic k-means, the kernel k-means (implemented as in [9]) and spectral clustering (as implemented in [10] and [12]) in terms of clustering purity, as well as V-measure. The proposed method has been executed 20 times on the Swiss roll dataset with random initializations, and the resulting average purity, as well as the standard deviation, are plotted for different values of d0 in Fig. 5. The same quantities achieved by the classic k-means clustering, the kernel k-means and the spectral clustering are also shown. It can be observed that the proposed algorithm clearly outperforms the classic and the kernel k-means. The behavior of the proposed algorithm with respect to parameter d0 is consistent with the estimated value of \( \bar{d}_{\hbox{min} }^{i} \): for values of d0 less than \( \bar{d}_{\hbox{min} }^{i} \), the achieved clustering purity is consistently high. For values of \( d_{0} > \bar{d}_{\hbox{min} }^{i} \), the proposed algorithm behaves very similarly to the classic k-means. This is expected since, as described above, the proposed distance definition then becomes similar to the Euclidean one.
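
For reference, the two evaluation metrics can be computed as in the following sketch (assuming scikit-learn; this is not necessarily the implementation used for the reported figures).

```python
from sklearn.metrics import v_measure_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Purity: assign each cluster to its majority true class and report the
    fraction of points that fall in the class assigned to their cluster."""
    C = contingency_matrix(labels_true, labels_pred)   # rows: true classes, columns: clusters
    return C.max(axis=0).sum() / C.sum()

# The V-measure is available directly as v_measure_score(labels_true, labels_pred).
```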

Fig. 5. The achieved purity and V-measure using the proposed method, applied to the Swiss roll dataset, for different values of d0, compared against the classic k-means, the kernel k-means [9] and the spectral clustering [10].

Figure 6 shows the same results for the COIL dataset. The determination of \( \bar{d}_{\hbox{min} }^{i} \), as shown in Fig. 6(a), is unambiguous. The behavior of the proposed k-means variant with respect to d0 is also very consistent, with the best performance occurring at values of d0 slightly smaller than the estimated \( \bar{d}_{\hbox{min} }^{i} \). The proposed k-means variant with the suggested distance metric outperforms the other methods in the comparison.

Fig. 6. The achieved purity and V-measure using the proposed method applied to the COIL dataset, for different values of d0, compared against the classic k-means and the kernel k-means [9]. The spectral clustering [10] produced worse results and was therefore not included in the graph.

Table 1 shows the clustering purity and V-measure achieved by the proposed method, k-means, kernel k-means and spectral clustering. The values for the proposed method were calculated using a value of d0 slightly less than the estimated \( \bar{d}_{\hbox{min} }^{i} \). The standard Matlab implementation was used for the k-means method. Kernel k-means was used as provided in [9]. Spectral clustering was used as provided in [10] and in [12], which implements the algorithm described in [13].

Table 1. Dataset description, with the achieved clustering purity and V-measure by the proposed method, classic and kernel k-means and spectral clustering.

4 Conclusions

A new distance metric for high-dimensional data has been presented that preserves the topology of the underlying manifold. A variant of the k-means clustering algorithm has been proposed that utilizes this metric. The value of the main parameter of the proposed distance metric can be obtained through a simple and efficient procedure. The performance of the proposed method has been analyzed theoretically and validated experimentally on a number of benchmark datasets. Comparative results with well-established clustering algorithms show that the proposed method is a competent alternative with consistent behavior that systematically performs equally well as or better than the other techniques under comparison. Future work includes the algorithmic fine-tuning of the proposed k-means variant and the extension of the distance metric to visualization and dimensionality-reduction techniques. The comparative results will also be expanded to include more optimized implementations of other state-of-the-art methods.