1 Introduction

Data clustering is a technique to group a dataset into a number of subsets based on a “natural” hidden data structure (Cherkassky and Mulier 2007). To capture the underlying data structures, traditional clustering techniques such as the Expectation–Maximization (EM) algorithm (Dempster et al. 1977) assume specific probability distributions as the source from which the dataset is generated. In comparison, density-based methods are attractive due to their non-parametric characteristic, which enables them to deal with arbitrarily shaped clusters (Jain 2010). They rely on spatially varying densities for the detection of clusters. High-density regions are identified as clusters, which are separated by regions of low density (Han and Kamber 2011).

However, most density-based methods have difficulty detecting all clusters when the clusters have large variations in density (Ertöz et al. 2003a; Zhu et al. 2016). For example, DBSCAN (Ester et al. 1996), which uses a global density threshold to discriminate cluster core points from noise, fails to identify all clusters in the presence of greatly varying densities (Zhu et al. 2016).

Many efforts have been devoted to solving the varying densities problem in DBSCAN-like algorithms. Shared-Nearest-Neighbours (SNN) (Jarvis and Patrick 1973; Ertöz et al. 2003a) is an effective technique to this end. It uses the number of shared nearest neighbours between two points as a similarity measure to replace distance in the clustering procedure. Yet, the performance of SNN is sensitive to the number of nearest neighbours used in its similarity calculations (Brito et al. 1997; Ertöz et al. 2003; Tan and Wang 2013). ReScale (Zhu et al. 2016) is a recently proposed approach to tackle the same problem. It rescales a dataset such that the estimated density of a rescaled data point approximates the density ratio of the corresponding point in the original dataset. However, ReScale does not perform well when clusters overlap significantly on some attributes (Zhu et al. 2016).

A more recent density-based clustering method called Clustering by Fast Search and Find of Density Peaks (CFSFDP) (Rodriguez and Laio 2014) employs a density-based approach different from DBSCAN for clustering. Instead of finding core points using a global threshold in the first step, it finds the density peak of every cluster and then links the neighboring points of each peak to form a cluster. CFSFDP overcomes some issues of varying densities of earlier density-based clustering algorithms (e.g., DBSCAN).

While the condition under which DBSCAN fails to detect all clusters has been formalised recently (Zhu et al. 2016), whether such a condition exists in the more recent density-based method CFSFDP is still unknown. We formalise a necessary condition for CFSFDP to identify all clusters in a dataset, and show that large variation of densities is still problematic for CFSFDP.

We propose a new measure called Local Contrast (LC), as an alternative to density, to make density-based clustering algorithms robust against varying densities. The proposed LC is not too sensitive to its parameter setting, and is able to achieve high clustering performance with a default setting.

Though the proposed LC is built on top of a density estimator, it has the following unique theoretical properties:

  • The local modes and local minima of the LC distribution are also the local modes and local minima of the density distribution of the same dataset.

  • The local modes of LC have the same constant LC value, irrespective of the density values of the local modes.

  • The local minima of LC have zero LC value, irrespective of the density values of the local minima.

We utilise LC to create a new version of CFSFDP, named LC-CFSFDP. We show that the new clustering method LC-CFSFDP is more robust against varying densities than CFSFDP.

To benchmark the proposed LC-CFSFDP, we apply SNN and ReScale (which are existing remedies for the density variation issue) to CFSFDP, creating two improved variants called SNN-CFSFDP and ReScale-CFSFDP. Together with the original CFSFDP and its latest improvement called FKNN-DPC (Xie et al. 2016), the four methods are used as contestants against LC-CFSFDP in our experiments. Our empirical evaluation shows that LC-CFSFDP outperforms all four contestants in 18 benchmark datasets.

The rest of the paper is organised as follows. Section 2 presents the related work. Section 3 discusses the weakness of CFSFDP and how to use existing remedies to improve it. Section 4 proposes Local Contrast and shows its properties. Section 5 presents LC-CFSFDP. The empirical evaluation, discussion and conclusions are provided in the last three sections.

2 Related work

Density-based clustering methods such as DBSCAN (Ester et al. 1996) identify high density (core) points using a global threshold and then link all neighbouring core points to form clusters. However, these methods are known to have one key issue, i.e., they have difficulties detecting all clusters when the clusters have large density variations (Ertöz et al. 2003a). Recent research has formalised a necessary condition for DBSCAN to detect all clusters in a dataset (Zhu et al. 2016): if the peak of some cluster has a density lower than that of a low-density region between clusters, then DBSCAN will fail to find all clusters. Many density-based clustering algorithms (Hinneburg and Gabriel 2007; Ram et al. 2009; Borah and Bhattacharyya 2008), like DBSCAN, use a global density threshold to define core points and link them to form clusters. All these algorithms have the same issue. The exact condition under which these density-based algorithms fail (Zhu et al. 2016) is provided in Appendix A for ease of reference.

Researchers have attempted to address the issue of density-based clustering using different approaches. For instance, Shared-Nearest-Neighbours (SNN) (Jarvis and Patrick 1973; Ertöz et al. 2003a) employs an alternative similarity measure to replace the distance measure in the clustering procedure. The similarity between two data points is either the number of their shared K-nearest-neighbours (if they have each other in their K-nearest-neighbour lists) or 0 otherwise. Since the SNN similarity measure takes into account the local distribution of the data points, it is less affected by varying densities of different clusters. It has been shown that DBSCAN with SNN produces better clustering results than DBSCAN with the distance measure (Ertöz et al. 2003a). However, its performance is sensitive to the setting of parameter K and its time complexity is \(O(K^2 N^2)\), instead of \(O(N^2)\) for many other density-based clustering methods such as DBSCAN, because an additional KNN process is required (Zhu et al. 2016).

ReScale (Zhu et al. 2016) is another technique that was recently proposed to overcome the density variation problem in clustering. It is a pre-processing technique that was originally designed for density-based clustering algorithms which use a global density threshold to identify clusters. ReScale enables existing density-based clustering algorithms to perform density-ratio-based clustering, i.e., clusters are defined as regions of locally high density separated by regions of locally low density. The aim is to rescale the data such that the estimated density of each rescaled point is approximately the estimated density-ratio of the corresponding point in the original space, where density-ratio is defined as the ratio of the density of a point to the average density over its \(\eta \)-neighbourhood. A point located at a maximum local density area has a higher density-ratio value than a point located at a minimum local density area. Thus, a density-based clustering algorithm which uses a single threshold can be applied without modification to the rescaled data to identify all clusters of locally high densities. Two additional parameters are introduced: \(\eta \) is used to define the local neighbourhood; and \(\psi \) is used to control the precision of \(\eta \)-neighbourhood density estimation.

A recent density-based clustering algorithm, CFSFDP (Rodriguez and Laio 2014), takes a different approach to reduce the effect of the above-mentioned issue. The idea is to find cluster centres which have higher density than their neighbours and are relatively distant from each other. CFSFDP mitigates the problem of varying densities in some situations because it finds cluster centres not only by high densities, but also by taking into account their distances from other centres. It can detect a low-density cluster centre if it is far from other clusters.

The Fuzzy weighted K-Nearest-Neighbors Density Peak Clustering (FKNN-DPC) (Xie et al. 2016) is a recent effort to improve CFSFDP (Rodriguez and Laio 2014). It uses a procedure similar to that of CFSFDP, except in the density estimation phase and the cluster assignation phase. FKNN-DPC uses a KNN kernel estimator instead of an \(\epsilon \)-neighbourhood estimator. The key difference lies in the cluster assignation phase: FKNN-DPC uses a complex assignation scheme consisting of two strategies based on a series of KNN searches. The heavy use of KNN searches makes the algorithm very sensitive to the K parameter. It does not address the root cause of the problem with clusters of hugely varying densities, because its operation is still based on density, as mentioned in the last paragraph.

It is important to point out that the above improvement over CFSFDP (Xie et al. 2016) was done on procedural steps only (which use a different density estimator and a different scheme to assign points to a cluster), without knowing the root cause.

In this paper, we focus on CFSFDP (Rodriguez and Laio 2014) because it is a powerful and state-of-the-art core method of density-based clustering (Xu and Tian 2015); and we want to identify the key weakness of CFSFDP and its root cause. To achieve this aim, we first formalise the condition under which CFSFDP fails to detect all clusters in a dataset; and reveal that large density variations in a dataset can still harm CFSFDP’s clustering performance significantly under some situations. Then, we propose a new measure called Local Contrast, in place of density, as the primary means to find clusters. We show that this can be easily done using almost the same procedure as CFSFDP; and this overcomes CFSFDP’s weakness from the root cause.

3 Weakness of CFSFDP and current remedies

In Sect. 3.1, we provide a necessary condition for CFSFDP to detect all clusters in a dataset. Its violation will result in CFSFDP failing to detect all clusters. In Sect. 3.2, we create two variants of CFSFDP using existing remedies for the problem of cluster density variations: SNN and ReScale. We show the limitations of these remedies for CFSFDP in the last subsection.

3.1 A necessary condition for CFSFDP

Like most density-based methods, CFSFDP (Rodriguez and Laio 2014) employs a density estimator \(f(\mathbf x)\) to estimate densities for all \(\mathbf x\) in a dataset D. The density estimator is defined as follows:

$$\begin{aligned} f(\mathbf x) = |\{\mathbf y \in D\ |\ d(\mathbf x, \mathbf y) < \epsilon \}|, \end{aligned}$$

where \(d(\cdot ,\cdot )\) is a distance measure and \(\epsilon \) is a cut-off distance; and |Q| is the cardinality of set Q.
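As a minimal sketch, the cut-off estimator above can be computed by direct counting. The function name `cutoff_density` and the toy data are illustrative only, and Euclidean distance is assumed for \(d(\cdot ,\cdot )\). Because the inequality is evaluated for every \(\mathbf y \in D\), each point counts itself.

```python
import math

def cutoff_density(points, eps):
    """For each point x, count the points y in the dataset with d(x, y) < eps."""
    # Assumption: d is the Euclidean distance; the definition allows any distance.
    # x itself is included in the count because d(x, x) = 0 < eps.
    return [sum(1 for y in points if math.dist(x, y) < eps) for x in points]

# Hypothetical toy data: a tight group of three points and one isolated point.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
densities = cutoff_density(toy, eps=0.5)
print(densities)
```

With this cut-off, the three grouped points each receive a count of 3 and the isolated point a count of 1.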

Let \({\mathbf x}^m = \hbox {arg max}_{{\mathbf x} \in D} f(\mathbf x)\) denote the point with the global maximum density. CFSFDP (Rodriguez and Laio 2014) defines a distance function of \(\mathbf x\), \(\delta _f(\mathbf x)\), as follows:

$$\begin{aligned} \delta _{f}(\mathbf x) = \left\{ \begin{array}{ll} {\mathop {\min }\limits _{{f(\mathbf y)>f(\mathbf x)}}} d(\mathbf x, \mathbf y),&{}\forall \mathbf x \in D \backslash \{{\mathbf x}^m\}\\ {\mathop {\max }\limits _{\mathbf y \in D}}\, d(\mathbf x, \mathbf y),&{} \text {if}\quad \mathbf x = {\mathbf x}^m. \end{array} \right. \end{aligned}$$

In other words, \(\delta _f(\mathbf x)\) is the distance between \(\mathbf x\) and its nearest neighbour with a higher density; except that for the point with the global maximum density, \(\delta _f(\mathbf x)\) is the greatest distance between any point and itself. This ensures that the point with the global maximum density is always ranked first in the list of \(f(\mathbf x)\delta _f(\mathbf x)\) sorted in descending order.
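The definition of \(\delta _f\) can be sketched as below. The helper name `delta` is hypothetical and Euclidean distance is assumed. The definition does not say what happens when several points tie for the maximum density; this sketch treats every point that has no strictly denser neighbour like the global maximum, which is an assumption.

```python
import math

def delta(points, score):
    """Distance from each point to its nearest neighbour with a strictly higher score."""
    n = len(points)
    top = max(range(n), key=lambda i: score[i])  # first index holding the global maximum
    out = []
    for i in range(n):
        higher = [math.dist(points[i], points[j])
                  for j in range(n) if score[j] > score[i]]
        if i == top or not higher:
            # No denser point exists: use the greatest distance to any point.
            # (Handling ties at the maximum this way is an assumption of the sketch.)
            out.append(max(math.dist(points[i], y) for y in points))
        else:
            out.append(min(higher))
    return out

# Hypothetical 1-D example with hand-picked densities.
pts = [(0.0,), (1.0,), (3.0,), (10.0,)]
dens = [3, 5, 2, 4]
deltas = delta(pts, dens)
print(deltas)
```

In this example the densest point (at 1.0) receives the greatest distance 9.0, and the point at 10.0 also receives 9.0 because its only denser neighbour is that same point.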

The user then selects the top M points from the ranked list of \(f(\mathbf x)\delta _f(\mathbf x)\), and labels them from 1 to M, as the centres for M clusters.

All points are then sorted in descending order of \(f(\mathbf x)\). One by one from top to bottom of the sorted list, each unlabeled point is assigned to the same cluster as its nearest neighbour with a higher density. The first column in Table 1 provides a summary of the key steps in the CFSFDP procedure.
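The steps described so far (density estimation, \(\delta _f\), centre selection and label propagation) can be put together in a compact sketch. It is not the reference implementation of Rodriguez and Laio (2014): the function name `cfsfdp`, the Euclidean distance, the toy data and the tie-breaking rule (equal densities are ordered by index, so an earlier point counts as "denser") are all assumptions made here for illustration.

```python
import math

def cfsfdp(points, eps, n_clusters):
    """Return a cluster label (1..n_clusters) for every point."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    f = [sum(1 for j in range(n) if d[i][j] < eps) for i in range(n)]
    # Descending density; ties are broken by index (assumption of this sketch).
    order = sorted(range(n), key=lambda i: (-f[i], i))
    delta, parent = [0.0] * n, [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])          # global density maximum
        else:
            # Nearest point that precedes i in the density ordering.
            parent[i] = min(order[:rank], key=lambda j: d[i][j])
            delta[i] = d[i][parent[i]]
    # Top M points by f * delta become the cluster centres.
    centres = sorted(order, key=lambda i: -f[i] * delta[i])[:n_clusters]
    labels = [0] * n
    for label, c in enumerate(centres, start=1):
        labels[c] = label
    # Propagate labels from denser points to sparser ones.
    for i in order:
        if labels[i] == 0:
            labels[i] = labels[parent[i]]
    return labels

# Hypothetical toy data: two small squares, each with a centre point.
square = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
toy = square + [(x + 5.0, y + 5.0) for x, y in square]
labels = cfsfdp(toy, eps=0.12, n_clusters=2)
print(labels)
```

Because each point's parent precedes it in the density ordering, a single pass in that order is enough to label every point.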

Table 1 CFSFDP versus LC-CFSFDP: key steps

CFSFDP requires that the cluster modes, i.e., the points with the maximum density in each cluster, be ranked at the top of the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\) if they are to be selected as cluster centres.

We now state the necessary condition for CFSFDP to identify all clusters of a dataset.

Theorem 1

Given a dataset D of M actual clusters, let \(\mathbb C = \{\mathbf c_m, m = 1,\ldots ,M\}\) denote the M cluster modes, i.e., the points with the maximum density in each cluster with respect to a density estimator \(f(\mathbf x)\). A necessary condition for CFSFDP to correctly identify all clusters is given as follows:

$$\begin{aligned} \min _{\mathbf x \in \mathbb C}f(\mathbf x)\delta _f(\mathbf x) > \max _{\mathbf y \in D{\setminus } \mathbb C}f(\mathbf y)\delta _f(\mathbf y) . \end{aligned}$$
(1)

Proof

A violation of Eq. (1) means that at least one point \(\mathbf z \in \mathbb C\) is not among the top M points in the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\). Then, one of the following three situations will occur:

  (i) If fewer than M points are selected as cluster representatives, then not all clusters are identified.

  (ii) If more than M points are selected as cluster representatives, then some cluster will be divided.

  (iii) If exactly M points are selected as cluster representatives, then point \(\mathbf z \in \mathbb C\) is not selected as a representative. As a result, \(\mathbf z\) will be assigned a label from a point with a higher density. Since \(\mathbf z\) is the density maximum in its own cluster, the point that \(\mathbf z\) links to cannot be from the same cluster. Hence, \(\mathbf z\) and its neighbouring points will be mislabelled as belonging to a different cluster.

In all the above cases, CFSFDP cannot correctly identify all clusters when Eq. (1) is violated. \(\square \)

Note that the condition provided in Theorem 1 is independent of the density estimator used.
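When the true cluster modes are known, as in a synthetic dataset, Eq. (1) can be checked directly. The sketch below is only illustrative: the function name and the numeric values are made up, and `mode_indices` stands for ground-truth knowledge that is not available in practice.

```python
def satisfies_necessary_condition(f, delta, mode_indices):
    """Check Eq. (1): min over modes of f*delta > max over non-modes of f*delta."""
    modes = set(mode_indices)
    gamma = [fi * di for fi, di in zip(f, delta)]
    lowest_mode = min(gamma[i] for i in modes)
    highest_other = max(gamma[i] for i in range(len(gamma)) if i not in modes)
    return lowest_mode > highest_other

# Hypothetical values: points 0 and 2 are the true modes of two clusters.
delta_f = [5.0, 0.5, 4.0, 1.0, 0.2]
holds = satisfies_necessary_condition([10, 9, 3, 8, 2], delta_f, [0, 2])
# Lowering the density of the sparse mode (point 2) from 3 to 1 breaks the condition.
fails = satisfies_necessary_condition([10, 9, 1, 8, 2], delta_f, [0, 2])
print(holds, fails)
```

In the second call the sparse mode's product \(1 \times 4.0\) falls below the product \(8 \times 1.0\) of a non-mode point in the dense cluster, so the condition no longer holds.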

Fig. 1 The clustering result of CFSFDP on the synthetic dataset. Note that the brown square marker in c denotes the density peak of the middle cluster, which is ranked 7th in the Decision Graph shown in b. It is not selected in the final result shown in d because selecting 6 representatives produces the best F-measure. a Density distribution, b CFSFDP decision graph, c density peaks, d clustering result, \(\hbox {F}=0.84821\) (Color figure online)

The basic assumptions of CFSFDP are that (i) each cluster centre has the maximum density among all points within the cluster, and (ii) all cluster centres are well separated. While these two assumptions are usually true, the maximum densities of different clusters cannot be guaranteed to be the same, or even similar.

Because density cannot provide such a guarantee, the use of density becomes the root cause of CFSFDP’s weakness in detecting all clusters having hugely different densities. When clusters have significantly different densities, low-density centres that do not have a sufficiently long distance \(\delta _f(\cdot )\) will be ranked low in the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\). As a result, the algorithm fails to correctly identify all clusters. An example is shown in Fig. 1. The top dense cluster has multiple peaks; and the centre of the sparse cluster has significantly lower density than these peaks. CFSFDP fails to detect the 4 clusters correctly because the mode of the sparse cluster has a density which is too low for the mode to be ranked in the top four in the sorted list of \(f(\mathbf x)\delta _f(\mathbf x)\), shown in Fig. 1b.

To overcome this weakness, we provide an alternative to density which has the necessary properties to detect all clusters of different densities using exactly the same CFSFDP procedure. This alternative measure will be introduced in Sect. 4; and our analysis in Sect. 5 shows that the alternative measure is more robust than density in a dataset having clusters of different densities using the same CFSFDP procedure.

3.2 Improving CFSFDP using existing methods of improving DBSCAN

SNN (Jarvis and Patrick 1973; Ertöz et al. 2003a) and ReScale (Zhu et al. 2016) are two existing methods designed to address the issue of DBSCAN-like clustering methods in datasets having huge density variations.

One can use either of these existing methods to improve CFSFDP. These can be applied straightforwardly. The following two subsections provide the details of two modified versions of CFSFDP: SNN-CFSFDP and ReScale-CFSFDP.

3.2.1 SNN-CFSFDP

SNN-CFSFDP follows the same procedure as CFSFDP except that the distance measure used in both \(f(\cdot )\) and \(\delta _{f}(\cdot )\) is replaced with the shared nearest neighbour dissimilarity measure (Ertöz et al. 2003a).

Let \(N_K(\mathbf x)\) denote the K nearest neighbours of \(\mathbf x\) in a dataset D, with respect to Euclidean distance. The shared nearest neighbour dissimilarity (SNN) of two points \(\mathbf x\) and \(\mathbf y\) is defined as

$$\begin{aligned} { SNN}(\mathbf x, \mathbf y)= \left\{ \begin{array}{l} 1 - |N_K(\mathbf x) \cap N_K(\mathbf y)|/K, \text { if } \mathbf y \in N_K(\mathbf x) \text { and } \mathbf x \in N_K(\mathbf y) \\ 1, \text { otherwise.} \end{array} \right. \end{aligned}$$

SNN-CFSFDP then calculates both the density \(f_{{ SNN}}(\mathbf x)\) and the nearest distance to a higher density point \(\delta _{f_{{ SNN}}}(\mathbf x)\) in terms of \({ SNN}\) dissimilarity as follows:

$$\begin{aligned} f_{{ SNN}}(\mathbf x) = |\{\mathbf y \in D\ |\ { SNN}(\mathbf x, \mathbf y) < \epsilon \}| , \end{aligned}$$

and

$$\begin{aligned} \delta _{f_{{ SNN}}}(\mathbf x) = \left\{ \begin{array}{ll} {\mathop {\hbox {min}}\limits _{f_{{ SNN}}(\mathbf y)>f_{{ SNN}}(\mathbf x)}} { SNN}(\mathbf x, \mathbf y),&{} \forall \mathbf x \in D {\setminus } \{\mathbf x^w \} \\ {\mathop {\hbox {max}}\limits _{\mathbf y \in D}}\, { SNN}(\mathbf x, \mathbf y),&{}\text {if}\quad \mathbf x = \mathbf x^w, \end{array} \right. \end{aligned}$$

where \(\epsilon \) is the cut-off \({ SNN}\) dissimilarity and \(\mathbf x^w = \hbox {arg max}_{{\mathbf x} \in D} f_{{ SNN}}(\mathbf x)\).

Given \(f_{{ SNN}}(\mathbf x)\) and \(\delta _{f_{{ SNN}}}(\mathbf x)\), the rest of the procedure is the same as CFSFDP. The summary of the key steps is given in the second column of Table 2. Note that if the procedure is implemented with an input of a dissimilarity matrix, the \({ SNN}\) dissimilarity matrix can be computed in a pre-processing step. This is shown in step 0 in Table 2.
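The pre-computation of the \({ SNN}\) dissimilarity matrix can be sketched as follows. The function names are hypothetical. Two details are assumptions of this sketch rather than statements from the text: a point is excluded from its own K-nearest-neighbour list, and the dissimilarity of a point to itself is set to 0.

```python
import math

def knn_sets(points, k):
    """Indices of the k nearest neighbours of each point (the point itself excluded)."""
    n = len(points)
    result = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        result.append(set(others[:k]))
    return result

def snn_dissimilarity(points, k):
    """N x N matrix of shared-nearest-neighbour dissimilarities."""
    nk = knn_sets(points, k)
    n = len(points)
    snn = [[1.0] * n for _ in range(n)]       # default: no mutual neighbourhood
    for i in range(n):
        snn[i][i] = 0.0                        # assumption: zero self-dissimilarity
        for j in range(n):
            if i != j and j in nk[i] and i in nk[j]:
                snn[i][j] = 1.0 - len(nk[i] & nk[j]) / k
    return snn

# Hypothetical 1-D data: two well-separated groups of three points.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
snn = snn_dissimilarity(pts, k=2)
print(snn[0][1], snn[2][3])
```

Points within the same group that list each other as neighbours share one of their two neighbours and get a dissimilarity of 0.5, while points from different groups keep the default value of 1.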

Table 2 SNN-CFSFDP versus ReScale-CFSFDP: key steps of the two algorithms

3.2.2 ReScale-CFSFDP

ReScale-CFSFDP pre-processes the dataset before utilizing the exact same procedure of CFSFDP. ReScale first estimates the density distribution on each dimension of the dataset D, with an \(\eta \)-neighbourhood estimator and a resolution of \(\psi \). It then scales the dataset D along each dimension based on the cumulative distribution, to yield a new dataset \(D'\).

Let \(D_i, \mathbf x_i\) denote the i-th attribute of dataset D and data point \(\mathbf x\), respectively. For each attribute i, ReScale divides the range of \(D_i\) into \(\psi \) equal segments, yielding \(\psi + 1\) grid points \(s_j, j=\{1,\ldots ,\psi +1\}\) and \(s_q > s_j\), for all \(q>j\). It then estimates the densities of \(s_j\) as follows:

$$\begin{aligned} f(s_j) = |\{ \mathbf x \in D\ |\ (s_j-\eta ) < \mathbf x_i \le (s_j+\eta ) \}| . \end{aligned}$$

The value of the i-th attribute of a transformed point \(\mathbf x'\) is then given by

$$\begin{aligned} \mathbf x_i' = \sum _{j=1}^{\psi +1} f(s_j)I_{\{\mathbf x_i \ge s_j\}}, \end{aligned}$$

which is the cumulative marginal probability of \(\mathbf x\) on attribute i. After processing each attribute, ReScale normalises the transformed dataset \(D'\) to be in [0, 1]. The detailed algorithm can be found in Zhu et al. (2016).
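For a single attribute, the two formulas above can be sketched as follows. This is an illustration only and not the algorithm of Zhu et al. (2016): the function name is hypothetical, \(\eta \) is taken in the units of the attribute, the final step uses min-max normalisation, and degenerate inputs (a constant attribute) are not handled.

```python
def rescale_attribute(values, psi, eta):
    """Rescale one attribute by its estimated cumulative density, then map to [0, 1]."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / psi
    grid = [lo + j * step for j in range(psi + 1)]          # psi + 1 grid points s_j
    # f(s_j): number of values in the half-open window (s_j - eta, s_j + eta].
    dens = [sum(1 for v in values if s - eta < v <= s + eta) for s in grid]
    # x'_i: sum of f(s_j) over all grid points not exceeding the value.
    raw = [sum(f for s, f in zip(grid, dens) if v >= s) for v in values]
    rlo, rhi = min(raw), max(raw)
    # Assumption: min-max normalisation; a constant attribute would divide by zero.
    return [(r - rlo) / (rhi - rlo) for r in raw]

# Hypothetical attribute: four close values and one distant value.
vals = [0.0, 1.0, 2.0, 3.0, 10.0]
rescaled = rescale_attribute(vals, psi=10, eta=1.5)
print(rescaled)
```

In this example the gap between 3.0 and 10.0 covers 70% of the original range but only 3/11 of the rescaled range, because few points fall in that interval, while the dense region is stretched.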

Using \(D'\), the rest of the procedure is the same as CFSFDP. The key steps of ReScale-CFSFDP are given in the last column of Table 2.

3.2.3 Limitations of SNN-CFSFDP and ReScale-CFSFDP

We apply SNN-CFSFDP and ReScale-CFSFDP on the synthetic dataset as shown in Fig. 1, and their clustering results are given in Figs. 2 and 3, respectively. Though both methods improve the F-measure compared to the original CFSFDP, they still fail to correctly identify all clusters: SNN-CFSFDP splits the two dense clusters at the bottom into four clusters; ReScale-CFSFDP splits the top cluster into two clusters.

SNN has two weaknesses. First, it is sensitive to the K parameter (Brito et al. 1997; Ertöz et al. 2003; Tan and Wang 2013). Second, with a time complexity of \(O(K^2N^2)\), it is computationally expensive. In this example, the default setting of \(K=\sqrt{N}\) leads to an undesirable result as shown in Fig. 2, in which the true peaks #5 and #6 in Fig. 2c cannot out-rank false peak #2, because the distance \(\delta \) based on SNN dissimilarity is not large enough. K needs to be carefully tuned in order to produce the desired clustering outcome. We will provide further analysis of this issue in Sect. 6.

The ReScale approach aims to transform the dataset to be uniformly distributed along each attribute. However, when clusters overlap significantly on some attribute(s), it becomes problematic as exemplified in Fig. 3: when projected onto the x-axis, because of the overlapping of clusters along x-axis, there are abundant data points in the middle and fewer data points at each end. The ReScale approach therefore shifts data points from the middle to both ends, causing the upper cluster to have two dense regions at both ends after the transformation. A rotation of the dataset is proposed in Zhu et al. (2016) to remedy this weakness. However, without prior knowledge of the dataset, it is difficult to find an orientation that works well, if such an orientation exists.

Fig. 2 The clustering result of SNN-CFSFDP on the synthetic dataset. The parameter K is fixed to \(\sqrt{N}\). \(\epsilon \) and M are searched for the best F-measure. a Density distribution, b SNN-CFSFDP decision graph, c density peaks, d clustering result, \(\hbox {F}=0.88197\)

Fig. 3 The clustering result of ReScale-CFSFDP on the synthetic dataset. The distribution shown is for the ReScaled dataset \(D'\). \(\psi \) and \(\eta \) are fixed to 100 and 0.2 respectively while \(\epsilon \) and M are searched for the best F-measure. a Density distribution, b ReScale-CFSFDP decision graph, c density peaks, d clustering result, \(\hbox {F}=0.92721\)

4 Local contrast

We propose Local Contrast as a new remedy for the density variation problem in clustering. Unlike SNN or ReScale, it is not sensitive to the parameter K, nor does it need to rescale the dataset.

Here we provide the definition of Local Contrast and describe its properties which empower clustering algorithms to be more robust against varying densities.

Given a dataset D and a density estimator \(f(\cdot )\), we define Local Contrast as follows:

Definition 1

Local Contrast of an instance \(\mathbf x\) is defined as the number of its K nearest neighbours that have a lower density than \(\mathbf x\):

$$\begin{aligned} LC(\mathbf x) = \sum _{\mathbf y \in N_K(\mathbf x)} I_{\{f(\mathbf x) > f(\mathbf y)\}} \end{aligned}$$

where \(N_K(\mathbf x)\) is the set of K nearest neighbours of \(\mathbf x\) and \(I_{\{\cdot \}}\) is an indicator function.
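Definition 1 translates almost directly into code. In the sketch below the function name, the Euclidean distance, the exclusion of a point from its own neighbour list and the hand-picked density values are assumptions for illustration; any density estimator could supply the `density` list.

```python
import math

def local_contrast(points, density, k):
    """LC(x): number of the k nearest neighbours of x with a lower density than x."""
    n = len(points)
    lc = []
    for i in range(n):
        # Assumption: x is not counted among its own nearest neighbours.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: math.dist(points[i], points[j]))[:k]
        lc.append(sum(1 for j in neighbours if density[i] > density[j]))
    return lc

# Hypothetical 1-D example: a high peak (density 9) and a low peak (density 3).
pts = [(float(x),) for x in range(9)]
dens = [1, 5, 9, 5, 1, 2, 3, 2, 1]
lc = local_contrast(pts, dens, k=2)
print(lc)
```

Both peaks obtain the same value \(LC = K = 2\) although their densities differ by a factor of three, and the valley between them obtains \(LC = 0\).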

Local Contrast has three properties.

Property 1

The local modes and the local minima of \(LC(\mathbf x)\) are also the local modes and the local minima of \(f(\mathbf x)\), with a proper choice of K.

Property 2

The local modes of \(LC(\mathbf x)\) that correspond to the local modes of \(f(\mathbf x)\) have \(LC(\mathbf x)= K\), irrespective of the density of \(f(\mathbf x)\).

Property 3

The local minima of \(LC(\mathbf x)\) that correspond to the local minima of \(f(\mathbf x)\) have \(LC(\mathbf x)=0\), irrespective of the density of \(f(\mathbf x)\).

Proof of Properties 1, 2 and 3

Let \(\mathbf p\) and \(\mathbf q\) be a local density maximum and a local density minimum, respectively. Assume that a proper choice of K exists such that for all \(\mathbf x \in N_K(\mathbf p)\), \(f(\mathbf p) > f(\mathbf x)\); and for all \(\mathbf x \in N_K(\mathbf q)\), \(f(\mathbf q) < f(\mathbf x)\).

Let \(G \subseteq N_K(\mathbf p)\) be the maximal subset of \(N_K(\mathbf p)\) such that for all \(\mathbf x \in G\), \(\mathbf p \in N_K(\mathbf x)\). In other words, \(\mathbf p\) is one of the K-nearest-neighbours of each member of G. Since G is a subset of \(N_K(\mathbf p)\), \(\mathbf p\) is also a local density maximum in the neighbourhood defined by G.

By Definition 1, we have

$$\begin{aligned} LC(\mathbf p) = \sum _{\mathbf x \in N_K(\mathbf p)} I_{\{f(\mathbf p) > f(\mathbf x)\}} = K \;, \end{aligned}$$

and for all \(\mathbf x \in G\), we have

$$\begin{aligned} LC(\mathbf x)&= \sum _{\mathbf y \in N_K(\mathbf x)} I_{\{f(\mathbf x)> f(\mathbf y)\}} \\&= \left( \sum _{\mathbf y \in (N_K(\mathbf x){\setminus } \mathbf p)} I_{\{f(\mathbf x)> f(\mathbf y)\}} \right) + I_{\{f(\mathbf x) > f(\mathbf p)\}} \\&\le K-1 + 0 \\&< K = LC(\mathbf p). \end{aligned}$$

Thus, \(\mathbf p\) is also a local mode of \(LC(\mathbf x)\) in the neighbourhood defined by G.

Similarly, let \(V \subseteq N_K(\mathbf q)\) be the maximal subset of \(N_K(\mathbf q)\) such that for all \(\mathbf x \in V\), \(\mathbf q \in N_K(\mathbf x)\). In other words, \(\mathbf q\) is one of the K-nearest-neighbours of each member of V. \(\mathbf q\) is also the local density minimum in the neighbourhood defined by V.

The same argument follows that \(LC(\mathbf q) = 0 < LC(\mathbf x), \forall \mathbf x \in V\). \(\square \)

The properties of Local Contrast listed above depend on a proper choice of K for a given dataset. The range of K that can be used is usually large. In other words, Local Contrast is not too sensitive to the setting of K.

Figure 4 provides an illustration of the properties of LC. Note that when K is set within the range of 25 to 500, Properties 2 and 3 still hold; and Property 1 holds for all settings of K shown.

Fig. 4 A dataset of size \(N = 2700\) is drawn from a mixture of three univariate Gaussian sources. The distributions of density and LC (with different K values) are shown

Throughout this paper, all experiments are done with the default setting \(K = \sqrt{N}\), the square root of the dataset size, as suggested by some researchers for K nearest neighbour procedures (Ferilli et al. 2008; Zitzler et al. 2004; Fukunaga 1990).

5 Improving CFSFDP with local contrast

We create a version of CFSFDP, called LC-CFSFDP, by replacing density with LC in the clustering procedure. Given a dataset D and a density estimator \(f(\cdot )\), \(LC(\mathbf x)\) is calculated as defined in Definition 1 for all \(\mathbf x\) in D.

Given \(LC(\cdot )\), \(\delta _{LC}(\mathbf x)\) is defined as follows:

$$\begin{aligned} \delta _{LC}(\mathbf x) = \left\{ \begin{array}{ll} \displaystyle \min _{LC(\mathbf y)>LC(\mathbf x)} d(\mathbf x, \mathbf y),&{} \forall \mathbf x \in D {\setminus } \{\mathbf x^\omega \} \\ \displaystyle \max _{\mathbf y \in D} d(\mathbf x, \mathbf y),&{} \text {if}\quad \mathbf x = \mathbf x^\omega \end{array} \right. \end{aligned}$$

where \(\mathbf x^\omega = \hbox {arg max}_{{\mathbf x} \in D} LC(\mathbf x)\) denotes the point with the global maximum LC; and \(d(\cdot ,\cdot )\) is the Euclidean distance.

In other words, \(\delta _{LC}(\mathbf x)\) is defined to be the distance between \(\mathbf x\) and its nearest neighbour with a higher LC, except when \(\mathbf x\) is the point with the maximum LC. In that case, \(\delta _{LC}(\mathbf x)\) is defined to be the maximum distance between \(\mathbf x\) and any point in D.

Here distance \(\delta _{LC}(\mathbf x)\) is analogous to the distance from a point’s nearest neighbour with a higher density \(\delta _f(\mathbf x)\) used in CFSFDP (Rodriguez and Laio 2014). Given \(LC(\mathbf x)\) and \(\delta _{LC}(\mathbf x)\), cluster centres are then chosen from a decision graph where all points are sorted in descending order of \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\).

Definition 2

Cluster centres are defined to be the top M points with the highest \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) values, where M is a user input parameter.

After selecting M cluster centres, all unlabeled data points are then assigned one by one in descending order of LC, each receiving the same cluster label as its nearest neighbour with a higher LC. A contrast between the LC-CFSFDP and CFSFDP procedures is given in Table 1.
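The LC-CFSFDP procedure described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function name is hypothetical, densities are passed in as a list so that any estimator can be used, a point is excluded from its own neighbour list, and ties in LC (which are frequent because LC is an integer between 0 and K) are broken by index.

```python
import math

def lc_cfsfdp(points, density, k, n_clusters):
    """Return (LC values, cluster labels 1..n_clusters) for all points."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Local Contrast as in Definition 1.
    lc = []
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])[:k]
        lc.append(sum(1 for j in nbrs if density[i] > density[j]))
    # Descending LC; ties are broken by index (assumption of this sketch).
    order = sorted(range(n), key=lambda i: (-lc[i], i))
    delta, parent = [0.0] * n, [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])               # point with the global maximum LC
        else:
            parent[i] = min(order[:rank], key=lambda j: d[i][j])
            delta[i] = d[i][parent[i]]
    # Definition 2: top M points by LC * delta_LC are the cluster centres.
    centres = sorted(order, key=lambda i: -lc[i] * delta[i])[:n_clusters]
    labels = [0] * n
    for label, c in enumerate(centres, start=1):
        labels[c] = label
    for i in order:                            # descending LC
        if labels[i] == 0:
            labels[i] = labels[parent[i]]
    return lc, labels

# Hypothetical 1-D data: a dense peak (density 90) and a sparse peak (density 3).
pts = [(float(x),) for x in range(9)]
dens = [1, 50, 90, 50, 1, 2, 3, 2, 1]
lc, labels = lc_cfsfdp(pts, dens, k=2, n_clusters=2)
print(lc, labels)
```

Both peaks reach \(LC = K\), so the sparse peak at position 6 is selected as the second centre on the strength of its distance \(\delta _{LC} = 4\) alone.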

As shown in Table 1, LC-CFSFDP follows the same procedure as CFSFDP. The key difference between the two is that LC-CFSFDP replaces density with LC. As a result, analogous to the condition stated in Eq. (1), a necessary condition for LC-CFSFDP to detect all clusters correctly can be written as

$$\begin{aligned} \min _{\mathbf x \in \mathbb C}LC(\mathbf x)\delta _{LC}(\mathbf x) > \max _{\mathbf y \in D{\setminus } \mathbb C}LC(\mathbf y)\delta _{LC}(\mathbf y), \end{aligned}$$

where \(\mathbb C\) here denotes the set of points with maximum LC in each cluster.

Let \(\check{\mathbf x} = \hbox {arg min}_{\mathbf x \in \mathbb C}LC(\mathbf x)\delta _{LC}(\mathbf x)\) and \(\hat{\mathbf y} = \hbox {arg max}_{\mathbf y \in D{\setminus } \mathbb C}LC(\mathbf y)\delta _{LC}(\mathbf y)\). The above condition can be rewritten as

$$\begin{aligned} \frac{LC(\check{\mathbf x})}{LC(\hat{\mathbf y})} > \frac{\delta _{LC}(\hat{\mathbf y})}{\delta _{LC}(\check{\mathbf x})}. \end{aligned}$$
(2)

The corresponding rewritten condition for the density-based Eq. (1) is given as follows:

$$\begin{aligned} \frac{f(\acute{\mathbf x})}{f(\grave{\mathbf y})} > \frac{\delta _{f}(\grave{\mathbf y})}{\delta _{f}(\acute{\mathbf x})}. \end{aligned}$$
(3)

where \(\acute{\mathbf x} = \hbox {arg min}_{\mathbf x \in \mathbb C}f(\mathbf x)\delta _{f}(\mathbf x)\) and \(\grave{\mathbf y} = \hbox {arg max}_{\mathbf y \in D{\setminus } \mathbb C}f(\mathbf y)\delta _{f}(\mathbf y)\).

Equation (2) is much easier to satisfy than Eq. (3) because the properties of LC ensure that every member of \(\mathbb C\) has the maximum LC value (i.e., Property 2 stated in Sect. 4), irrespective of the density distribution. This makes the left side of Eq. (2) not less than 1. Thus Eq. (2) is harder to violate. In contrast, in a data distribution which has greatly varying densities between clusters, the left side of Eq. (3) could easily be smaller than 1 if \(\acute{\mathbf x}\) is from a cluster of low density, which makes a violation of Eq. (3) much more likely.
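The step from the product form of the condition to Eq. (2) can be spelled out as follows. This is a sketch which assumes \(LC(\hat{\mathbf y}) > 0\) and \(\delta _{LC}(\check{\mathbf x}) > 0\), so that dividing both sides by \(LC(\hat{\mathbf y})\,\delta _{LC}(\check{\mathbf x})\) preserves the direction of the inequality; these positivity assumptions are not stated explicitly in the text.

$$\begin{aligned} LC(\check{\mathbf x})\,\delta _{LC}(\check{\mathbf x})&> LC(\hat{\mathbf y})\,\delta _{LC}(\hat{\mathbf y}) \\ \Longleftrightarrow \quad \frac{LC(\check{\mathbf x})}{LC(\hat{\mathbf y})}&> \frac{\delta _{LC}(\hat{\mathbf y})}{\delta _{LC}(\check{\mathbf x})}. \end{aligned}$$

When Property 2 holds for every member of \(\mathbb C\), \(LC(\check{\mathbf x}) = K\); and since \(LC(\hat{\mathbf y}) \le K\) by Definition 1, the left side is at least 1. Under these assumptions, the condition is satisfied whenever \(\delta _{LC}(\hat{\mathbf y}) < \delta _{LC}(\check{\mathbf x})\).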

To demonstrate this, we apply LC-CFSFDP on the same example dataset shown in Fig. 1. Figure 5 shows that CFSFDP ranks the centre of the sparse cluster (rank #7) lower than the multiple peaks in the elongated cluster (ranks #2, 4, 5 and 6) in the decision graph. By simply replacing density \(f(\cdot )\) with \(LC(\cdot )\), LC-CFSFDP allows the centre of the sparse cluster to be ranked in the top four. This difference in ranking is the key to improving the algorithm because all peaks now have about the same LC values, by virtue of Properties 1 and 2, stated in the last section. As a result, the ranking of peaks due to \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) is mainly influenced by \(\delta _{LC}(\mathbf x)\). Since multiple peaks in one cluster tend to have smaller \(\delta _{LC}(\mathbf x)\), the algorithm is more likely to select one peak from each cluster, which makes the algorithm more robust against significant density differences in the presence of multiple peaks in one cluster.

Fig. 5

The top seven points as determined by CFSFDP are shown in a. b shows that the densities of these top points vary hugely while their LCs (with \(K=\sqrt{N} = \sqrt{1250}\)) have similar values close to the maximum, due to Properties 1 and 2 of LC. c compares the rankings of these points based on \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) with those based on \(f(\mathbf x) \times \delta _{f}(\mathbf x)\). Note the large change in the rank position of the seventh point, from rank #7 to rank #2. a Top 7 points in decision graph, b density versus LC graph, c \(\hbox {f} \times \delta _{\mathrm{f}}\) versus \(\hbox {LC} \times \delta _{\mathrm{LC}}\)

Figure 5a shows the top seven points on the synthetic dataset, as determined by CFSFDP using density. Figure 5b shows the normalised density and LC of these seven points. Figure 5c shows the rankings due to \(LC(\mathbf x) \times \delta _{LC}(\mathbf x)\) and \(f(\mathbf x) \times \delta _f(\mathbf x)\). Replacing density with LC has enabled the centre of the sparse cluster to move from rank #7 to rank #2.

The complete clustering result of the LC version of CFSFDP is shown in Fig. 6. Compared to the clustering result of CFSFDP shown in Fig. 1, LC-CFSFDP has much stronger detecting power in the presence of varying densities and multiple density peaks in one cluster (as shown in the top cluster in Fig. 5a): it identifies the four correct clusters and improves the F-measure from 0.85 to 0.98.

Fig. 6

The clustering result of LC-CFSFDP on the synthetic dataset. For clarity of presentation, we plot only the top 50 points in the decision graph in plot b. The clustering result is the optimal result in terms of F-measure, obtained by conducting a grid search of parameters \(\epsilon \) and M. a LC distribution, b LC-CFSFDP decision graph, c LC peaks, d clustering result, \(\hbox {F}=0.98478\)

6 Experiments

To show the power of Local Contrast, we conduct experiments using 18 benchmark datasets which have been used in the literature (Chang and Yeung 2008; Gionis et al. 2007; Jain and Law 2005; Lichman 2013; Müller et al. 2009)\(^{1}\). Table 3 provides the characteristics of the datasets.

Table 3 Characteristics of datasets used in the experiments, where N is the dataset size, d is the number of features and M is the number of classes

In all experiments, the performance is measured in terms of F-measure. Given a clustering result, we calculate the precision score \(p_{m}\) and the recall score \(r_{m}\) for each cluster \(C_{m}\) based on the confusion matrix. The F-measure of \(C_{m}\) is the harmonic mean of \(p_{m}\) and \(r_{m}\). We then use the Hungarian algorithm (Kuhn 1955) to find the optimal matching for all clusters. The overall F-measure is the weighted average over all clusters: F-measure \(=\sum _{m=1}^{M}\frac{|C_m|}{N} \times \frac{2p_{m}r_{m}}{p_{m}+r_{m}}\), where N is the dataset size. In the calculation of the F-measure, points labelled as noise are not removed from the dataset, but they are not regarded as a cluster. In addition, we evaluate the performance in terms of the Adjusted Rand Index (ARI) (Hubert and Arabie 1985). The outcome is similar to that obtained using F-measure. For clarity of presentation, we provide the results based on ARI in Appendix B.
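The F-measure computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the weight \(|C_m|\) is taken to be the size of the predicted cluster, noise is marked with the label `-1`, and the Hungarian algorithm is replaced by an exhaustive search over matchings, which gives the same optimum but is only practical for a small number of clusters.

```python
from itertools import permutations

def pair_f(cluster, klass):
    """Harmonic mean of precision and recall for one (cluster, class) pair."""
    overlap = len(cluster & klass)
    if overlap == 0:
        return 0.0
    p = overlap / len(cluster)
    r = overlap / len(klass)
    return 2 * p * r / (p + r)

def overall_f_measure(truth, pred, noise=-1):
    """Weighted F-measure under the best one-to-one cluster/class matching.
    Noise points stay in N but never form a cluster (brute-force matching
    stands in for the Hungarian algorithm in this sketch)."""
    n = len(truth)
    classes, clusters = {}, {}
    for i, (t, c) in enumerate(zip(truth, pred)):
        classes.setdefault(t, set()).add(i)
        if c != noise:
            clusters.setdefault(c, set()).add(i)
    cl, ks = list(clusters.values()), list(classes.values())
    best = 0.0
    if len(cl) <= len(ks):
        # Assign each cluster a distinct class.
        for perm in permutations(range(len(ks)), len(cl)):
            score = sum(len(cl[i]) / n * pair_f(cl[i], ks[j]) for i, j in enumerate(perm))
            best = max(best, score)
    else:
        # More clusters than classes: unmatched clusters contribute zero.
        for perm in permutations(range(len(cl)), len(ks)):
            score = sum(len(cl[i]) / n * pair_f(cl[i], ks[j]) for j, i in enumerate(perm))
            best = max(best, score)
    return best

# Toy labels (illustrative only): one point of class 0 is put in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
pred = [5, 5, 9, 9, 9, 9]
score = overall_f_measure(truth, pred)
```

For larger numbers of clusters, a polynomial-time assignment solver would be the natural replacement for the exhaustive search.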

For every method, we search its parameter space and record the best F-measure achieved. For all versions of CFSFDP, the value of the cut-off distance/dissimilarity \(\epsilon \) is set to the average distance between each point and its nearest neighbour at a given percentile. This percentile is searched within \([0.1\%,10\%]\), with a step increment of 0.1%. All methods automatically select the M points that rank at the top of their respective decision graphs to be the cluster centres, where M is searched within \(\{2,3,\ldots ,20\}\). For the K-Nearest-Neighbour search involved in LC-CFSFDP, SNN-CFSFDP and FKNN-DPC, the parameter K is set to the nearest integer to \(\sqrt{N}\), the square root of the dataset size. For ReScale-CFSFDP, the parameter \(\psi \) is set to 100 as suggested by Zhu et al. (2016), and \(\eta \) is determined in the following way: we search \(\eta \) within \(\{0.05,0.1,\ldots ,0.5\}\) and find the value that yields the best average F-measure over the 18 datasets, which is 0.2. We set \(\eta \) to 0.2 for all the experiments. As a result, ReScale-CFSFDP has been given an additional advantage compared with the other methods. A summary of the parameter settings is provided in Table 4.
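One possible reading of the \(\epsilon \) and K settings above is sketched below. The interpretation is an assumption of this sketch: a percentile p is mapped to the neighbour of rank round(p% of N-1), with a minimum rank of 1, and the helper names `cutoff_distance` and `default_k` are hypothetical.

```python
import math

def cutoff_distance(points, percentile):
    """Average, over all points, of the distance to the neighbour whose rank
    corresponds to the given percentile (assumed mapping: rank = p% of N-1)."""
    n = len(points)
    k = max(1, round(percentile / 100 * (n - 1)))
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += dists[k - 1]
    return total / n

def default_k(n):
    """K set to the nearest integer to the square root of the dataset size."""
    return round(math.sqrt(n))

# Search grid for the percentile: 0.1%, 0.2%, ..., 10%.
percentiles = [i / 10 for i in range(1, 101)]

# Toy 1-D data (illustrative only).
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
eps_small = cutoff_distance(points, percentiles[0])
eps_full = cutoff_distance(points, 100)
```

Each value in `percentiles` would then be combined with each candidate M in the grid search, keeping the setting with the best F-measure.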

Table 4 Parameters and their search ranges

6.1 Comparing LC to SNN and ReScale

The results in Table 5 show that LC-CFSFDP has the best clustering performance among the four approaches, with an average rank of 1.61, followed by ReScale-CFSFDP with rank 2.39, SNN-CFSFDP with rank 2.83 and CFSFDP with rank 3.00. In terms of win/draw/loss counts with respect to the base algorithm CFSFDP, LC-CFSFDP has 15 wins, 2 draws and 1 loss; SNN-CFSFDP has 11 wins and 7 losses; and ReScale-CFSFDP has 10 wins and 8 losses. The Friedman test results in Table 6 show that LC-CFSFDP outperforms CFSFDP and SNN-CFSFDP significantly, with p-values < 0.02. Compared with ReScale-CFSFDP, LC-CFSFDP has 11 wins, 1 draw and 6 losses, although the difference is not significant. Note that ReScale-CFSFDP has an unfair advantage because its parameter \(\eta \) is set to the value that gives the best average F-measure over the 18 datasets, whereas SNN-CFSFDP and LC-CFSFDP have no such advantage.

Table 5 Comparison of original and improved versions of CFSFDP in terms of F-measures
Table 6 Pairwise Friedman tests: p-values
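The summary statistics used in this comparison (average rank and win/draw/loss counts) can be computed as in the following sketch. The method names and scores are made-up toy values, not results from Table 5, and the function names are hypothetical; tied scores share the average of the ranks they span, which is the usual convention and is assumed here.

```python
def average_ranks(scores):
    """scores: dict mapping method name -> list of per-dataset scores
    (higher is better). Returns the mean rank of each method, with
    tied methods sharing the average of the tied rank positions."""
    methods = list(scores)
    n = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n):
        row = sorted(methods, key=lambda m: scores[m][d], reverse=True)
        pos = 0
        while pos < len(row):
            end = pos
            while end + 1 < len(row) and scores[row[end + 1]][d] == scores[row[pos]][d]:
                end += 1
            shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
            for m in row[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return {m: totals[m] / n for m in methods}

def win_draw_loss(a, b):
    """Counts of datasets where method a beats, ties with, or loses to method b."""
    wins = sum(x > y for x, y in zip(a, b))
    draws = sum(x == y for x, y in zip(a, b))
    return wins, draws, len(a) - wins - draws

# Toy scores for three hypothetical methods on three datasets.
toy = {"A": [0.9, 0.8, 0.7], "B": [0.8, 0.8, 0.9], "C": [0.7, 0.6, 0.8]}
ranks = average_ranks(toy)
wdl = win_draw_loss(toy["A"], toy["B"])
```

In practice, scores would be rounded to the reported precision before comparison so that draws are detected consistently.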

In a nutshell, Local Contrast significantly improves the CFSFDP algorithm, and the resultant LC-CFSFDP is the best performer among the state-of-the-art density-based clustering methods compared here.

6.2 Comparing LC-CFSFDP to FKNN-DPC

As shown in Table 7, the performance of FKNN-DPC is poor when K is fixed to \(\sqrt{N}\), due to its sensitivity to K. Therefore, we also compare LC-CFSFDP to FKNN-DPC with the parameter K being searched for an optimal result. When K is searched, FKNN-DPC improves significantly in terms of F-measure. Nevertheless, LC-CFSFDP still outperforms FKNN-DPC with 12 wins, 1 draw and 5 losses.

Table 7 Comparison of LC-CFSFDP and FKNN-DPC in terms of F-measures

6.3 Runtime

The time complexities for all methods are \(O(N^2)\) in terms of dataset size N. However, for those involving K-nearest-neighbour search, the time complexities are provided in terms of N and K. Table 8 gives the runtimes of all methods on all datasets. SNN-CFSFDP runs at least an order of magnitude slower than the others.

Table 8 Runtime in seconds

6.4 K sensitivity test

A K sensitivity test is shown in Fig. 7. In this experiment, the K parameter for the K-nearest-neighbours search used in LC-CFSFDP, SNN-CFSFDP and FKNN-DPC is set to different values ranging from 5 to 80, and the corresponding best F-measure is recorded for each value. Three datasets with low, medium and high dimensionalities are used for the test. In all three cases, LC-CFSFDP exhibits more stable clustering performance than SNN-CFSFDP and FKNN-DPC as K changes.

Fig. 7

K sensitivity test on 3 datasets of different dimensionality: aggregation has 2 attributes; segment has 19; and libras has 90. LC-CFSFDP demonstrates better stability than SNN-CFSFDP and FKNN-DPC as K changes

7 Discussion

Local Contrast can be applied with any density estimator; it is not limited to the \(\epsilon \)-neighbourhood density estimator employed in CFSFDP (Rodriguez and Laio 2014). For example, Local Contrast can be applied to DENCLUE (Hinneburg and Gabriel 2007), which employs a kernel density estimator in its operation.
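The independence from the density estimator can be illustrated with the sketch below. It assumes that LC of a point counts how many of its K nearest neighbours have a lower density than the point itself, so that the maximum value is K; the authoritative definition is the one in Sect. 4, and the function names, the Gaussian kernel and the bandwidth `h` are illustrative choices of this sketch.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def eps_density(points, eps):
    """Epsilon-neighbourhood density: number of points within distance eps."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]

def kernel_density(points, h):
    """Unnormalised Gaussian kernel density with bandwidth h."""
    return [sum(math.exp(-math.dist(p, q) ** 2 / (2 * h * h)) for q in points)
            for p in points]

def local_contrast(points, density, k):
    """Assumed form of LC: count of K nearest neighbours with lower density."""
    return [sum(1 for j in knn_indices(points, i, k) if density[i] > density[j])
            for i in range(len(points))]

# Toy 1-D data (illustrative only): a dense group near 0 and a sparse group near 6.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (6.0,), (7.0,)]
dens = kernel_density(points, h=1.0)
lc = local_contrast(points, dens, k=2)
```

In this toy example the two group centres have clearly different kernel densities but both reach the maximum LC value of K = 2. Any other list of per-point densities, such as the output of `eps_density`, can be passed to `local_contrast` in the same way.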

We chose CFSFDP over other density-based methods such as DBSCAN as the base algorithm because the former is a more advanced method. This is confirmed by comparing DBSCAN with CFSFDP in clustering the 18 datasets. The result is provided in Appendix C, which shows that CFSFDP outperforms DBSCAN on all but one dataset. To be fair and complete, we also compare LC-CFSFDP to the original SNN and ReScale approaches. The result is provided in Appendix D, which shows that LC-CFSFDP outperforms both methods.

Choosing the parameter K in KNN-based methods is usually time-consuming because these methods are often sensitive to K. However, we have shown that LC is not as sensitive to K as SNN or FKNN-DPC. As a rule of thumb, setting \(K=\sqrt{N}\) has been empirically verified to be effective for LC.

As to the choice of parameter M (the number of clusters), both LC-CFSFDP and the original CFSFDP have the same requirement. For a specific dataset, the proper choice of M is a user decision that could be made based on domain knowledge, visual inspection, or other means. In our experiments, M is simply searched to show the best capability of each method.

The original CFSFDP does not explicitly identify any data point as noise. Instead, after the clustering procedure, it takes an extra step to produce cluster halos, which can be considered as noise. In our experiments, no noise points are produced because all variants of CFSFDP, as well as FKNN-DPC, are able to cluster the whole dataset without producing any noise. However, when handling noisy datasets, LC-CFSFDP can also produce cluster halos in the same way as CFSFDP.

Grid-based clustering approaches partition the space into a number of cells and use the cell density to identify clusters. For example, GRIDCLUS (Schikuta 1996) and NSGC (Ma and Chow 2004) rely on the cell density to identify core cells and link neighbouring core cells together to form clusters. Instead of the current point-based definition, Local Contrast could possibly be redefined using the densities of neighbouring cells and then employed in these algorithms to improve their performance.

Another possible application of LC is density-based subspace clustering, such as SUBCLU (Kailing et al. 2004) and DUSC (Assent et al. 2007). These methods use a density threshold to differentiate between cluster points and noise in different subspaces. Density is dimensionality-biased: when estimated using distance-based density estimators, the densities of a data cloud tend to be lower in higher-dimensional spaces. Hence, these methods suffer from density variation across subspaces with different dimensionalities: low thresholds detect high-dimensional clusters but have difficulty filtering out noise in low-dimensional subspaces, while high thresholds screen out noise well in low-dimensional subspaces but tend to overlook high-dimensional clusters (Zimek and Vreeken 2015). LC can possibly be an effective remedy for this issue in subspace clustering since LC is not dimensionality-biased.

However, not all density-based methods can utilise Local Contrast readily because some do not employ density directly in their operations. For example, instead of density, OPTICS (Ankerst et al. 1999) employs “core distance” and “reachability distance” to rank points in order to identify clusters. The “reachability distance” reflects the density such that points with a lower density normally have higher “reachability distance”. It is interesting to explore whether Local Contrast can be redefined using these distances rather than density.

8 Conclusions

In this paper, we identify the root cause of CFSFDP’s failure to detect all clusters in a dataset having hugely varying densities. This is the first work, as far as we know, that overcomes CFSFDP’s weakness from its root cause.

We make the following three contributions:

First, we formalise a necessary condition for CFSFDP to correctly identify all clusters. We show that a violation of this condition leads to poor clustering performance. This explains the reason why a density-based clustering algorithm such as CFSFDP is unable to correctly identify all clusters in datasets having large density variations.

Second, we propose a new measure called Local Contrast, as an alternative to density, to improve the capability of density-based clustering methods to detect clusters of hugely different densities in a dataset. We show that it has two unique properties that are critical in improving this capability: all cluster centres in the Local Contrast distribution have the same constant value, as do all local minima of Local Contrast that correspond to the local minima of the density distribution, regardless of the densities of these cluster centres and local minima. We show that these properties make density-based algorithms much more robust in the presence of large density variations.

Third, by incorporating Local Contrast into CFSFDP, we create a powerful method LC-CFSFDP which has much better detecting power than the original method. Our empirical evaluation shows that LC-CFSFDP is the best performer compared to two state-of-the-art methods, SNN and ReScale, as well as FKNN-DPC which is a recent improvement of CFSFDP.