
1 Introduction

Clustering plays a fundamental role in many areas of data engineering, pattern recognition, data mining, data quantization and image analysis [5, 6, 15]. Some of the most important clustering algorithms are based on density estimation.

Finite mixture models have been widely used for probabilistic model construction for univariate and multivariate data. Their capability of representing arbitrarily complex probability density functions (pdfs) gives them many applications not only in unsupervised learning [8], but also in (Bayesian) supervised learning and in parameter estimation of class-conditional pdfs [4].

One of the most important clustering methods is based on Gaussian Mixture Models (GMMs) and uses the Expectation Maximization (EM) algorithm. Unfortunately, GMMs have strong limitations related to the optimization procedure, which has to be applied in each iteration of the EM algorithm. While the expectation step is relatively simple, the maximization step usually requires complicated numerical optimization [2, 9]. Because of its greedy nature, the EM algorithm is sensitive to the initial configuration and usually gets stuck in local maxima. Moreover, there is the problem of choosing the correct number of clusters.

A feasible way of addressing this problem is to choose several sets of initial values, run the EM algorithm from each of them, and finally choose the best outcome as the estimate. In most cases, the Bayesian information criterion (BIC) is used to select the best result and the final number of clusters. However, this considerably increases the computational cost, since the method has to be applied many times with different initial parameters.
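As a point of reference, the restart-plus-BIC procedure described above can be sketched as follows. This is only an illustration of the standard practice, not the method studied in this paper; it assumes the scikit-learn library, and all function names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(X, k_range=range(1, 11), n_restarts=5, seed=0):
    """Fit GMMs with EM for several cluster counts and restarts; keep the lowest-BIC model."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, n_init=n_restarts, random_state=seed).fit(X)
        bic = gmm.bic(X)           # lower BIC = better trade-off between fit and complexity
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, best_bic
```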

Many different methods have been introduced to address this problem. In [14] the authors proposed a split-and-merge EM (SMEM) algorithm in which a split-and-merge operation is applied to the EM algorithm. The basic idea of SMEM is the following: after the convergence of the usual EM algorithm, a split-and-merge operation is first used to update some of the parameters, then the next round of the usual EM algorithm is performed, and the split-and-merge operation and the EM algorithm are iterated alternately until some criterion is met. However, the split or merge method is a linear heuristic procedure without theoretical support. Moreover, the split or merge of the mean vector is independent of the covariance matrix, and vice versa.

In [17] the authors propose two split methods based on the SVD and the Cholesky decomposition of the covariance matrices. Shoham [11] presented a robust clustering algorithm, a deterministic agglomeration EM (DAGEM) with multivariate t-distributions. It was derived from the DAEM algorithm and achieved encouraging performance; however, because the initial number of components is much larger than the true one, its computational load is one to two orders of magnitude heavier than that of EM [11]. In [16] the authors present Competitive EM (CEM), which uses an information-theoretic criterion for the split and merge operations; the initial number of components and the model parameters can be set arbitrarily, and the split and merge operations can be selected efficiently. In [7] the authors present a method that uses two different split and merge criteria. A homogeneity criterion decides whether two clusters should be merged (clusters which touch each other or slightly overlap can be merged if they fulfill the homogeneity criterion). The split criterion is based on a penalized Bayesian information criterion (BIC), evaluated for the actual clusters and for hypothetically split clusters, updated in the previous incremental learning step.

In all of these methods the basic idea is to construct a split-and-merge strategy by analyzing cluster shapes, so as to escape local minima by applying some unconventional operation. The main problem is that such an operation does not depend on the cost function that is minimized by the EM algorithm. Moreover, after a split or merge operation it is nontrivial to update the parameters of the components, since each point belongs to every cluster with a different probability.

In this paper we present a split and merge strategy which solves these two basic problems. First of all, a simpler optimization procedure, Cross Entropy Clustering (CEC) [13], is used instead of EM. The goal of CEC is to optimally approximate the scatter of a data set \(X \subset \mathbb {R}^d\) with respect to a cost function which is a small modification of the one used by EM (for more information see Sect. 2). It turns out that at the small cost of having a minimally worse density approximation [13], we gain an efficient method which can be easily adapted to more complicated density models. Moreover, we can treat clusters separately, which allows us to update the parameters of the clusters more easily (each point belongs to only one group).

Furthermore, we can treat each cluster as a new dataset, separate from the other data, and apply the CEC algorithm within that single cluster. The new division of the cluster is accepted if it lowers the global value of the cost function. Similarly, we can verify whether merging two clusters decreases the cost function.

Let us discuss the contents of the paper. In the first part of our work we briefly describe the CEC algorithm together with the basic structures which we use in model construction. Then we describe the split-and-merge strategy in detail. At the end of the paper we present the results of numerical experiments and conclusions.

2 Split-and-merge Cross Entropy Clustering

In this section the Split-and-merge Cross Entropy Clustering method will be presented. Our method is based on the CEC approach, so we start with a short introduction to it. Since CEC is similar to EM in many aspects, let us first recall that, in general, EM aims to find \(p_1,\ldots ,p_k \ge 0\) (\(\sum \limits _{i=1}^k p_i=1\)) and Gaussian densities \(f_1,\ldots , f_k\) (where k is given beforehand and denotes the number of densities whose convex combination builds the desired density model) such that the convex combination \( f=p_1 f_1 +\ldots + p_k f_k \) optimally approximates the scatter of our data X with respect to the MLE cost function

$$\begin{aligned} \mathrm {MLE}(f,X)=-\sum _{x \in X} \ln (p_1 f_1(x) +\ldots +p_k f_k(x)). \end{aligned}$$
(1)

The goal of CEC is to minimize a cost function which is a minor modification of the one given in (1), obtained by substituting the sum with the maximum:

$$\begin{aligned} \mathrm {CEC}(f,X)=-\sum \limits _{x \in X} \ln (\max (p_1 f_1(x),\ldots ,p_k f_k(x))). \end{aligned}$$
(2)

Instead of focusing on density estimation as its main task, CEC aims at solving the clustering problem directly. As it turns out, at the small cost of having a minimally worse density approximation [13], we gain speed of implementation and the ease of using less complicated density models. This is an advantage, roughly speaking, because the models do not mix with each other, since we take the maximum instead of the sum.

To explain cross entropy clustering (CEC), we first need to introduce the energy function to be minimized, which is based on cross entropy. Let us start with the definition of cross entropy itself.

By the cross-entropy of the dataset \(X\subset \mathbb {R}^d\) with respect to density f we understand

$$ H^{\times }\!\big (X \Vert f \big )=-\frac{1}{|X|}\sum _{x\in X} \ln f(x). $$

Cross entropy corresponds to the theoretical code-length of compression. Let us consider the case of partitioning \(X\subset \mathbb {R}^N\) into pairwise disjoint sets \(X_1,\ldots ,X_k\), such that the elements of \(X_i\) are encoded by the optimal density from a family \(\mathcal {F}\). In this case, the cross-entropy with respect to the family of coding densities \(\mathcal {F}\) is given by \( H^{\times }\!\big (X \Vert \mathcal {F} \big )=\inf _{f\in \mathcal {F}}H^{\times }\!\big (X \Vert f \big ). \) Thus, the mean code-length of a randomly chosen element x equals

$$\begin{aligned} E(X_1,\ldots ,X_k;\mathcal {F}):=\sum _{i=1}^k p_i \cdot (-\ln (p_i)+H^{\times }\!\big (X_i \Vert \mathcal {F} \big )), \end{aligned}$$
(3)

where \(p_i=\frac{|X_i|}{|X|}\).

The aim of CEC is to find a partitioning of \(X\subset \mathbb {R}^N\) into pairwise disjoint sets \(X_i\), \(i=1,\ldots ,k\), which minimizes the function given by (3). The minimization of (3) is equivalent to the optimization of (2). In our case we take as \(\mathcal {F}\) the family \(\mathcal {G}\) of all Gaussian distributions. According to [13], for a single cluster \(X_i\subset \mathbb {R}^N\) considered with respect to \(\mathcal {N}(\mu ,\varSigma ) \in \mathcal {G}\) we have

$$\begin{aligned} H^{\times }\!\big (X_i \Vert \mathcal {N}(\mu ,\varSigma ) \big )=\frac{N}{2} \ln (2\pi )+\frac{1}{2}\mathrm {tr}(\varSigma ^{-1}\varSigma _{X_i})+\frac{1}{2}\ln \det (\varSigma ), \end{aligned}$$
(4)

where \(\varSigma _{X_i}\) is the covariance matrix of \(X_i\); this allows us to easily calculate the cost function (3).
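For concreteness, the cost function (3) for the Gaussian family can be computed as sketched below, using (4) with the optimal in-cluster parameters (the mean and covariance of \(X_i\)) plugged in, which gives \(H^{\times}\!\big (X_i \Vert \mathcal {G}\big )=\frac{N}{2}\ln (2\pi e)+\frac{1}{2}\ln \det (\varSigma _{X_i})\) since \(\mathrm{tr}(\varSigma _{X_i}^{-1}\varSigma _{X_i})=N\). The sketch assumes numpy; the function names are ours.

```python
import numpy as np

def gaussian_cross_entropy(X_i):
    """H×(X_i‖G) from Eq. (4) with the MLE parameters of X_i plugged in; X_i is an (n, N) array."""
    N = X_i.shape[1]
    cov = np.atleast_2d(np.cov(X_i, rowvar=False, bias=True))   # Σ_{X_i}, normalised by |X_i|
    return 0.5 * N * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(cov))

def cec_energy(clusters):
    """E(X_1,…,X_k; G) from Eq. (3): Σ_i p_i · (−ln p_i + H×(X_i‖G))."""
    n_total = sum(len(X_i) for X_i in clusters)
    return sum(
        (len(X_i) / n_total) * (-np.log(len(X_i) / n_total) + gaussian_cross_entropy(X_i))
        for X_i in clusters
    )
```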

Let us now briefly describe the algorithm step by step. The CEC clustering method starts from an initial clustering, which can be obtained randomly or with the k-means++ [1] approach. Then the following two simple steps are applied alternately. First, we estimate the parameters of the optimal Gaussian density in each cluster. In the second step, we construct a new division of X by assigning each point to the closest cluster, or rather, to the closest Gaussian density. Specifically, we assign a point \(x \in X\) to the cluster \(i\in \{1,\ldots ,k\}\) for which

$$- \ln (p_i) - \ln \big ( \mathcal {N}(x ; \mu _{i},\varSigma _{i}) \big )$$

is minimal.

We apply the above steps iteratively until the change of the cost function is smaller than a predefined threshold or the clusters do not change at all.
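A rough sketch of this loop is given below (hard assignment to the cheapest cluster, then per-cluster re-estimation). It assumes numpy and scipy, that the initial labels come e.g. from k-means++, and that no cluster becomes empty during the run (a full CEC implementation removes clusters whose probability drops too low); all names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cec_iterations(X, labels, k, max_iter=100, tol=1e-6):
    prev_energy = np.inf
    for _ in range(max_iter):
        # Step 1: optimal Gaussian and weight for each cluster
        params = []
        for i in range(k):
            X_i = X[labels == i]
            p_i = len(X_i) / len(X)
            mu_i = X_i.mean(axis=0)
            cov_i = np.atleast_2d(np.cov(X_i, rowvar=False, bias=True))
            params.append((p_i, mu_i, cov_i))
        # Step 2: assign each x to the cluster minimising −ln p_i − ln N(x; μ_i, Σ_i)
        scores = np.column_stack([
            -np.log(p) - multivariate_normal.logpdf(X, mean=mu, cov=cov)
            for p, mu, cov in params
        ])
        labels = scores.argmin(axis=1)
        energy = scores.min(axis=1).mean()   # cost (3) of the new partition, current parameters
        if prev_energy - energy < tol:       # stop when the cost no longer decreases
            break
        prev_energy = energy
    return labels, params
```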

This approach suffers from the problem of local minima. Therefore, we apply a two-part strategy to improve the performance of the algorithm: a split and merge strategy.

2.1 Merge Strategies

Let us consider the family of Gaussian densities \(\mathcal {G}\). For two disjoint sets X and Y (\(X,Y\subset \mathbb {R}^N\)), we want to derive a condition under which we should combine them into one cluster rather than consider them separately – namely, when the energy/cost of \(X\cup Y\) is less than the sum of the energies of X and Y. This condition is given by

$$\begin{aligned} E(X\cup Y,\mathcal {G})\le E(X,\mathcal {G})+E(Y,\mathcal {G}). \end{aligned}$$
(5)

For the general case, it is very difficult to solve the above inequality and give an analytical solution. Thus, to simplify the problem, it is necessary to put some restrictions on the sets X and Y. In this section we solve it in one-dimensional real space under certain constraints on the sets X and Y. Before that, let us recall the following remark, which simplifies our situation.

Remark 1

Let X, Y be given as finite subsets of \(\mathbb {R}^N\). Assume additionally that \(X\cap Y=\emptyset \). Then

$$\begin{aligned} \mathrm {m}_{X\cup Y}= & {} p_X \mathrm {m}_{X} + p_Y \mathrm {m}_{Y}\\ \varSigma _{X\cup Y}= & {} p_X\varSigma _{X}+p_Y \varSigma _{Y}+p_Xp_Y(\mathrm {m}_{X}-\mathrm {m}_{Y})(\mathrm {m}_{X}-\mathrm {m}_{Y})^T \end{aligned}$$

where \(p_X=\frac{|X|}{|X|+|Y|}, p_Y=\frac{|Y|}{|X|+|Y|}.\)
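As a quick numerical sanity check (our own illustration, assuming numpy), Remark 1 lets us obtain the pooled mean and covariance of \(X\cup Y\) from the per-set statistics alone:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                 # two disjoint samples
Y = rng.normal(loc=3.0, size=(200, 2))

pX, pY = len(X) / (len(X) + len(Y)), len(Y) / (len(X) + len(Y))
mX, mY = X.mean(axis=0), Y.mean(axis=0)
SX = np.cov(X, rowvar=False, bias=True)       # MLE covariances (normalised by the set size)
SY = np.cov(Y, rowvar=False, bias=True)

m_union = pX * mX + pY * mY
S_union = pX * SX + pY * SY + pX * pY * np.outer(mX - mY, mX - mY)

Z = np.vstack([X, Y])
assert np.allclose(m_union, Z.mean(axis=0))
assert np.allclose(S_union, np.cov(Z, rowvar=False, bias=True))
```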

Theorem 1

Let X, Y be given as finite subsets of \(\mathbb {R}\). Assume additionally that

  • \(X\cap Y=\emptyset \),

  • \(|X| = |Y|\),

  • \(\varSigma _{X}=\varSigma _{Y}=\varSigma =(\sigma ^2)\) for an arbitrary \(\sigma > 0\).

Then the distributions \(\mathcal {N}(\mathrm {m}_X, \varSigma )\) and \(\mathcal {N}(\mathrm {m}_Y,\varSigma )\) should be combined into one Gaussian distribution \(\mathcal {N}(\frac{1}{2} (\mathrm {m}_X+\mathrm {m}_Y),(\sigma ^2+\frac{1}{4}(\mathrm {m}_X-\mathrm {m}_Y)^2))\) with respect to condition (5) iff

$$ |\mathrm {m}_X-\mathrm {m}_Y| < 2\sqrt{3}\sigma , $$

where \(\mathrm {m}_X,\mathrm {m}_Y\) denote the mean values of the sets X, Y.

Proof

Let us consider Eq. (5) under the assumptions of our theorem. This leads us to

$$\begin{aligned} -\ln (1)+H^{\times }\!\big ({X\cup Y} \Vert \mathcal {G} \big )\le \frac{1}{2} \cdot (-\ln (\frac{1}{2})+H^{\times }\!\big ({X} \Vert \mathcal {G} \big )) + \frac{1}{2} \cdot (-\ln (\frac{1}{2})+H^{\times }\!\big ({Y} \Vert \mathcal {G} \big )). \end{aligned}$$

By Eq. (4) we get

$$\begin{aligned} \frac{N}{2}\ln (2\pi e)+\frac{1}{2}\ln \det (\varSigma _{X \cup Y})\le&\frac{1}{2}\bigg (\ln 2 + \frac{N}{2}\ln (2\pi e)+\frac{1}{2}\ln \det (\varSigma )\bigg ) \\ {}&+ \frac{1}{2}\bigg (\ln 2 +\frac{N}{2}\ln (2\pi e)+\frac{1}{2}\ln \det (\varSigma )\bigg ), \end{aligned}$$

and consequently \( \ln \det (\varSigma _{X \cup Y})\le \ln (4\det (\varSigma )). \) Finally, by Remark 1 we obtain

$$\begin{aligned} \det (\varSigma _{X \cup Y})\le 4\det (\varSigma ), \end{aligned}$$
(6)

where \(\varSigma _{X \cup Y} = \varSigma +\frac{1}{4}(\mathrm {m}_X-\mathrm {m}_Y)(\mathrm {m}_X-\mathrm {m}_Y)^T.\)

In the one-dimensional case, Eq. (6) simplifies to

$$ \sigma ^2 + \frac{1}{4} (\mathrm {m}_X-\mathrm {m}_Y)^2 \le 4 \sigma ^2 $$

equivalently \((\mathrm {m}_X-\mathrm {m}_Y)^2 \le 12 \sigma ^2\), i.e. \(|\mathrm {m}_X-\mathrm {m}_Y| \le 2\sqrt{3}\,\sigma \), which ends the proof.

Figure 1 presents a simple illustration of the above theorem in the case of \(\sigma =1\).

Fig. 1. Comparison of density descriptions of two sets. The sum of the two densities \(\mathcal {N}(0,1)\) (the blue one) and \(\mathcal {N}(2\sqrt{3},1)\) (the green one) describes two separated clusters, while the single density \(\mathcal {N}(\sqrt{3}, 4)\) (the red one) gives the best description of their union. (Color figure online)

It needs to be highlighted that even in one-dimensional space our considerations required strong constraints, which shows how hard the task is in the general case.

In our approach to the merge problem we will always check condition (5) directly. However, we will use Remark 1, according to which we do not need to recalculate the covariance matrix of a cluster combination from the data, which simplifies and speeds up the calculations.
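A minimal sketch of this merge test is given below (assuming numpy; the helper names are ours, not the paper's code). The per-cluster statistics – size, mean and covariance – are combined via Remark 1, and the weights are taken relative to \(X\cup Y\), as in the proof of Theorem 1; one can check that this gives the same accept/reject decision as comparing the full cost (3) before and after the merge.

```python
import numpy as np

def part_energy(p, cov, N):
    """Contribution p · (−ln p + H×(·‖G)) of one cluster, with the optimal Gaussian in Eq. (4)."""
    return p * (-np.log(p) + 0.5 * N * np.log(2 * np.pi * np.e)
                + 0.5 * np.log(np.linalg.det(np.atleast_2d(cov))))

def should_merge(nX, mX, SX, nY, mY, SY):
    """True iff condition (5) prefers the single cluster X ∪ Y over keeping X and Y apart."""
    N = np.atleast_1d(mX).shape[0]
    pX, pY = nX / (nX + nY), nY / (nX + nY)
    d = np.atleast_1d(mX) - np.atleast_1d(mY)
    S_union = pX * np.atleast_2d(SX) + pY * np.atleast_2d(SY) + pX * pY * np.outer(d, d)
    merged = part_energy(1.0, S_union, N)                      # −ln(1) = 0
    separate = part_energy(pX, SX, N) + part_energy(pY, SY, N)
    return merged <= separate

# In the setting of Theorem 1 (1-D, |X| = |Y|, equal variance σ² = 1) the test reduces to
# |m_X − m_Y| < 2√3 ≈ 3.46, so the two clusters below should be merged:
print(should_merge(100, np.array([0.0]), np.array([[1.0]]),
                   100, np.array([3.0]), np.array([[1.0]])))   # True
```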

2.2 Split Strategies

The decision about splitting a given cluster \(X\subset \mathbb {R}^N\) into two parts \(X_1,X_2\) (\(X_1\cup X_2=X\), \(X_1\cap X_2=\emptyset \)) is made in our case as follows:

  • run CEC clustering with two clusters for set X,

  • for the obtained clusters \(X_1\) and \(X_2\), if \(E(X,\mathcal {G})\ge E(X_1,\mathcal {G})+E(X_2,\mathcal {G})\), then replace cluster X with \(X_1\) and \(X_2\).

The approach, in this case, is similar to the merge strategy, since we also want to decrease energy. Figure 2 presents division steps for a sample set from CEC clustering.
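The split decision described above can be sketched as follows (our own sketch, not the paper's code): run_cec stands for any routine returning CEC labels for X with two clusters (e.g. the loop sketched earlier in Sect. 2), and cec_energy is the cost-(3) helper sketched before; both names are ours.

```python
def try_split(X, cec_energy, run_cec):
    """Replace cluster X by X1, X2 only if the split does not increase the energy (3)."""
    labels = run_cec(X, k=2)
    X1, X2 = X[labels == 0], X[labels == 1]
    if len(X1) == 0 or len(X2) == 0:
        return [X]                                   # degenerate split: keep the cluster
    if cec_energy([X1, X2]) <= cec_energy([X]):      # E(X1,G)+E(X2,G) ≤ E(X,G): accept
        return [X1, X2]
    return [X]
```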

Fig. 2. Split strategy. An initial clustering (with one cluster) is divided into as many clusters as needed using the split strategy. The numbers below the images give the successive waves of divisions; they do not correspond to the number of clusters, since multiple divisions can happen simultaneously.

Table 1. Comparison of different CEC split and merge strategies with GMM clustering.
Table 2. Effectiveness of clustering for the datasets from Fig. 1. The table presents a comparison of results for various criteria, with the best strategies marked.
Table 3. Adjusted Rand index (the higher the better) for datasets 1 and 4 from Fig. 1 under different strategies.

3 Proposed Strategies and Experiments

We develop a few strategies to check when the split and merge steps should be applied during CEC clustering (a schematic driver illustrating the schedules is sketched after the list). We denote them as follows:

  • CEC-FAF – the merge and split are performed when a CEC iteration did not reduce energy, the merge step is performed as many times as possible;

  • CEC-FSF – the merge and split are performed when a CEC iteration did not reduce energy, the merge is performed only once at a time;

  • CEC-P3AF – the merge is performed every third iteration as many times as possible, the split is performed if a CEC iteration (including the merge) did not reduce the energy;

  • CEC-P1SP1 – the merge and split are performed in every iteration, the merge is performed once at most in each iteration.
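The schedules above can be summarised, very roughly, by the skeleton below. It is only our own schematic encoding of when the split and merge steps fire: cec_step, merge_step and split_step stand for the routines sketched in Sect. 2, and the parameter combinations in the final comment are our approximate mapping to the strategy names, not identifiers from the paper.

```python
def split_merge_schedule(cec_step, merge_step, split_step,
                         merge_period=None, split_period=None,
                         merge_all=False, max_iter=200, tol=1e-6):
    """A period of None means 'only when a CEC iteration stops reducing the energy'."""
    prev = float("inf")
    for it in range(1, max_iter + 1):
        energy = cec_step()               # one plain CEC iteration, returning the current cost (3)
        stalled = prev - energy < tol
        if (merge_period and it % merge_period == 0) or (merge_period is None and stalled):
            merge_step(repeat=merge_all)  # merge cluster pairs that satisfy condition (5)
        if (split_period and it % split_period == 0) or (split_period is None and stalled):
            split_step()                  # try the split test of Sect. 2.2 on each cluster
        prev = energy
    # Approximate mapping: CEC-FAF ≈ (merge_period=None, split_period=None, merge_all=True),
    # CEC-FSF ≈ (None, None, False), CEC-P3AF ≈ (3, None, True), CEC-P1SP1 ≈ (1, 1, False).
```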

We compare them with

  • CEC-Grid – we apply CEC clustering with a starting number of clusters from 2 to 70 and then we choose the best result according to the lowest Bayesian information criterion (BIC);

  • GMM-Grid – we used the R package mclust [3], which performs a grid search for the optimal number of mixture components (clusters) according to the BIC criterion. The considered range of the number of clusters was the same as in CEC-Grid.

Fig. 3. Energy during clustering iterations under different strategies. All split and merge strategies were started with a single cluster.

Table 4. Results of the algorithms on data from the UCI repository.

3.1 Split-and-merge Strategies Effectiveness

Illustrations: Table 1 presents the results of clustering the sample sets with the strategies listed above. The results of all strategies are very similar, except for GMM-Grid.

Accuracy: Table 2 compares the results obtained with the different strategies. We report the calculated MLE (maximum likelihood estimation – the higher the better), BIC (Bayesian information criterion – the lower the better), AIC (Akaike information criterion – the lower the better), and the values of the energy function obtained by the CEC clusterings (the lower the better). In this case the results favor the CEC-Grid and CEC-P1SP1 strategies as the best ones, although the results are usually very close (except for GMM-Grid).

Table 3 presents the adjusted Rand index for the different strategies. In this case all strategies gave the same result on their best runs.

Energy: Figure 3 compares the energy function during clustering under the different strategies. The results show that CEC-Grid and CEC-P1SP1 reach the minimum in the smallest number of steps.

Speed: While the above could suggest that the CEC-P1SP1 strategy was the fastest, note that because it performs both split and merge operations in each iteration, its iterations are computationally more expensive. A direct time measurement of the strategies revealed CEC-FSF to be the fastest, followed closely by CEC-P3AF.

3.2 Split-and-merge CEC in Higher Dimensions

Table 4 presents the results obtained with the same strategies and algorithms on higher dimensional data. In this experiment there was much more variation, both between the strategies and between different runs of the same strategy. It appears that the algorithm suffers from the curse of dimensionality and has trouble finding the right number of clusters in higher dimensional data. This is most evident on the iris dataset, where it overshoots that number tenfold.

4 Conclusions and Future Work

We have proposed an approach to the split and merge problem in CEC clustering. It was shown that even in one-dimensional space the merge condition is not easy to assess analytically. Thus, in the proposed approach, we check condition (5) directly, using Remark 1, to decide whether two clusters should be combined. Our experiments show that the strategy in which the merge and split are performed in every iteration, with the merge performed at most once per iteration, can be competitive with classical CEC or even CEC-Grid.

Our future work will focus on developing new measures which will allow us to solve the split problem using information theory.