Efficient mixture model for clustering of sparse high dimensional binary data
Abstract
Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in various domains such as text mining or cheminformatics. The application of classical clustering tools to this type of data usually proves to be very inefficient, both in terms of computational complexity and the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering of sparse high dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix: is specially designed for the processing of sparse data; can be efficiently realized by an on-line Hartigan optimization algorithm; describes every cluster by its most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with higher compatibility with the reference grouping than related methods. Moreover, the constructed representatives often better reveal the internal structure of the data.
Keywords
Bernoulli mixture model · Entropy clustering · Data compression · High-dimensional binary data

1 Introduction
Clustering, one of the fundamental tasks of machine learning, relies on grouping similar data points together while keeping dissimilar ones separate. It is hard to overestimate the role of clustering in present data analysis. Cluster analysis has been widely used in text mining to collect similar documents together (Baker and McCallum 1998); in biology to build groups of genes with related expression patterns (Fränti et al. 2003); in social networks to detect clusters of communities (Papadopoulos et al. 2012) and many other branches of science.
While most of the clustering methods are designed for continuous data, binary attributes become very common in various domains. In text mining, sentences (or documents) are often represented with the use of the set-of-words representation (Ying Zhao 2005; Guansong Pang 2015; Juan and Vidal 2002). In this case, a document is identified with a binary vector, where bit 1 at the i-th coordinate indicates that the i-th word from a given dictionary appears in this document.^{1} In cheminformatics, a chemical compound is also described by a binary vector, where every bit denotes the presence or absence of a predefined chemical pattern (Ewing et al. 2006; Klekota and Roth 2008). Since the number of possible patterns (words) is large, the resulting representation is high dimensional. Another characteristic of a typical binary representation is its sparsity. In text processing the number of words contained in a sentence is relatively low compared to the total number of words in a dictionary. In consequence, most coordinates are occupied by zeros, while only selected ones are nonzero.
On one hand, the aforementioned sparse high dimensional binary representation is very intuitive for the user, which is also important for interpretable machine learning (Ribeiro et al. 2016). On the other hand, it is extremely hard to handle for classical clustering algorithms. One reason is commonly known as the curse of dimensionality: since in high dimensional spaces randomly chosen pairs of points have roughly similar Euclidean distances, this metric is not a proper measure of dissimilarity (Steinbach et al. 2004; Mali and Mitra 2003). Moreover, the direct application of standard clustering tools to this type of data may lead to a substantial increase in computational cost, which disqualifies a method from practical use. Subspace clustering, which aims at approximating a data set by a union of low-dimensional subspaces, is a typical approach for handling high dimensional data (Li et al. 2018; Lu et al. 2012; You et al. 2016; Struski et al. 2017; Rahmani and Atia 2017). Nevertheless, these techniques have problems with binary data, because they also use Euclidean distances in their criterion functions. Moreover, they are unable to process extremely high dimensional data due to their non-linear computational complexity with respect to the data dimension.^{2} In consequence, clustering algorithms have to be designed directly for this type of data to obtain satisfactory results in reasonable time.
In this paper we introduce a version of model-based clustering, SparseMix, which efficiently processes high dimensional sparse binary data.^{3} Our model splits the data into groups and fits a probability distribution to each group (see Sect. 3). The points are assigned in a way that minimizes the entropy of these distributions. In contrast to classical mixture models using Bernoulli variables or latent trait models, our approach is designed for sparse data and can be efficiently optimized by an on-line Hartigan algorithm, which converges faster and finds better solutions than batch procedures like EM (Sect. 4).
The paper contains extensive experiments performed on various types of data, including text corpora, chemical and biological data sets, as well as the MNIST image database, see Sect. 5. The results show that SparseMix provides higher compatibility with the reference partition than existing methods based on mixture models and similarity measures. Visual inspection of the cluster representatives obtained by SparseMix on the MNIST data set suggests the high quality of its results, see Fig. 1. Moreover, its running time is significantly lower than that of related model-based algorithms and comparable to the methods implemented in the Cluto package, which are optimized for processing large data sets, see Sect. 5.3.
- 1.
We introduce a cluster representative, which provides a sparser representation of a cluster’s instances and describes the most common patterns of the cluster.
- 2.
We analyze the impact of the cluster representative on fitting a Bernoulli model to the cluster distribution (Theorem 1) and illustrate exemplary representatives for the MNIST data set.
- 3.
An additional trade-off parameter \(\beta \) is introduced to the cost function, which weights the importance of model complexity. Its selection allows us to decide whether the model should aim to reduce the number of clusters.
- 4.
We extend a theoretical analysis of the model by inspecting various conditions influencing clusters’ reduction, see Example 1. Moreover, we verify the clustering reduction problem in the experimental study, see Sect. 5.2.
- 5.
An efficient implementation of the algorithm is discussed in detail, see Sect. 4.
- 6.
We consider larger data sets (e.g. MNIST and Reuters), compare our method with a wider spectrum of clustering algorithms (maximum likelihood approaches, subspace clustering and dimensionality reduction techniques) and verify the stability of the algorithms.
2 Related work
In this section we review typical approaches for clustering binary data, including distance-based and model-based techniques. Moreover, we discuss regularization methods, which allow selecting an optimal number of clusters.
2.1 Distance-based clustering
A lot of clustering methods are expressed in a geometric framework, where the similarity between objects is defined with the use of the Euclidean metric, e.g. k-means (MacQueen et al. 1967). Although a geometric description of clustering can be insightful for continuous data, it becomes less informative in the case of high dimensional binary (or discrete) vectors, where the Euclidean distance is not natural.
To adapt these approaches to binary data sets, the authors consider, for instance, k-medoids or k-modes (Huang 1998; Chan et al. 2004) with a dissimilarity measure designed for this special type of data, such as the Hamming, Jaccard or Tanimoto measures (Li 2005). An evaluation of possible dissimilarity metrics for categorical data can be found in dos Santos and Zárate (2015), Bai et al. (2011). To obtain a more flexible structure of clusters, one can also use hierarchical methods (Zhao and Karypis 2002), density-based clustering (Wen et al. 2002) or model-based techniques (Spurek 2017; Spurek et al. 2017). One of the important publicly available tools for efficient clustering of high dimensional binary data is the Cluto package (Karypis 2002). It is built on a sophisticated multi-level graph partitioning engine and offers many different criteria that can be used to derive partitional, hierarchical and spectral clustering algorithms.
Another straightforward way of clustering high dimensional data relies on reducing the dimension of the data in a preprocessing step. This allows transforming the data into a continuous low dimensional space, where typical clustering methods can be applied efficiently. Such dimensionality reduction can be performed with linear methods, such as PCA (principal component analysis) (Indhumathi and Sathiyabama 2010), or non-linear techniques, such as deep autoencoders (Jang et al. 2017; Serrà and Karatzoglou 2017). Although this approach allows any clustering algorithm to be used on the reduced space, the dimensionality reduction leads to a loss of information about the original data.
Subspace clustering is a class of techniques designed to deal with high dimensional data by approximating them using multiple low-dimensional subspaces (Li et al. 2017). This area has received considerable attention in recent years and many techniques have been proposed, including iterative methods, statistical approaches and spectral clustering based methods (You et al. 2016; Tsakiris and Vidal 2017; Struski et al. 2017; Rahmani and Atia 2017). Although these methods give very good results on continuous data such as images (Lu et al. 2012), they are rarely applied to binary data, because they also use the Euclidean distance in their cost functions. Moreover, they are unable to process extremely high dimensional data due to the quadratic or cubic computational complexity with respect to the number of dimensions in the data (Struski et al. 2017). In consequence, many algorithms apply dimensionality reduction as a preprocessing step (Lu et al. 2012).
2.2 Model-based techniques
Model-based clustering (McLachlan and Peel 2004), where data is modeled as a sample from a parametric mixture of probability distributions, is commonly used for grouping continuous data using Gaussian models, but has also been adapted for processing binary data. In the simplest case, the probability model of each cluster is composed of a sequence of independent Bernoulli variables (or multinomial distributions), describing the probabilities on subsequent attributes (Celeux and Govaert 1991; Juan and Vidal 2002). Since many attributes are usually statistically irrelevant and independent of true categories, they may be removed or associated with small weights (Graham and Miller 2006; Bouguila 2010). This partially links mixture models with subspace clustering of discrete data (Yamamoto and Hayashi 2015; Chen et al. 2016). Since the use of multinomial distributions formally requires an independence of attributes, different smoothing techniques were proposed, such as applying Dirichlet distributions as a prior to the multinomial (Bouguila and ElGuebaly 2009). Another version of using mixture models for binary variables tries to maximize the probability that the data points are generated around cluster centers with the smallest possible dispersion (Celeux and Govaert 1991). This technique is closely related to our approach. However, our model allows for using any cluster representatives (not only cluster centers) and is significantly faster due to the use of sparse coders.
A mixture of latent trait analyzers is a specialized type of mixture model for categorical data, where a continuous univariate or multivariate latent variable is used to describe the dependence in categorical attributes (Vermunt 2007; Gollini and Murphy 2014). Although this technique has recently received high interest in the literature (Langseth and Nielsen 2009; Cagnone and Viroli 2012), it is potentially difficult to fit the model, because the likelihood function involves an integral that cannot be evaluated analytically. Moreover, its use is computationally expensive for large high dimensional data sets (Tang et al. 2015).
Information-theoretic clustering relies on minimizing the entropy of a partition or maximizing the mutual information between the data and its discretized form (Li et al. 2004; Tishby et al. 1999; Dhillon and Guan 2003). Although both approaches are similar and can be explained as a minimization of the coding cost, the first creates “hard clusters”, where an instance is classified to exactly one category, while the second allows for soft assignments (Strouse and Schwab 2016). Celeux and Govaert (1991) established a close connection between entropy-based techniques for discrete data and model-based clustering using Bernoulli variables. In particular, the entropy criterion can be formally derived using the maximum likelihood of the classification task.
To the best of the authors’ knowledge, neither model-based clustering nor information-theoretic methods have been optimized for processing sparse high-dimensional binary data. Our method can be seen as a combination of k-medoids with model-based clustering (in the sense that it describes a cluster by a single representative and a multivariate probability model), which is efficiently realized for sparse high-dimensional binary data.
2.3 Model selection criteria
Clustering methods usually assume that the number of clusters is known. However, most model-based techniques can be combined with additional tools, which, to some extent, allow selecting an optimal number of groups (Grantham 2014).
A standard approach for detecting the number of clusters is to conduct a series of statistical tests. The idea relies on comparing likelihood ratios between models obtained for different numbers of groups (McLachlan 1987; Ghosh et al. 2006). Another method is to use a penalized likelihood, which combines the obtained likelihood with a measure of model complexity. A straightforward procedure fits models with different numbers of clusters and selects the one with the optimal penalized likelihood criterion (McLachlan and Peel 2004). Typical measures of model complexity are the Akaike Information Criterion (AIC) (Bozdogan and Sclove 1984) and the Bayesian Information Criterion (BIC) (Schwarz et al. 1978). Although these criteria can be directly applied to Gaussian Mixture Models, an additional parameter has to be estimated for the mixture of Bernoulli models to balance the importance between a model’s quality and its complexity (Bontemps et al. 2013). Alternatively, one can use coding based criteria, such as the Minimum Message Length (MML) (Baxter and Oliver 2000; Bouguila and Ziou 2007) or the Minimum Description Length (MDL) (Barbará et al. 2002) principles. The idea is to find the number of coders (clusters) which minimizes the amount of information needed to transmit the data efficiently from sender to receiver. In addition to detecting the number of clusters, these criteria also allow for eliminating redundant attributes. Coding criteria are usually used for comparing two models (like the AIC or BIC criteria). Silvestre et al. (2015) showed how to apply the MML criterion simultaneously with a clustering method. This is similar to our algorithm, which reduces redundant clusters on-line.
Finally, it is possible to use a fully Bayesian approach, which treats the number of clusters as a random variable. By defining a prior distribution on the number of clusters K, we can estimate it with the maximum a posteriori method (MAP) (Nobile et al. 2004). Although this approach is theoretically justified, it may be difficult to optimize in practice (Richardson and Green 1997; Nobile and Fearnside 2007). To the best of our knowledge, there are currently no available implementations of this technique for the Bernoulli mixture model.
3 Clustering model
The goal of clustering is to split data into groups that contain elements characterized by similar patterns. In our approach, the elements are similar if they can be efficiently compressed by the same algorithm. We begin this section by presenting a model for compressing elements within a single cluster. Next, we combine these partial encoders and define a final clustering objective function.
3.1 Compression model for a single cluster
Let us assume that \(X \subset \{0,1\}^D\) is a data set (cluster) containing D-dimensional binary vectors. We implicitly assume that X represents sparse data, i.e. the number of positions occupied by non-zero bits is relatively low.
A typical way of encoding such data is to directly remember the values at each coordinate (Barbará et al. 2002; Li et al. 2004). Since, in practice, D is often large, this straightforward technique might be computationally inefficient. Moreover, due to the sparsity of the data, positions occupied by zeros are not very informative, while the less frequent non-zero bits carry substantial knowledge. Therefore, instead of remembering all the bits of every vector, it might be more convenient to encode only the positions occupied by non-zero values. It turns out that this strategy can be efficiently implemented by on-line algorithms.
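The positional encoding just described can be sketched in a few lines (the helper names are ours, purely illustrative): a dense binary vector is stored as the list of indices of its non-zero bits, which is much shorter for sparse data.

```python
def to_sparse(x):
    """Store a dense binary vector as the sorted list of its non-zero positions."""
    return [j for j, bit in enumerate(x) if bit]

def to_dense(indices, dim):
    """Reconstruct the dense binary vector from its non-zero positions."""
    x = [0] * dim
    for j in indices:
        x[j] = 1
    return x

x = [0, 1, 0, 0, 1, 0, 0, 0]
s = to_sparse(x)                    # [1, 4] -- two integers instead of eight bits
assert to_dense(s, len(x)) == x     # the encoding is lossless
```

For a vector with L non-zero bits, the sparse form stores L integers instead of D bits, which is exactly the regime the on-line algorithms below exploit.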
This leads to the SparseMix objective function for a single cluster, which gives the average cost of compressing a single element of X by our sparse coder:
Definition 1
Observe that, given probabilities \(p_1,\ldots ,p_D\), the selection of T determines the form of m and \(\mathrm {D}_m(X)\). The above cost coincides with the optimal compression under the assumption of independence imposed on the attributes of \(d \in \mathrm {D}_m(X)\). If this assumption does not hold, this coding scheme gives a suboptimal compression, but such codes can still be constructed.
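Since the formula of Definition 1 is not reproduced in this extract, the following sketch only illustrates the ingredients discussed here, under our own assumptions: a representative m obtained by thresholding attribute frequencies at T, the set of differences \(\mathrm {D}_m(X)\) (bitwise XOR with m), and an idealized per-attribute Bernoulli coding cost under the independence assumption (the sum of binary entropies of the difference frequencies).

```python
from math import log2

def representative(X, T):
    """m_j = 1 iff the frequency of ones at attribute j exceeds T (our assumption)."""
    n = len(X)
    return [1 if sum(x[j] for x in X) > T * n else 0 for j in range(len(X[0]))]

def xor_differences(X, m):
    """D_m(X): every vector is replaced by its bitwise difference from m."""
    return [[x[j] ^ m[j] for j in range(len(m))] for x in X]

def binary_entropy(q):
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def cluster_cost(X, T=0.5):
    """Average per-vector cost: sum of attribute-wise Bernoulli entropies of the
    differences -- the idealized Shannon cost under attribute independence."""
    D = xor_differences(X, representative(X, T))
    n, dim = len(D), len(D[0])
    return sum(binary_entropy(sum(d[j] for d in D) / n) for j in range(dim))
```

With \(T=\frac{1}{2}\) the representative is the majority vector, so every attribute of the differences has frequency at most one half, which keeps the differences sparse.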
Remark 1
To be able to decode the initial data set, we would also need to remember the probabilities \(p_1,\ldots ,p_D\) determining the form of the representative m and the corresponding probability \({\mathcal {Q}}\) used for constructing the codewords. These are the model parameters, which, in a practical coding scheme, are stored once in the header. Since they do not affect the asymptotic value of data compression, we do not include them in the final cost function.^{6}
Moreover, to reconstruct the original data we should distinguish between the encoded representations of subsequent vectors. It could be realized by reserving an additional symbol for separating two encoded vectors or by remembering the number of non-zero positions in every vector. Although this is necessary for the coding task, it is less important for clustering and therefore we decided not to include it in the cost function.
The following theorem demonstrates that \(T=\frac{1}{2}\) provides the best compression rate for a single cluster. Making use of the analogy between Shannon’s compression theory and data modeling, it shows that for \(T_1 \ge \frac{1}{2}\), our model allows for a better fit of a Bernoulli model to the cluster distribution with \(T_1\) than with any \(T_2\) greater than \(T_1\).
Theorem 1
Proof
See Appendix A. \(\square \)
3.2 Clustering criterion function
A single encoder allows us to compress simple data. To efficiently encode real data sets, which usually originate from several sources, it is profitable to construct multiple coding algorithms, each designed for one homogeneous part of the data. Finding an optimal set of algorithms leads to a natural division of the data, which is the basic idea behind our model. Below, we describe the construction of our clustering objective function, which combines the partial cost functions of the clusters.
The SparseMix cost function combines the cost of clusters’ identification with the cost of encoding their elements. To add flexibility to the model, we introduce an additional parameter \(\beta \), which allows us to weight the cost of clusters’ identification. Specifically, if the number of clusters is known a priori, we should put \(\beta = 0\) to prevent the reduction of any groups. On the other hand, to encourage the model to remove clusters, we can increase the value of \(\beta \). By default \(\beta = 1\), which gives a typical coding model:
Definition 2
As can be seen, every cluster is described by a single representative and a probability distribution modeling dispersion from a representative. Therefore, our model can be interpreted as a combination of k-medoids with model-based clustering. It is worth mentioning that for \(T=1\), we always get a representative \(m = 0\). In consequence, \(\mathrm {D}_0(X) = X\) and a distribution in every cluster is directly fitted to the original data.
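The structure described above can be sketched as follows. Note that this is an illustration of the described ingredients (a \(\beta \)-weighted identification cost plus a per-cluster coding cost, weighted by cluster proportions), not the exact formula of Definition 2, which is not reproduced in this extract.

```python
from math import log2

def sparsemix_cost(clusters, cluster_cost, beta=1.0):
    """Total cost: for each cluster with proportion p_i, add the beta-weighted
    identifier cost (-log2 p_i bits per element) and the cluster's own coding
    cost, both weighted by p_i.  `cluster_cost` is any per-cluster criterion."""
    n = sum(len(X) for X in clusters)
    total = 0.0
    for X in clusters:
        p = len(X) / n
        if p == 0:
            continue
        ident = -log2(p)            # cost of transmitting the cluster identifier
        total += p * (beta * ident + cluster_cost(X))
    return total
```

With two equal-size clusters and a zero coding cost, the expression reduces to the one bit needed to identify the cluster of each element, which is the entropy of the cluster proportions.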
The cost of clusters’ identification allows us to reduce unnecessary clusters. To get more insight into this mechanism, we present the following example. For simplicity, we use \(T = \beta = 1\).
Example 1
- 1.
Sources are characterized by the same number of bits. The influence of \(\omega \) and \(\alpha \) on the number of clusters, for a fixed \(d = \frac{1}{2}D\), is presented in Fig. 5a. Generally, if the sources are well-separated, i.e. \(\alpha \notin (0.2,0.8)\), then SparseMix will always create two clusters regardless of the mixing proportion. This confirms that SparseMix is not sensitive to unbalanced sources generating the data, as long as they are distinct.
- 2.
Sources contain the same number of instances. Figure 5b shows the relation between d and \(\alpha \) when the mixing parameter \(\omega =\frac{1}{2}\). If one source is identified by a significantly lower number of attributes than the other (\(d \ll D\)), then SparseMix will merge both sources into a single cluster. Since one source is characterized by a small number of features, it might be more costly to encode the cluster identifier than its attributes. In other words, the clusters are merged because the cost of cluster identification outweighs the cost of encoding the source elements.
- 3.
Both proportions of dimensions and instances for the mixture sources are balanced. If we set equal proportions for the source and dimension coefficients, then the number of clusters depends on the average number of non-zero bits in the data \(L = p d\), see Fig. 5c. For high density of data, we can easily distinguish the sources and, in consequence, SparseMix will end up with two clusters. On the other hand, in the case of sparse data, we use less memory for remembering its elements and the cost of clusters’ identification grows compared to the cost of encoding the elements within the groups.
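A setting like that of Example 1 can be simulated with a simple generator. This is a hedged sketch under our own assumptions (the exact generative model of the example, including the role of \(\alpha \), is not reproduced here): two Bernoulli sources mixed with proportion \(\omega \), one emitting ones with probability p on the first d of D attributes and the other on the remaining attributes.

```python
import random

def sample_two_sources(n, D, d, p, omega, seed=0):
    """Draw n binary vectors: with probability omega from source 1 (ones with
    probability p on the first d attributes), otherwise from source 2 (ones
    with probability p on the remaining D - d attributes)."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        src = 0 if rng.random() < omega else 1
        lo, hi = (0, d) if src == 0 else (d, D)
        x = [1 if lo <= j < hi and rng.random() < p else 0 for j in range(D)]
        data.append(x)
        labels.append(src)
    return data, labels
```

The average number of non-zero bits per vector is then close to \(L = p d\) for the first source, matching the quantity discussed in the third case above.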
4 Fast optimization algorithm
In this section, we present an on-line algorithm for optimizing the SparseMix cost function and discuss its computational complexity. Before that, let us first show how to estimate the probabilities involved in the formula (5).
4.1 Estimation of the cost function
4.2 Optimization algorithm
To obtain an optimal partition of X, the SparseMix cost function has to be minimized. For this purpose, we adapt a modified version of the Hartigan procedure, which is commonly applied in an on-line version of k-means (Hartigan and Wong 1979). Although the complexity of a single iteration of the Hartigan algorithm is often higher than in batch procedures such as EM, the model converges in a significantly lower number of iterations and usually finds better minima [see Śmieja and Geiger (2017) for the experimental comparison in the case of cross-entropy clustering].
The minimization procedure consists of two parts: initialization and iteration. In the initialization stage, \(k \ge 2\) nonempty groups are formed in an arbitrary manner. In the simplest case, it could be a random initialization, but to obtain better results one can also apply a kind of k-means++ seeding. In the iteration step the elements are reassigned between clusters in order to minimize the value of the criterion function. Additionally, due to the cost of clusters’ identification, some groups may lose their elements and finally disappear. In practice, a cluster is reduced if its size falls below a given threshold \(\varepsilon \cdot |X|\), for a fixed \(\varepsilon > 0\).
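The two-stage procedure just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses an arbitrary round-robin initialization, a generic `total_cost` callback standing in for the SparseMix criterion, and a naive full recomputation of the cost for each candidate move; the efficient incremental recalculation is discussed below.

```python
def hartigan(points, k, total_cost, eps=0.05, max_iter=50):
    """On-line Hartigan loop: each point is moved to the cluster that minimizes
    the global criterion; clusters that shrink below eps * |X| are dissolved.
    total_cost(points, assignment) can be any criterion, e.g. the SparseMix cost."""
    n = len(points)
    assign = [i % k for i in range(n)]          # arbitrary nonempty initialization

    def best_label(i, candidates):
        def cost_with(c):
            trial = assign[:]
            trial[i] = c
            return total_cost(points, trial)
        return min(candidates, key=cost_with)

    for _ in range(max_iter):
        changed = False
        for i in range(n):                      # reassignment sweep
            new = best_label(i, sorted(set(assign)))
            if new != assign[i]:
                assign[i], changed = new, True
        for c in sorted(set(assign)):           # cluster reduction step
            members = [i for i in range(n) if assign[i] == c]
            if len(set(assign)) > 1 and 0 < len(members) < eps * n:
                for i in members:
                    assign[i] = best_label(i, [d for d in set(assign) if d != c])
        if not changed:
            break
    return assign
```

Because every move is evaluated against the global criterion, each accepted reassignment strictly decreases the cost, which is what gives Hartigan-style procedures their fast convergence.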
The proposed algorithm has to traverse all data points and all clusters to find the best assignment. Its computational complexity depends on the procedure for updating cluster models and recalculating the SparseMix cost function after switching elements between clusters (see lines 13 and 16). Below, we discuss the details of an efficient recalculation of this cost.
- 1. If \(n^i_j \le (n_i+1)T - 1\), then before and after the update we use the first formula of (8):
$$\begin{aligned} {\hat{N}}^i_j = N^i_j + x_j. \end{aligned}$$
Moreover, this value changes only when \(x_j = 1\).
- 2. If \(n^i_j > (n_i+1)T\), then before and after the update we use the second formula:
$$\begin{aligned} {\hat{N}}^i_j = N^i_j + (1-x_j). \end{aligned}$$
It is changed only when \(x_j = 0\).
- 3. If \(x_j = 0\) and \(n^i_j \in (n_iT, (n_i+1)T]\), then we switch from the second to the first formula and
$$\begin{aligned} {\hat{N}}^i_j = n^i_j. \end{aligned}$$
Otherwise, it remains unchanged.
- 4. If \(x_j = 1\) and \(n^i_j \in ((n_i+1)T - 1, n_iT]\), then we switch from the first to the second formula and
$$\begin{aligned} {\hat{N}}^i_j = n_i - n^i_j. \end{aligned}$$
Otherwise, it remains unchanged.
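The four cases above are constant-time specializations of the following direct relation, sketched here under our notational assumptions: \(n_i\) is the cluster size, \(n^i_j\) the number of ones at attribute j, the representative has \(m_j = 1\) iff the frequency of ones exceeds T, and \(N^i_j\) counts the non-zero bits of the differences at attribute j.

```python
def update_counts(n_i, n_ij, x, T=0.5):
    """Add binary vector x to a cluster with size n_i and per-attribute one-counts
    n_ij; return the new size, new counts, and N_ij, the number of non-zero bits
    of the XOR differences from the representative (m_j = 1 iff count > T * n_i)."""
    n_i += 1
    n_ij = [c + b for c, b in zip(n_ij, x)]
    # if m_j = 0 the differences at j are the ones themselves (N = count);
    # if m_j = 1 they are the zeros (N = n_i - count)
    N_ij = [c if c <= T * n_i else n_i - c for c in n_ij]
    return n_i, n_ij, N_ij
```

The case analysis in the text avoids recomputing this relation for every attribute: only the (few) attributes where x has a non-zero bit, or where the representative flips, need to be touched, which is what makes the update cheap for sparse data.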
4.3 Computational complexity
We analyze the computational complexity of the whole algorithm.
5 Experiments
Summary of the data sets used in experiments
Dataset | Size | Dimensions | Avg. no. of non-zero bits | Classes |
---|---|---|---|---|
Reuters | 291 127 | 47 236 | 55.58 | 18 |
20newsgroups | 18 846 | 187 397 | 151.13 | 20 |
7newsgroups | 6 997 | 26 411 | 99.49 | 7 |
Farm-ads | 4 143 | 54 877 | 197.23 | 2 |
Questions | 5 452 | 3 029 | 4.04 | 6 |
Sentiment | 1 000 | 2 750 | 7.50 | 2 |
SMS | 5 574 | 7 259 | 13.51 | 2 |
Chemical data | 3 374 | 4 860 | 65.45 | 5 |
Mushroom | 8 124 | 119 | 21 | 2 |
Splice | 3 190 | 287 | 60 | 2 |
mnist | 70 000 | 784 | 150.08 | 10 |
5.1 Quality of clusters
In this experiment we evaluated our method over various binary data sets, summarized in Table 1, and compared its results with related methods listed at the beginning of this section. Since we considered data sets designed for classification, we compared the obtained clusterings with a reference partition.
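Agreement with the reference partition is reported below via the Adjusted Rand Index (Table 2). A pure-Python sketch of this measure (the function name is ours); the index is permutation-invariant, equals 1 for identical partitions, and is close to 0 for random ones:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same points: the Rand
    index corrected for chance agreement via its expected value."""
    n = len(a)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(a).values())
    sum_b = sum(comb(v, 2) for v in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Because the index only compares co-membership of pairs, relabeling the clusters (e.g. swapping cluster ids 0 and 1) does not change the score.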
We used seven text data sets: Reuters (a subset of documents from the Reuters Corpus Volume 1 Version 2 categorized into 18 topics) (Lewis et al. 2004), Questions (Li and Roth 2002), 20newsgroups, 7newsgroups (a subset containing only 7 classes of 20newsgroups), Farm-ads, SMS Spam Collection and Sentiment Labeled Sentences retrieved from UCI repository (Asuncion 2007). Each data set was encoded in a binary form with the use of the set-of-words representation: given a dictionary of words, a document (or sentence) is represented as a binary vector, where coordinates indicate the presence or absence of words from a dictionary.
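The set-of-words encoding can be sketched as follows (a minimal illustration; a real pipeline would also handle tokenization, punctuation and stemming):

```python
def set_of_words(documents, dictionary):
    """Encode each document as a binary vector over a fixed dictionary:
    bit j is 1 iff the j-th dictionary word occurs in the document."""
    index = {w: j for j, w in enumerate(dictionary)}
    vectors = []
    for doc in documents:
        x = [0] * len(dictionary)
        for word in doc.lower().split():
            if word in index:
                x[index[word]] = 1
        vectors.append(x)
    return vectors
```

Since a document contains only a small fraction of the dictionary, the resulting vectors are exactly the sparse high dimensional binary data SparseMix targets.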
We considered a real data set containing chemical compounds acting on 5-\(\hbox {HT}_{1A}\) receptor ligands (Warszycki et al. 2013). This is one of the proteins responsible for the regulation of the central nervous system. This data set was manually labeled by an expert in a hierarchical form. We narrowed that classification tree down to 5 classes: tetralines, alkylamines, piperidines, aimides and other piperazines. Each compound was represented by its Klekota–Roth fingerprint, which encodes 4860 chemical patterns in a binary vector (Yap 2011).
We also took a molecular biology data set (splice), which describes primate splice-junction gene sequences. Moreover, we used a data set containing mushrooms described in terms of physical characteristics, where the goal is to predict whether a mushroom is poisonous or edible. Both data sets were selected from the UCI repository.
Finally, we evaluated all methods on the MNIST data set (LeCun et al. 1998), which is a collection of grayscale images of handwritten digits. To produce binary images, pixels with intensity greater than 0 were set to bit 1.
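The binarization step described above (bit 1 for any strictly positive intensity) is a one-liner; a minimal sketch on a flattened image:

```python
def binarize(image, threshold=0):
    """Turn grayscale pixel intensities into bits: 1 iff intensity > threshold."""
    return [1 if px > threshold else 0 for px in image]

# a toy 1 x 5 "image": only strictly positive intensities become 1
assert binarize([0, 3, 128, 0, 255]) == [0, 1, 1, 0, 1]
```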
We considered two types of Bernoulli mixture models. The first one is a classical mixture model, which relies on the maximum likelihood principle (ML) (Elmore et al. 2004). We used the R package “mixtools” (Benaglia et al. 2009) for its implementation. The second method is based on classification maximum likelihood (CML) (Li et al. 2004). While ML models every data point as a sample from a mixture of probability distributions, CML assigns every example to a single component. CML coincides with applying entropy as a clustering criterion (Celeux and Govaert 1991).
We also used two distance-based algorithms. The first one is k-medoids (Huang 1998), which focuses on minimizing the average distance between data points and the corresponding clusters’ medoids (a generalization of a mean). We used the implementation of k-medoids from the cluster R package^{8} with a Jaccard similarity measure.^{9} We also considered the Cluto software (Karypis 2002), which is an optimized package for clustering large data sets. We ran the algorithm “direct” with a cosine distance function,^{10} which means that the package will calculate the final clustering directly, rather than bisecting the data multiple times.
The aforementioned methods are designed to deal with binary data. To apply a typical clustering technique to such data, one could use a method of dimensionality reduction as a preprocessing step. In the experiments, we used PCA to reduce the data dimension to 100 principal components. Next, we used the k-means algorithm on the reduced space.
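A sketch of this baseline, assuming scikit-learn is available and the binary data is given as a NumPy array; the reduction to 100 components matches the setting described above, and the function name is ours:

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_kmeans(X, n_clusters, n_components=100, seed=0):
    """Baseline used in the experiments: project the binary data onto its
    principal components, then run k-means in the reduced continuous space."""
    Z = PCA(n_components=min(n_components, X.shape[1])).fit_transform(X)
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(Z)
```

As noted above, the projection discards part of the original binary structure, which is the price paid for making Euclidean-based clustering applicable.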
Finally, we used two subspace clustering methods: ORGEN (You et al. 2016) and LSR (Lu et al. 2012), which are often applied to continuous data. In the case of high dimensional data, LSR reduces their dimension with the use of PCA. In our experiments, we reduced the data to 100 principal components.
Since all of the aforementioned methods are non-deterministic (their results depend on the initialization), each was run 50 times with different initial partitions and the best result was selected according to the method’s inner metric.
Adjusted Rand Index of the considered methods. Parameter \(\beta \) was fixed at 0 to avoid deleting clusters
Data set | \(\textsc {SparseMix}(\frac{1}{2})\) | \(\textsc {SparseMix}(1)\) | ML | CML | k-medoids | Cluto | LSR | ORGEN | PCA-k-means |
---|---|---|---|---|---|---|---|---|---|
Reuters | 0.1399 | 0.1316 | – | – | – | \(-\) 0.0454 | – | – | 0.0752 |
20newsgroups | 0.1632 | 0.4916 | – | – | – | 0.057 | – | – | 0.045 |
7newsgroups | 0.6337 | 0.8293 | 0.0948 | 0.0991 | 0.0012 | 0.7509 | 0.177 | – | 0.1029 |
Farm-ads | 0.1957 | 0.2664 | 0.0552 | 0.0468 | 0.0192 | 0.2564 | 0.0095 | – | \(-\) 0.0039 |
Questions | 0.0622 | 0.0622 | 0.0274 | 0.0087 | 0.03629 | 0.0925 | 0.0475 | 0.001 | 0.0299 |
Sentiment | 0.0667 | 0.0667 | 0.0198 | 0.0064 | 0.0153 | 0.0571 | 0.0167 | 0 | 0.0083 |
SMS | 0.5433 | 0.5433 | 0.3133 | 0.3063 | 0.1081 | 0.5748 | – | 0 | 0.127 |
Chemical | 0.3856 | 0.4281 | 0.4041 | 0.4472 | 0.3724 | 0.3841 | 0.1745 | 0.128 | 0.3068 |
Mushroom | 0.6354 | 0.6275 | 0.6120 | 0.6354 | 0.626 | 0.6228 | – | 0.2124 | 0.6205 |
Splice | 0.7216 | 0.2607 | 0.2442 | 0.4885 | 0.1071 | 0.1592 | 0.0005 | 0.0005 | 0.2414 |
mnist | 0.4501 | 0.395 | 0.4171 | 0.4277 | 0.3612 | 0.283 | – | 0.3763 | 0.3792 |
The results presented in Table 2 show significant disproportions between the two best-performing methods (SparseMix and Cluto) and the other examined algorithms. The largest differences can be observed on the 7newsgroups and Farm-ads data sets. Due to the large number of examples or attributes, most algorithms, except SparseMix, Cluto and PCA-k-means, failed to run on the 20newsgroups and Reuters data, where only SparseMix produced reasonable results. On the Questions and Sentiment data sets, no method performed significantly better than a random partitioning. Let us observe that these sets are extremely sparse, which could make an appropriate grouping of their examples very difficult. On the Mushroom data, on the other hand, all methods seemed to perform equally well. Slightly larger differences can be observed on MNIST and the chemical data set, where ML and CML obtained good results. Finally, SparseMix with \(T = \frac{1}{2}\) significantly outperformed the other methods on Splice. Let us observe that both subspace clustering techniques were inappropriate for this type of data. Moreover, reducing the data dimension with PCA led to a loss of information and, in consequence, made recovering the true clustering structure impossible.
To further illustrate the effects of SparseMix, we present its detailed results on the MNIST data set. Figure 7 shows a confusion matrix and the cluster representatives (first row) produced by SparseMix with \(T=\frac{1}{2}\). It is clear that most of the cluster representatives resemble actual hand-drawn digits. It can be seen that SparseMix had trouble distinguishing between the digits 4 and 9, mixing them up to some extent in their respective clusters. The digit 5 also could not be properly separated, resulting in its scatter among other clusters. The digit 1 occupied two separate clusters, one for its vertical and one for its diagonal writing style. Nevertheless, this example showed that SparseMix is able to find reasonable cluster representatives that reflect the cluster content in a strictly unsupervised way.
5.2 Detecting the number of clusters
In practice, the number of clusters may be unknown and a clustering method has to detect this number. In this section, we compare the performance of our method with a typical model-based approach when the number of clusters is unknown.
Our method uses the cost of cluster identification as a regularization term, which allows it to reduce redundant clusters on-line. In consequence, SparseMix starts with the number of clusters defined by the user and removes unnecessary groups during the clustering process. The final effect depends on the selection of the parameter \(\beta \), which controls the trade-off between the quality of the model and its complexity. For high \(\beta \), the cost of maintaining clusters can outweigh the cost of coding elements within clusters, and SparseMix collapses them into a single cluster. On the other hand, for small \(\beta \) the cost of coding cluster identifiers becomes irrelevant. Since the number of clusters is usually significantly lower than the number of non-zero attributes in a vector, we should put \(\beta \) greater than 1. Intuitively, both terms will be balanced for \(\beta \) close to the average number of non-zero bits in a data set. We verify this hypothesis in the following experiment.
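The heuristic described above can be sketched in a few lines. The helper below is hypothetical (it is not part of the SparseMix package); it merely computes the average number of non-zero attributes per example, the value suggested here as a starting point for \(\beta \):

```python
import numpy as np

def suggest_beta(X):
    """Heuristic starting value for the trade-off parameter beta:
    the average number of non-zero attributes per example."""
    X = np.asarray(X)
    return float((X != 0).sum(axis=1).mean())

# Toy binary data set: rows have 2, 1 and 4 non-zero bits.
X = np.array([[1, 0, 1, 0],
              [0, 0, 1, 0],
              [1, 1, 1, 1]])
print(suggest_beta(X))  # 7/3 ≈ 2.33
```

For data stored as a SciPy CSR matrix, the same quantity is `X.getnnz(axis=1).mean()`, which avoids densifying the matrix.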
The results illustrated in Fig. 9 show a high sensitivity of this approach to changes in the trade-off parameter \(\lambda \). Although the additional penalty term allows us to select the best model, there is no unified methodology for choosing the parameter^{13}\(\lambda \). In the case of a typical Gaussian mixture model, the analogous BIC formula contains no additional parameters, which makes it easy to use in practice. For the Bernoulli mixture model, however, the analogous approach requires specifying the value of the parameter \(\lambda \), which could limit its practical usefulness.
5.3 Time comparison
To further analyze the efficiency of our algorithm, we performed an analogous experiment on the Reuters data. We ran only SparseMix and Cluto, because the other algorithms were unable to process such a large data set. The results presented in Fig. 12 show some disproportion between the considered methods: our algorithms were 2–5 times slower than Cluto. Nevertheless, Fig. 12 demonstrates that the complexity of SparseMix is close to linear with respect to both the number of data points and the number of attributes. Consequently, a more efficient implementation should further increase the speed of our method.
5.4 Clustering stability
In this experiment, we examined the stability of the considered clustering algorithms in the presence of changes in the data. More precisely, we tested whether a method was able to preserve its clustering results when some data instances or attributes were removed. In practical applications, high stability of an algorithm can be used to speed up the clustering procedure: if a method does not change its result when using fewer instances or attributes, we can safely perform clustering on a reduced data set and assign the remaining instances to the nearest clusters. We again used the chemical data set for this experiment. In this simulation we ran SparseMix only with \(T = \frac{1}{2}\) (our preliminary studies showed that the parameter T does not visibly influence the overall results).
First, we investigated the influence of the number of instances on the clustering results. For this purpose, we performed the clustering of the whole data set X and of a randomly selected fraction \(X^p\) containing p percent of its instances (we considered \(p = 0.1, 0.2, \ldots , 0.9\)). Stability was measured by calculating the ARI between the clusters \(X_1^p,\ldots ,X_k^p\) created from the selected fraction of data \(X^p\) and those created from the whole data set, restricted to the same instances, i.e. \((X_1 \cap X^p), \ldots , (X_k \cap X^p)\). To reduce the effect of randomness, this procedure was repeated 5 times and the final results were averaged. The results presented in Fig. 13a show that for a small number of data points Cluto gave the highest stability, but as the number of instances grew, SparseMix performed better.
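The stability measure used above can be sketched as follows. Here `cluster_fn` stands in for an arbitrary clustering routine (SparseMix itself is not assumed to be importable), and the ARI is implemented directly from the contingency-table formula of Hubert and Arabie (1985):

```python
import numpy as np
from collections import Counter

def ari(a, b):
    """Adjusted Rand Index between two labelings of the same points."""
    comb2 = lambda m: m * (m - 1) // 2
    n = len(a)
    sum_ij = sum(comb2(c) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb2(c) for c in Counter(a).values())
    sum_b = sum(comb2(c) for c in Counter(b).values())
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

def stability(X, cluster_fn, p, rng):
    """ARI between the clustering of a p-fraction subsample and the
    clustering of the whole data set restricted to that subsample."""
    full_labels = cluster_fn(X)
    idx = rng.choice(len(X), size=int(p * len(X)), replace=False)
    sub_labels = cluster_fn(X[idx])
    return ari(list(full_labels[idx]), list(sub_labels))

# Demonstration with a per-row deterministic "clusterer": since its label
# for a row does not depend on the other rows, stability equals 1.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 10))
cluster_fn = lambda X: (X.sum(axis=1) > X.shape[1] // 2).astype(int)
print(stability(X, cluster_fn, p=0.5, rng=rng))  # 1.0
```

A real clusterer would of course produce labels that depend on the whole subsample, so the returned ARI would measure how much the partition drifts as data is removed.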
5.5 Sensitivity to imbalanced data
In the following section, we examine the sensitivity of the clustering algorithms to data imbalance, which often proves to be a crucial trait of an algorithm (Chawla 2009). This extends the theoretical analysis presented in Example 1.
Figure 14a reports the fraction of the data assigned to the first cluster; the optimal solution is \(y = \omega \). We can see that the k-medoids method did not respond to changes in \(\omega \). The other algorithms performed well in the middle range, but gradually drifted away from the optimal line as \(\omega \) approached 0 or 1. The highest robustness to imbalanced data was obtained by ML and by SparseMix with \(\beta = 1\) (i.e., with the cost of cluster identification taken into account). If the cost of maintaining clusters is not considered (\(\beta = 0\)), SparseMix tends to create more balanced groups. These results are consistent with the discussion preceding Definition 2 and with Example 1.
Figure 14b presents the fraction of data assigned to the first cluster (the perfect solution is given by \(y = \frac{1}{2}\)). It can be observed that SparseMix with \(\beta = 1\) was very sensitive to attribute imbalance. According to the conclusion of Example 1, the cost of encoding elements within a cluster is outweighed by the cost of cluster identification as \(\alpha \rightarrow 0\) (or 1), which results in the removal of the lighter group. Since the data is sampled from an underlying distribution and SparseMix converges to a local minimum, some runs result in a single group while others produce two clusters, which explains why the corresponding line does not equal 1 for \(\alpha < 0.2\). This effect did not appear when SparseMix used \(\beta = 0\), because there was no cost for creating an additional cluster. Its results were comparable to ML and CML, which also do not use any cost of cluster identification.
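The cluster-imbalance setting of Fig. 14a can be reproduced in outline as follows. The data model here is an assumption for illustration (two Bernoulli components with bit probabilities 0.8 and 0.2 and mixing weight \(\omega \)), and a plain two-means routine replaces the algorithms compared in the paper, none of which are assumed to be importable:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, d, omega, p1=0.8, p2=0.2):
    """n binary vectors: component 1 (bit probability p1) drawn with
    weight omega, component 2 (bit probability p2) with weight 1 - omega."""
    z = (rng.random(n) < omega).astype(int)
    probs = np.where(z[:, None] == 1, p1, p2)
    X = (rng.random((n, d)) < probs).astype(int)
    return X, z

def two_means(X, iters=20):
    """Plain Lloyd iterations with k = 2; the second centre is
    initialised at the point farthest from the first."""
    C = np.empty((2, X.shape[1]))
    C[0] = X[0]
    C[1] = X[((X - C[0]) ** 2).sum(1).argmax()]
    for _ in range(iters):
        dist = ((X[:, None, :] - C[None]) ** 2).sum(2)
        labels = dist.argmin(1)
        for k in (0, 1):
            if (labels == k).any():
                C[k] = X[labels == k].mean(0)
    return labels, C

X, z = sample_mixture(n=4000, d=20, omega=0.3)
labels, C = two_means(X)
dense = C.mean(1).argmax()       # cluster matching the dense (p1) component
frac = (labels == dense).mean()  # fraction of data assigned to it
print(round(frac, 2))            # close to omega = 0.3
```

Repeating this for a grid of \(\omega \) values and plotting `frac` against \(\omega \) yields a curve of the kind shown in Fig. 14a; a method robust to imbalance stays close to the diagonal \(y = \omega \).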
6 Conclusion
In this paper, we proposed SparseMix, a new approach to the clustering of sparse high dimensional binary data. Our results showed that SparseMix is not only more accurate than related model-based clustering algorithms, but also significantly faster. Its running time is comparable to the algorithms implemented in the Cluto package, software optimized for processing large data sets, while the quality of its clusters is higher. SparseMix describes each cluster by its representative and the dispersion from this representative. Experimental results demonstrated that the representatives obtained for the MNIST data set closely resemble the original examples of handwritten digits. We also provided a theoretical analysis of the model.
Footnotes
- 1.
Another possibility is to use a bag-of-words or tf-idf transform.
- 2.
They often apply linear regression or principal component analysis to match an affine subspace to the data points, which results in quadratic or cubic computational complexity with respect to the data dimension.
- 3.
An implementation of SparseMix algorithm, together with some example data sets, is available on GitHub: https://github.com/hajtos/SparseMIX.
- 4.
In the limiting case.
- 5.
We put: \(0\cdot \log 0 = 0\).
- 6.
Nevertheless, these probabilities should be accounted for in model selection criteria such as AIC or BIC.
- 7.
ARI might take negative values, when the produced partition is less compatible with the reference grouping than a random assignment.
- 8.
- 9.
We also considered the Hamming and cosine distances, but the Jaccard distance provided the best results.
- 10.
- 11.
The exceptions are the farm-ads and splice data sets.
- 12.
- 13.
See Bontemps et al. (2013) for specific heuristics.
Notes
Acknowledgements
This research was partially supported by the National Science Centre (Poland) grant no. 2016/21/D/ST6/00980 and grant no. 2017/25/B/ST6/01271.
References
- Asuncion A, Newman DJ (2007) UCI machine learning repository
- Bai L, Liang J, Dang C, Cao F (2011) A novel attribute weighting algorithm for clustering high-dimensional categorical data. Pattern Recognit 44(12):2843–2861
- Baker LD, McCallum AK (1998) Distributional clustering of words for text classification. In: Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 96–103
- Barbará D, Li Y, Couto J (2002) COOLCAT: an entropy-based algorithm for categorical clustering. In: Proceedings of the eleventh international conference on information and knowledge management. ACM, pp 582–589
- Baxter RA, Oliver JJ (2000) Finding overlapping components with MML. Stat Comput 10(1):5–16
- Benaglia T, Chauveau D, Hunter D, Young D (2009) mixtools: an R package for analyzing finite mixture models. J Stat Softw 32(6):1–29
- Bontemps D, Toussile W (2013) Clustering and variable selection for categorical multivariate data. Electron J Stat 7:2344–2371
- Bouguila N (2010) On multivariate binary data clustering and feature weighting. Comput Stat Data Anal 54(1):120–134
- Bouguila N, ElGuebaly W (2009) Discrete data clustering using finite mixture models. Pattern Recognit 42(1):33–42
- Bouguila N, Ziou D (2007) High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans Pattern Anal Mach Intell 29(10):1716–1731
- Bozdogan H, Sclove SL (1984) Multi-sample cluster analysis using Akaike's information criterion. Ann Inst Stat Math 36(1):163–180
- Cagnone S, Viroli C (2012) A factor mixture analysis model for multivariate binary data. Stat Model 12(3):257–277
- Celeux G, Govaert G (1991) Clustering criteria for discrete data and latent class models. J Classif 8(2):157–176
- Chan EY, Ching WK, Ng MK, Huang JZ (2004) An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognit 37(5):943–952
- Chawla NV (2009) Data mining for imbalanced datasets: an overview. In: Data mining and knowledge discovery handbook. Springer, pp 875–886
- Chen L, Wang S, Wang K, Zhu J (2016) Soft subspace clustering of categorical data with probabilistic distance. Pattern Recognit 51:322–332
- Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken
- Dhillon IS, Guan Y (2003) Information theoretic clustering of sparse co-occurrence data. In: Third IEEE international conference on data mining (ICDM 2003). IEEE, pp 517–520
- dos Santos TR, Zárate LE (2015) Categorical data clustering: what similarity measure to recommend? Exp Syst Appl 42(3):1247–1260
- Elmore RT, Hettmansperger TP, Thomas H (2004) Estimating component cumulative distribution functions in finite mixture models. Commun Stat Theory Methods 33(9):2075–2086
- Ewing T, Baber JC, Feher M (2006) Novel 2D fingerprints for ligand-based virtual screening. J Chem Inf Model 46(6):2423–2431
- Fränti P, Xu M, Kärkkäinen I (2003) Classification of binary vectors by using \(\delta \text{ sc }\) distance to minimize stochastic complexity. Pattern Recognit Lett 24(1):65–73
- Ghosh JK, Delampady M, Samanta T (2006) Hypothesis testing and model selection. In: Theory and methods: an introduction to Bayesian analysis, pp 159–204
- Gollini I, Murphy TB (2014) Mixture of latent trait analyzers for model-based clustering of categorical data. Stat Comput 24(4):569–588
- Graham MW, Miller DJ (2006) Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection. IEEE Trans Signal Process 54(4):1289–1303
- Grantham NS (2014) Clustering binary data with Bernoulli mixture models. Unpublished written preliminary exam, NC State University
- Pang G, Jin H, Jiang S (2015) CenKNN: a scalable and effective text classifier. Data Min Knowl Discov 29(3):593–625
- Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Appl Stat 28:100–108
- He X, Cai D, Niyogi P (2006) Laplacian score for feature selection. In: Advances in neural information processing systems, pp 507–514
- Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2(3):283–304
- Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
- Indhumathi R, Sathiyabama S (2010) Reducing and clustering high dimensional data through principal component analysis. Int J Comput Appl 11(8):1–4
- Jang E, Gu S, Poole B (2017) Categorical reparametrization with gumbel-softmax. In: International conference on learning representations 2017. http://OpenReviews.net/
- Juan A, Vidal E (2002) On the use of Bernoulli mixture models for text classification. Pattern Recognit 35(12):2705–2710
- Karypis G (2002) Cluto: a clustering toolkit. Technical report, DTIC document
- Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525
- Langseth H, Nielsen TD (2009) Latent classification models for binary data. Pattern Recognit 42(11):2724–2736
- LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
- Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5(Apr):361–397
- Li T (2005) A general model for clustering binary data. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, pp 188–197
- Li X, Roth D (2002) Learning question classifiers. In: Proceedings of the 19th international conference on computational linguistics, vol 1. Association for Computational Linguistics, pp 1–7
- Li T, Ma S, Ogihara M (2004) Entropy-based criterion in categorical clustering. In: Proceedings of the twenty-first international conference on machine learning. ACM, p 68
- Li CG, You C, Vidal R (2017) Structured sparse subspace clustering: a joint affinity learning and subspace clustering framework. IEEE Trans Image Process 26(6):2988–3001
- Li CG, You C, Vidal R (2018) On geometric analysis of affine sparse subspace clustering. arXiv preprint arXiv:1808.05965
- Lu CY, Min H, Zhao ZQ, Zhu L, Huang DS, Yan S (2012) Robust and efficient subspace segmentation via least squares regression. In: European conference on computer vision. Springer, pp 347–360
- MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, vol 1, pp 281–297
- Mali K, Mitra S (2003) Clustering and its validation in a symbolic framework. Pattern Recognit Lett 24(14):2367–2376
- McLachlan GJ (1987) On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl Stat 36:318–324
- McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken
- Nobile A (2004) On the posterior distribution of the number of components in a finite mixture. Ann Stat 32(5):2044–2073
- Nobile A, Fearnside AT (2007) Bayesian finite mixtures with an unknown number of components: the allocation sampler. Stat Comput 17(2):147–162
- Papadopoulos S, Kompatsiaris Y, Vakali A, Spyridonos P (2012) Community detection in social media. Data Min Knowl Discov 24(3):515–554
- Plumbley MD (2002) Clustering of sparse binary data using a minimum description length approach
- Rahmani M, Atia G (2017) Innovation pursuit: a new approach to the subspace clustering problem. In: International conference on machine learning, pp 2874–2882
- Ribeiro MT, Singh S, Guestrin C (2016) Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1135–1144
- Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc Ser B (Stat Methodol) 59(4):731–792
- Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
- Serrà J, Karatzoglou A (2017) Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks. In: Proceedings of the eleventh ACM conference on recommender systems. ACM, pp 279–287
- Silvestre C, Cardoso MG, Figueiredo M (2015) Feature selection for clustering categorical data with an embedded modelling approach. Exp Syst 32(3):444–453
- Silvestre C, Cardoso MG, Figueiredo MA (2014) Identifying the number of clusters in discrete mixture models. arXiv preprint arXiv:1409.7419
- Śmieja M, Tabor J (2012) Entropy of the mixture of sources and entropy dimension. IEEE Trans Inf Theory 58(5):2719–2728
- Śmieja M, Nakoneczny S, Tabor J (2016) Fast entropy clustering of sparse high dimensional binary data. In: 2016 International joint conference on neural networks (IJCNN). IEEE, pp 2397–2404
- Śmieja M, Geiger BC (2017) Semi-supervised cross-entropy clustering with information bottleneck constraint. Inf Sci 421:254–271
- Spurek P (2017) General split Gaussian cross-entropy clustering. Exp Syst Appl 68:58–68
- Spurek P, Tabor J, Byrski K (2017) Active function cross-entropy clustering. Exp Syst Appl 72:49–66
- Steinbach M, Ertöz L, Kumar V (2004) The challenges of clustering high dimensional data. In: New directions in statistical physics. Springer, pp 273–309
- Strouse D, Schwab DJ (2016) The deterministic information bottleneck. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI), New York City, NY, pp 696–705
- Struski Ł, Tabor J, Spurek P (2017) Lossy compression approach to subspace clustering. Inf Sci 435:161–183
- Tabor J, Spurek P (2014) Cross-entropy clustering. Pattern Recognit 47(9):3046–3059
- Tang Y, Browne RP, McNicholas PD (2015) Model based clustering of high-dimensional binary data. Comput Stat Data Anal 87:84–101
- Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. In: Proceedings of the Allerton conference on communication, control, and computing, Monticello, IL, pp 368–377
- Tsakiris MC, Vidal R (2017) Hyperplane clustering via dual principal component pursuit. In: International conference on machine learning, pp 3472–3481
- Vermunt JK (2007) Multilevel mixture item response theory models: an application in education testing. In: Proceedings of the 56th session of the International Statistical Institute, Lisbon, Portugal 2228
- Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11(Oct):2837–2854
- Warszycki D, Mordalski S, Kristiansen K, Kafel R, Sylte I, Chilmonczyk Z, Bojarski AJ (2013) A linear combination of pharmacophore hypotheses as a new tool in search of new active compounds—an application for 5-\(\text{ ht }_{1A}\) receptor ligands. PLoS ONE 8(12):e84510. https://doi.org/10.1371/journal.pone.0084510
- Wen JR, Nie JY, Zhang HJ (2002) Query clustering using user logs. ACM Trans Inf Syst 20(1):59–81
- Yamamoto M, Hayashi K (2015) Clustering of multivariate binary data with dimension reduction via \(l_1\)-regularized likelihood maximization. Pattern Recognit 48(12):3959–3968
- Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474
- Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168
- You C, Li CG, Robinson DP, Vidal R (2016) Oracle based active set algorithm for scalable elastic net subspace clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3928–3937
- Zhao Y, Karypis G (2002) Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the eleventh international conference on information and knowledge management. ACM, pp 515–524
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.