A Novel Method for Identifying Optimal Number of Clusters with Marginal Differential Entropy

  • Bo Shu
  • Wei Chen
  • Zhendong Niu
  • Changmin Zhang
  • Xiaotian Jiang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7901)


Clustering evaluation plays an important role in clustering algorithms. Most recent approaches that evaluate clusterings and identify the optimal number of clusters must either compute pairwise distances between data points or estimate entropy over the entire dimension space, and therefore have high computational complexity. In this paper, we propose an entropy-based clustering evaluation method for identifying the optimal number of clusters: it first projects the cluster centroids onto each individual dimension, then accumulates the marginal differential entropy in each dimension. From the sum of the marginal entropies we can analyze clustering performance and identify the optimal number of clusters. This method dramatically reduces the computational complexity without losing accuracy. Experimental results show that the proposed method is highly stable under various conditions and can be applied to massive collections of high-dimensional data points.
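The core idea described in the abstract can be sketched in a few lines: project the cluster centroids onto each dimension, estimate a marginal differential entropy per dimension, and sum the results. The sketch below is a minimal illustration, not the paper's actual estimator; it assumes a Gaussian-based entropy estimate h = ½·ln(2πe·σ²) per dimension, and the function name `marginal_entropy_score` is hypothetical.

```python
import math

def marginal_entropy_score(centroids):
    """Sum of per-dimension differential entropies of the cluster
    centroids, using a Gaussian estimate h = 0.5 * ln(2*pi*e*var)
    for each dimension (a stand-in for the paper's estimator)."""
    n = len(centroids)
    d = len(centroids[0])
    total = 0.0
    for j in range(d):
        col = [c[j] for c in centroids]      # project centroids onto dimension j
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        var = max(var, 1e-12)                # guard against zero variance
        total += 0.5 * math.log(2 * math.pi * math.e * var)
    return total

# Toy usage: centroids of three 2-D clusters. Comparing this score
# across candidate values of k is how one would analyze performance.
cents = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(marginal_entropy_score(cents))
```

Because each dimension is handled independently, the cost is linear in the number of dimensions and centroids, which is what lets the approach avoid pairwise-distance computation over the full data set.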


Keywords: Clustering Evaluation · Information Theory · Differential Entropy





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Bo Shu¹
  • Wei Chen²
  • Zhendong Niu¹
  • Changmin Zhang¹
  • Xiaotian Jiang¹

  1. School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
  2. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing, China
