
Automatically finding the number of clusters based on simulated annealing

  • Zhengwu Yang (杨政武)
  • Hong Huo (霍 宏)
  • Tao Fang (方 涛)

Abstract

This paper proposes automatically finding the number of clusters (AFNC), a method based on simulated annealing (SA) that determines the number of clusters and their initial centers. It is a simple and automatic method that combines local search with two widely accepted global analysis techniques, namely careful seeding (CS) and the distance histogram (DH). The procedure for finding a cluster is formulated as mountain climbing, where a mountain is defined as the convergent domain of SA. When AFNC arrives at the peak of a mountain, it has found one of the clusters in the dataset, and the peak is that cluster's initial center. AFNC then climbs another mountain from a new starting point found by CS, and repeats until the termination condition is satisfied. While climbing a mountain, the locally dense region in which SA searches for its next state is found by analyzing the distance histogram. Experimental results show that AFNC achieves consistent performance on a wide range of datasets.
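The careful-seeding step can be illustrated with the D²-weighted sampling rule known from k-means++. The sketch below is only an assumption about how a new starting point might be drawn once some centers have been found; the function name `next_start_point`, its parameters, and the toy data are hypothetical and are not taken from the paper, whose exact seeding procedure is not given in the abstract.

```python
import numpy as np

def next_start_point(points, found_centers, rng):
    """Return the index of a new starting point via D^2-weighted sampling.

    Hypothetical helper: points far from every already-found center are
    more likely to be chosen, as in k-means++ seeding.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(found_centers, dtype=float)
    # Squared distance from each point to its nearest already-found center.
    diffs = points[:, None, :] - centers[None, :, :]
    d2 = (diffs ** 2).sum(axis=2).min(axis=1)
    total = d2.sum()
    if total == 0.0:
        # Every point coincides with a center; fall back to a uniform draw.
        return int(rng.integers(len(points)))
    return int(rng.choice(len(points), p=d2 / total))

# Toy data (assumed): two tight groups, one center already found at the origin.
rng = np.random.default_rng(0)
data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
idx = next_start_point(data, [[0.0, 0.0]], rng)
```

With these toy values, the point that coincides with the found center has zero weight and is never drawn, while the two distant points carry almost all of the probability mass, so the next climb most likely starts in the unexplored group.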

Key words

clusters, simulated annealing (SA), distance histogram, careful seeding

CLC number

TP 301.6 

Document code


Copyright information

© Shanghai Jiaotong University and Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Zhengwu Yang (杨政武)
    • 1
  • Hong Huo (霍 宏)
    • 1
  • Tao Fang (方 涛)
    • 1
  1. Department of Automation, Shanghai Jiao Tong University, Shanghai, China
