Abstract
This paper proposes automatically finding the number of clusters (AFNC), a method based on simulated annealing (SA) that determines the number of clusters in a dataset and their initial centers. It is a simple, automatic method that combines local search with two widely accepted global analysis techniques: careful seeding (CS) and the distance histogram (DH). Finding a cluster is formulated as mountain climbing, where a mountain is defined as the convergent domain of SA. When AFNC arrives at the peak of a mountain, it has found one of the clusters in the dataset, and the peak serves as that cluster's initial center. AFNC then climbs another mountain from a new starting point found by CS, and repeats until the termination condition is satisfied. During each climb, the local dense region in which SA searches for its next state is found by analyzing the distance histogram. Experimental results show that AFNC achieves consistent performance across a wide range of datasets.
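The careful-seeding step can be illustrated with a minimal sketch. It assumes that CS here refers to the standard k-means++ rule, in which a new point is drawn with probability proportional to its squared distance from the nearest center already found. The abstract does not give the details of how AFNC applies CS, so the function names `squared_distance` and `careful_seed` are hypothetical and this is not the authors' implementation.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def careful_seed(points, centers, rng=random):
    """Draw a new starting point by D^2 weighting (k-means++ style).

    Assumption: each point's weight is its squared distance to the
    nearest center found so far, so points far from every known
    center are more likely to be chosen.
    """
    weights = [min(squared_distance(p, c) for c in centers) for p in points]
    if sum(weights) == 0:
        # Every point coincides with a known center; fall back to uniform.
        return rng.choice(points)
    return rng.choices(points, weights=weights, k=1)[0]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
    found = [(0.0, 0.0)]  # peak of the first "mountain"
    print(careful_seed(data, found, random.Random(0)))
```

With this weighting, a point that coincides with an existing center has zero weight and is never drawn, so the next climb tends to start in a region far from the clusters already found.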
Additional information
Foundation item: the National Basic Research Program (973) of China (No. 2012CB719903), and the National Natural Science Foundation of China (No. 41071256)
Cite this article
Yang, Z., Huo, H. & Fang, T. Automatically finding the number of clusters based on simulated annealing. J. Shanghai Jiaotong Univ. (Sci.) 22, 139–147 (2017). https://doi.org/10.1007/s12204-017-1813-9