Abstract
This paper has two main objectives. The first is to introduce a collection of two-dimensional benchmark data sets with a wide variety of clustering characteristics typical of real-world data sets. These simple 2-D data sets allow the user to easily evaluate clustering solutions produced by different clustering algorithms. The second is to evaluate four commonly used cluster validation indices on these 2-D benchmark data sets. It is shown that, even for simple 2-D data sets, there is a large discrepancy in the ideal number of clusters suggested by traditional cluster validation indices. The experiments also suggest that the Dunn index and the GAP statistic seem to be the more robust cluster validation indices, even though they still fail to agree with common-sense clustering solutions in more than 50% of the cases.
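As an illustration of what a cluster validation index measures, the sketch below computes one common formulation of the Dunn index (smallest distance between points of different clusters divided by the largest cluster diameter) for two candidate partitions of a toy 2-D data set. The data points, the labelings, and the name `dunn_index` are assumptions made for this example; they are not taken from the paper's benchmark collection, and other variants of the index define separation and diameter differently.

```python
from itertools import combinations
from math import dist


def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance / largest cluster diameter."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest diameter: maximum pairwise distance inside any single cluster.
    max_diameter = max(
        (dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest separation: minimum distance between points of different clusters.
    min_separation = min(
        dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    return min_separation / max_diameter if max_diameter > 0 else float("inf")


# Hypothetical toy data: two tight, well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
two_clusters = [0, 0, 0, 0, 1, 1, 1, 1]    # the intuitive partition
three_clusters = [0, 0, 2, 2, 1, 1, 1, 1]  # splits the first group in two

dunn_two = dunn_index(points, two_clusters)
dunn_three = dunn_index(points, three_clusters)
print(f"k=2: {dunn_two:.3f}  k=3: {dunn_three:.3f}")
```

A higher value indicates compact, well-separated clusters, so on this toy data the index favors the two-cluster partition. On less clean data sets an index may well favor a partition that disagrees with visual intuition, which is the kind of discrepancy the abstract reports.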
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Santos, J.M., Embrechts, M. (2014). A Family of Two-Dimensional Benchmark Data Sets and Its Application to Comparing Different Cluster Validation Indices. In: Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-Lopez, J.A., Salas-Rodríguez, J., Suen, C.Y. (eds) Pattern Recognition. MCPR 2014. Lecture Notes in Computer Science, vol 8495. Springer, Cham. https://doi.org/10.1007/978-3-319-07491-7_5
DOI: https://doi.org/10.1007/978-3-319-07491-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07490-0
Online ISBN: 978-3-319-07491-7
eBook Packages: Computer Science, Computer Science (R0)