A Family of Two-Dimensional Benchmark Data Sets and Its Application to Comparing Different Cluster Validation Indices

Santos, Jorge M.; Embrechts, Mark

doi:10.1007/978-3-319-07491-7_5


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8495)

Included in the following conference series:

Mexican Conference on Pattern Recognition


Abstract

This paper has two main objectives. The first is to introduce a collection of two-dimensional benchmark data sets with a wide variety of clustering characteristics that are typical of real-world data sets. These simple 2-D data sets allow the user to easily evaluate clustering solutions produced by a variety of clustering algorithms. The second is to evaluate four commonly used cluster validation indices on these 2-D benchmark data sets. It is shown that, even for simple 2-D data sets, there is a large discrepancy in the ideal number of clusters suggested by traditional cluster validation indices. The experiments also suggest that the Dunn index and the gap statistic seem to be the more robust cluster validation indices, even though they still fail to agree with common-sense clustering solutions in more than 50% of the cases.
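To illustrate how a cluster validation index scores a candidate partition of 2-D points, the following is a minimal sketch of the Dunn index in its textbook form (smallest between-cluster distance divided by largest cluster diameter, with Euclidean distance). It is not the authors' implementation. The function name `dunn_index`, the toy points, and the two label vectors are assumptions made up for this example and are not taken from the benchmark data sets.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance / largest cluster diameter."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn index needs at least two clusters")

    # Largest diameter: maximum pairwise distance inside any single cluster.
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest separation: minimum distance between points of different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    return math.inf if max_diameter == 0 else min_separation / max_diameter


# Hypothetical toy data: two tight groups of 2-D points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]  # partition that matches the two visible groups
bad = [0, 1, 0, 1, 0, 1]   # arbitrary partition that mixes the groups

dunn_good = dunn_index(points, good)
dunn_bad = dunn_index(points, bad)
print(dunn_good > dunn_bad)  # higher Dunn index means a better-separated partition
```

In a study of this kind, the index would be evaluated for partitions with different numbers of clusters and the number with the best score compared against the visually obvious grouping. The sketch only compares two fixed partitions.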


References

Ultsch, A.: Clustering with SOM: U*C. In: Workshop on Self-Organizing Maps, pp. 75–82 (2005), www.uni-marburg.de/fb12/datenbionik/Daten
Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int'l Conf. Computer Vision, vol. 2, pp. 416–423 (July 2001)
Alpert, S., Galun, M., Basri, R., Brandt, A.: Image segmentation by probabilistic bottom-up aggregation and cue integration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (June 2007)
Santos, J.M., Marques de Sá, J.: Human clustering on bi-dimensional data: An assessment. Technical Report 1, INEB - Instituto de Engenharia Biomédica, Porto, Portugal (October 2005)
Santos, J.M.: Bi-dimensional data sets, http://www.dema.isep.ipp.pt/~jms/datasets
Davies, D., Bouldin, D.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 224–227 (1979)
Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4(1), 95–104 (1974)
Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987)
Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2), 411–423 (2001)
Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31(8), 651–666 (2010)
Xu, R., Wunsch, D.: Clustering. IEEE Press Series on Computational Intelligence. IEEE (2008)


Author information

Authors and Affiliations

ISEP, School of Engineering, Polytechnic of Porto - Dept. of Mathematics, Portugal
Jorge M. Santos
INEB, Biomedical Engineering Institute, Porto, Portugal
Jorge M. Santos
Dept. Ind. Systems Eng., Rensselaer Polytechnic Institute, Troy, NY, USA
Mark Embrechts


Editor information

Editors and Affiliations

Computer Science Department, National Institute of Astrophysics, Optics and Electronics (INAOE), Luis Enrique Erro No. 1, 72840, Sta. Maria Tonantzintla, Puebla, Mexico
José Francisco Martínez-Trinidad & Jesús Ariel Carrasco-Ochoa
Faculty of Computer Sciences, Autonomous University of Puebla (BUAP), Av. San Claudio y 14 Sur, 72570, Ciudad Universitaria, Puebla, Mexico
José Arturo Olvera-Lopez
Instituto Politécnico Nacional (IPN), Cerro Blanco 141, 76090, Colinas del Cimatario, Querétaro, Mexico
Joaquín Salas-Rodríguez
Centre for Pattern Recognition and Machine Intelligence, Computer Science and Software Engineering Department, Concordia University, 1455 de Maisonneuve Blvd West, Suite, EV3.403, Montreal, QC, Canada
Ching Y. Suen


Cite this paper

Santos, J.M., Embrechts, M. (2014). A Family of Two-Dimensional Benchmark Data Sets and Its Application to Comparing Different Cluster Validation Indices. In: Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-Lopez, J.A., Salas-Rodríguez, J., Suen, C.Y. (eds) Pattern Recognition. MCPR 2014. Lecture Notes in Computer Science, vol 8495. Springer, Cham. https://doi.org/10.1007/978-3-319-07491-7_5


DOI: https://doi.org/10.1007/978-3-319-07491-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07490-0
Online ISBN: 978-3-319-07491-7
eBook Packages: Computer Science, Computer Science (R0)
