Improved Graph-Based Metrics for Clustering High-Dimensional Datasets

Bayá, Ariel E.; Granitto, Pablo M.

doi:10.1007/978-3-642-16952-6_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6433)

Included in the following conference series:

Ibero-American Conference on Artificial Intelligence

Abstract

Clustering is one of the most widely used tools for data analysis. Unfortunately, most methods perform poorly in high-dimensional spaces. Recent work has shown evidence that graph-based metrics can mitigate this problem. In particular, the Penalized K-Nearest Neighbour Graph (PKNNG) metric showed good results in several situations. In this work we propose two improvements to this metric that make it suitable for application to very different domains. First, we introduce an appropriate way to manage outliers, a typical problem in graph-based metrics. Second, we propose a simple method to select an optimal value of K, the number of neighbours considered in the k-NN graph. We analyze the proposed modifications using both artificial and real data, finding strong evidence that supports our improvements. Finally, we compare our new method to other graph-based metrics, showing that it achieves good performance on high-dimensional datasets from very different domains, including DNA microarrays and face and digit image recognition problems.
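
To make the idea of a graph-based metric concrete, the following is a minimal sketch of a generic k-NN graph distance: each point is linked to its K nearest neighbours, and the distance between two points is the length of the shortest path through the graph. It is not the authors' PKNNG metric. The penalization scheme, the outlier handling, and the selection of K are not specified in this abstract and are not implemented here. The function names `knn_graph` and `graph_distances` are hypothetical, chosen only for this illustration.

```python
import heapq
import math


def knn_graph(points, k):
    """Build an undirected k-NN graph with Euclidean edge weights.

    Illustrative helper (not from the paper): an edge i-j exists if j is
    among the k nearest neighbours of i, or i among those of j.
    """
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            adj[i][j] = d
            adj[j][i] = d  # symmetrize the graph
    return adj


def graph_distances(adj, source):
    """Shortest-path (geodesic) distances from source, via Dijkstra.

    Points in a different connected component keep distance math.inf.
    """
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Five points on a half circle: the graph distance between the two ends
# follows the arc, so it is longer than the straight-line distance.
arc = [(math.cos(t * math.pi / 4), math.sin(t * math.pi / 4)) for t in range(5)]
arc_dist = graph_distances(knn_graph(arc, k=2), source=0)

# Two well-separated pairs: with k=1 the graph splits into two components,
# so cross-component distances are infinite in this plain version.
pairs = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
pair_dist = graph_distances(knn_graph(pairs, k=1), source=0)
```

The second example shows why a plain k-NN graph distance needs some extra rule for disconnected components, and why the choice of K matters: too small a K fragments the graph, while a larger K adds shortcuts that pull graph distances back towards Euclidean ones.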

Author information

Authors and Affiliations

CIFASIS, French Argentine International Center for Information and Systems Sciences, UPCAM (France) / UNR–CONICET (Argentina), Bv. 27 de Febrero 210 Bis, 2000, Rosario, Argentina
Ariel E. Bayá & Pablo M. Granitto

Editor information

Editors and Affiliations

Departamento Académico de Computación, Instituto Tecnológico Autónomo de México, Río Hondo No. 1, 01000, Mexico, D.F., México
Angel Kuri-Morales
Department of Computer Science and Engineering, Universidad Nacional del Sur, Alem 1253, 8000, Bahía Blanca, Argentina
Guillermo R. Simari

Cite this paper

Bayá, A.E., Granitto, P.M. (2010). Improved Graph-Based Metrics for Clustering High-Dimensional Datasets. In: Kuri-Morales, A., Simari, G.R. (eds) Advances in Artificial Intelligence – IBERAMIA 2010. IBERAMIA 2010. Lecture Notes in Computer Science, vol 6433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16952-6_19

DOI: https://doi.org/10.1007/978-3-642-16952-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16951-9
Online ISBN: 978-3-642-16952-6
eBook Packages: Computer Science, Computer Science (R0)
