Improved Graph-Based Metrics for Clustering High-Dimensional Datasets

  • Conference paper
Advances in Artificial Intelligence – IBERAMIA 2010 (IBERAMIA 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6433)

Abstract

Clustering is one of the most widely used tools for data analysis. Unfortunately, most methods perform poorly in high-dimensional spaces. Recent work has shown evidence that graph-based metrics can mitigate this problem. In particular, the Penalized K-Nearest Neighbour Graph metric (PKNNG) showed good results in several situations. In this work we propose two improvements to this metric that make it suitable for application to very different domains. First, we introduce an appropriate way to manage outliers, a typical problem in graph-based metrics. Second, we propose a simple method to select an optimal value of K, the number of neighbours considered in the k-nn graph. We analyze the proposed modifications using both artificial and real data, finding strong evidence that supports our improvements. We then compare our new method to other graph-based metrics, showing that it achieves good performance on high-dimensional datasets from very different domains, including DNA microarrays and face and digit image recognition problems.
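
As a rough illustration of what a graph-based metric over a k-nn graph can look like, the sketch below builds a symmetric k-nearest-neighbour graph and measures distance as the shortest path along its edges. It is not the PKNNG metric or the improvements proposed in the paper: the penalization, outlier handling, and selection of K are not reproduced, and the function names `knn_graph` and `graph_distances` are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        # Sort all other points by distance to point i (ties broken by index).
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            adj[i][j] = d
            adj[j][i] = d  # keep the graph undirected
    return adj


def graph_distances(adj, source):
    """Shortest-path distance from `source` to every node (Dijkstra)."""
    dist = {node: math.inf for node in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Four points along an L-shaped path; the graph distance follows the path.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
d = graph_distances(knn_graph(points, k=1), source=0)

# Two well-separated pairs: with k=1 the graph splits into two components.
far = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
d_far = graph_distances(knn_graph(far, k=1), source=0)
```

In the first example the graph distance between the first and last points is the path length, which is larger than their straight-line Euclidean distance. In the second, points in different components are left at infinite distance, which shows why the choice of K and the treatment of disconnected or outlying points matter for this family of metrics.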

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bayá, A.E., Granitto, P.M. (2010). Improved Graph-Based Metrics for Clustering High-Dimensional Datasets. In: Kuri-Morales, A., Simari, G.R. (eds) Advances in Artificial Intelligence – IBERAMIA 2010. IBERAMIA 2010. Lecture Notes in Computer Science, vol 6433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16952-6_19

  • DOI: https://doi.org/10.1007/978-3-642-16952-6_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16951-9

  • Online ISBN: 978-3-642-16952-6

  • eBook Packages: Computer Science, Computer Science (R0)
