Unsupervised Sparsification of Similarity Graphs

  • Conference paper
Classification as a Tool for Research

Abstract

Cluster analysis often has to cope with high-dimensional and noisy data. This paper identifies sparsification as an approach to this problem. Sparsification improves both the runtime and the quality of cluster algorithms that exploit pairwise object similarities, i.e., that rely on similarity graphs. Sparsification has been addressed before in the field of graph-based cluster algorithms, but the existing approaches leave the burden of parameter tuning to the user. Our approach to sparsification relies on the inherent characteristics of the data and is completely unsupervised. It leads to significant improvements in cluster quality and outperforms even the optimal supervised approaches to sparsification that rely on a single global threshold.
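
To make the notion of sparsification with a single global threshold concrete, the following is a minimal sketch of that baseline only. It is not the unsupervised method of the paper, which the abstract does not specify. The function names, the choice of cosine similarity, the toy data, and the threshold value 0.5 are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X (assumed similarity measure)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against division by zero for all-zero rows
    U = X / norms
    return U @ U.T


def sparsify_global(S, threshold):
    """Global-threshold baseline: keep edges with similarity >= threshold, drop the rest."""
    A = np.where(S >= threshold, S, 0.0)
    np.fill_diagonal(A, 0.0)  # self-similarities are not edges of the graph
    return A


# Toy data (hypothetical): objects 0 and 1 are similar, as are objects 2 and 3.
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.2],
              [0.0, 0.9, 0.3]])

S = cosine_similarity_matrix(X)
A = sparsify_global(S, threshold=0.5)  # the threshold is the parameter a user must tune

# Count undirected edges before and after sparsification.
edges_before = int(np.count_nonzero(np.triu(S, k=1)))
edges_after = int(np.count_nonzero(np.triu(A, k=1)))
print(edges_before, edges_after)
```

In this toy example the weak edges between the two groups fall below the threshold and are removed, while the strong edges within each group survive. The threshold has to be chosen by the user, which is the tuning burden the abstract refers to.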

Author information

Correspondence to Tim Gollub.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gollub, T., Stein, B. (2010). Unsupervised Sparsification of Similarity Graphs. In: Locarek-Junge, H., Weihs, C. (eds) Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10745-0_7
