Unsupervised Sparsification of Similarity Graphs

Gollub, Tim; Stein, Benno

doi:10.1007/978-3-642-10745-0_7

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Cluster analysis often grapples with high-dimensional and noisy data. This paper identifies sparsification as an approach to this problem. Sparsification improves both the runtime and the quality of cluster algorithms that exploit pairwise object similarities, i.e., that rely on similarity graphs. Sparsification has been addressed before in the field of graph-based cluster algorithms, but the existing approaches leave the burden of parameter tuning to the user. Our approach to sparsification relies on the inherent characteristics of the data and is completely unsupervised. It leads to significant improvements in cluster quality and even outperforms the optimal supervised approaches to sparsification that rely on a single global threshold.
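To make the baseline named in the abstract concrete, the following minimal sketch shows sparsification with a single global threshold: every edge of a similarity graph whose weight falls below a user-chosen value is dropped. It is not the authors' unsupervised method, which the abstract does not specify. The function names, the use of cosine similarity, the toy vectors, and the threshold of 0.5 are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(vectors):
    """Complete similarity graph as a dict mapping (i, j) with i < j to a weight."""
    n = len(vectors)
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


def sparsify_global(graph, threshold):
    """Keep only the edges whose similarity reaches the global threshold."""
    return {edge: w for edge, w in graph.items() if w >= threshold}


# Hypothetical toy data: two groups of two similar objects each.
docs = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.8, 0.1, 0.0],
    [0.0, 0.1, 1.0, 1.0],
    [0.0, 0.0, 0.9, 1.0],
]

graph = similarity_graph(docs)                   # 6 edges for 4 objects
sparse = sparsify_global(graph, threshold=0.5)   # the threshold is the tuned parameter

for (i, j), w in sorted(sparse.items()):
    print(f"{i} -- {j}: {w:.3f}")
```

On this toy input only the two within-group edges survive. The threshold has to be chosen by the user, which is the tuning burden the abstract refers to.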

Author information

Authors and Affiliations

Faculty of Media/Media Systems, Bauhaus-Universität Weimar, Weimar, Germany
Tim Gollub

Corresponding author

Correspondence to Tim Gollub.

Editor information

Editors and Affiliations

LS für BWL, insb. Finanzwirtschaft und, Finanzdienstleistungen, TU Dresden, Helmholtzstr. 10, Dresden, 01062, Germany
Hermann Locarek-Junge
FG Computergestützte Statistik, Univ. Dortmund, Vogelpothsweg 87, Dortmund, 44227, Germany
Claus Weihs

About this paper

Cite this paper

Gollub, T., Stein, B. (2010). Unsupervised Sparsification of Similarity Graphs. In: Locarek-Junge, H., Weihs, C. (eds) Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10745-0_7

DOI: https://doi.org/10.1007/978-3-642-10745-0_7
Published: 03 May 2010
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10744-3
Online ISBN: 978-3-642-10745-0
eBook Packages: Mathematics and Statistics (R0)
