Abstract
Cluster analysis often grapples with high-dimensional and noisy data. This paper identifies sparsification as an approach to address this problem. Sparsification improves both the runtime and the quality of cluster algorithms that exploit pairwise object similarities, i.e., that rely on similarity graphs. Sparsification has been addressed in the field of graphical cluster algorithms in the past, but the existing approaches leave the burden of parameter tuning to the user. Our approach to sparsification relies on the inherent characteristics of the data and is completely unsupervised. It leads to significant improvements in cluster quality and outperforms even the optimal supervised approaches to sparsification that rely on a single global threshold.
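To make the baseline mentioned above concrete, the following is a minimal sketch of sparsification with a single global threshold: every edge of the similarity graph whose weight falls below the threshold is removed. It is not the unsupervised method proposed in the paper, whose details the abstract does not give. The function names `sparsify_global` and `edge_count`, the toy matrix `S`, and the threshold value 0.5 are assumptions chosen for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def sparsify_global(sim: Matrix, threshold: float) -> Matrix:
    """Keep an off-diagonal similarity only if it reaches the threshold.

    Entries below the threshold, and the diagonal (self-similarity),
    are set to 0.0, i.e. the corresponding edges are dropped.
    """
    n = len(sim)
    return [
        [sim[i][j] if i != j and sim[i][j] >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def edge_count(sim: Matrix) -> int:
    """Count undirected edges with positive weight (upper triangle only)."""
    n = len(sim)
    return sum(1 for i in range(n) for j in range(i + 1, n) if sim[i][j] > 0.0)


# Toy symmetric similarity matrix for four objects (made-up values):
# objects 0 and 1 are similar, and objects 2 and 3 are similar.
S = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

sparse = sparsify_global(S, threshold=0.5)
print(edge_count(S), "->", edge_count(sparse))  # prints "6 -> 2"
```

In this toy case the threshold removes the four weak cross-group edges and keeps the two strong within-group edges. The drawback the abstract points to is that the user has to supply the threshold, and one global value need not suit all regions of the graph.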
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gollub, T., Stein, B. (2010). Unsupervised Sparsification of Similarity Graphs. In: Locarek-Junge, H., Weihs, C. (eds) Classification as a Tool for Research. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10745-0_7
DOI: https://doi.org/10.1007/978-3-642-10745-0_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10744-3
Online ISBN: 978-3-642-10745-0
eBook Packages: Mathematics and Statistics (R0)