Abstract
This paper presents a method for assessing cluster stability. We hypothesize that if a “consistent” clustering algorithm is used to partition several independent samples, then the clustered samples should be identically distributed. We test this hypothesis with the two-sample energy test. On its own, such a test is not very efficient in clustering problems, because outliers in the samples and limitations of the clustering algorithm contribute heavily to the noise level. We therefore compute the test statistic many times and obtain its empirical distribution. The “true” number of clusters is taken to be the one that yields the most concentrated distribution. Results of numerical experiments are reported.
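To make the two-sample idea concrete, the following is a minimal sketch of one common energy-type statistic, the sample energy distance 2E|X-Y| - E|X-X'| - E|Y-Y'|. The abstract does not specify the exact statistic or weighting used in the chapter, so this particular form is an assumption, and the helper names (`mean_pairwise_distance`, `energy_distance`, `draw_sample`) and the Gaussian toy data are hypothetical. The sketch covers only the statistic itself, not the clustering or the repeated resampling step.

```python
import math
import random


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y), x in xs, y in ys."""
    total = sum(math.dist(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Sample energy distance 2*E|X-Y| - E|X-X'| - E|Y-Y'|.

    Assumed form (V-statistic); the chapter may use a different
    distance weighting or normalization.
    """
    return (2.0 * mean_pairwise_distance(xs, ys)
            - mean_pairwise_distance(xs, xs)
            - mean_pairwise_distance(ys, ys))


def draw_sample(rng, n, center):
    """Hypothetical toy data: n points from a 2-D unit-variance Gaussian."""
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
same_a = draw_sample(rng, 100, (0.0, 0.0))
same_b = draw_sample(rng, 100, (0.0, 0.0))
shifted = draw_sample(rng, 100, (3.0, 0.0))

# Two samples from the same distribution versus a clearly shifted one.
d_same = energy_distance(same_a, same_b)
d_shifted = energy_distance(same_a, shifted)
print(f"same distribution:    {d_same:.4f}")
print(f"shifted distribution: {d_shifted:.4f}")
```

The statistic stays near zero when both samples come from the same distribution and grows as the distributions separate. In the spirit of the abstract, one would recompute such a value over many resampled and clustered pairs for each candidate number of clusters, then compare how concentrated the resulting empirical distributions are.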
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Volkovich, Z., Barzily, Z., Morozensky, L. (2006). A cluster stability criteria based on the two-sample test concept. In: Last, M., Szczepaniak, P.S., Volkovich, Z., Kandel, A. (eds) Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol 23. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33880-2_33
DOI: https://doi.org/10.1007/3-540-33880-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33879-6
Online ISBN: 978-3-540-33880-2
eBook Packages: Engineering, Engineering (R0)