Abstract
A clustering algorithm is essentially characterized by two features: (1) the way in which heterogeneity within and between clusters is measured (the objective function), and (2) the steps by which splitting or merging proceeds. For categorical data there are no “standard indices” formalizing the first aspect. Instead, a number of ad hoc concepts have been used in cluster analysis, labelled “similarity”, “information”, “impurity” and the like. To clarify matters, we start from a set of axioms summarizing our conception of “dispersion” for categorical attributes. Unsurprisingly, some well-known measures, including the Gini index and the entropy, turn out to qualify as measures of dispersion. We indicate how these measures can also be used in unsupervised classification problems. Owing to its simple analytic form, the Gini index allows for a dispersion-decomposition formula that can serve as the starting point for a CART-like cluster tree. Trees are favoured because of (i) factor selection and (ii) communicability.
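To make the two dispersion measures and the decomposition idea concrete, the following is a minimal sketch rather than the chapter's own formulation. It assumes the textbook definitions, Gini index 1 - Σ p_i² and Shannon entropy -Σ p_i log p_i, and a within/between split of the pooled Gini index that follows from the usual variance identity. The function names (proportions, gini, entropy, gini_decomposition) and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def proportions(values, categories):
    """Relative frequency of each category in a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]


def gini(p):
    """Gini index 1 - sum(p_i^2): zero iff all mass sits on one category."""
    return 1.0 - sum(q * q for q in p)


def entropy(p):
    """Shannon entropy -sum(p_i log p_i), with 0 * log 0 taken as 0."""
    return -sum(q * log(q) for q in p if q > 0)


def gini_decomposition(clusters):
    """Split the pooled Gini index into a within and a between part.

    Assumed identity (standard, not quoted from the chapter):
        G(pooled) = sum_k w_k G(p_k) + sum_k w_k ||p_k - p||^2,
    where w_k is the share of observations in cluster k.
    """
    categories = sorted({v for c in clusters for v in c})
    pooled = [v for c in clusters for v in c]
    n = len(pooled)
    p_all = proportions(pooled, categories)

    within = 0.0
    between = 0.0
    for c in clusters:
        w = len(c) / n
        p_k = proportions(c, categories)
        within += w * gini(p_k)
        between += w * sum((a - b) ** 2 for a, b in zip(p_k, p_all))
    return gini(p_all), within, between


# Toy example: one categorical attribute, two candidate clusters.
clusters = [["a", "a", "b"], ["b", "b", "b", "c"]]
total, within, between = gini_decomposition(clusters)
print(total, within, between)
```

In a CART-like cluster tree, a split could plausibly be scored by how much of the pooled dispersion it moves into the between part; whether the chapter uses exactly this criterion is not stated in the abstract.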
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Müller-Funk, U. (2008). Measures of Dispersion and Cluster-Trees for Categorical Data. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_20
DOI: https://doi.org/10.1007/978-3-540-78246-9_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics (R0)