Measures of Dispersion and Cluster-Trees for Categorical Data

Müller-Funk, Ulrich

doi:10.1007/978-3-540-78246-9_20

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

A clustering algorithm is, in essence, characterized by two features: (1) the way in which heterogeneity within and between clusters is measured (the objective function), and (2) the steps by which splitting or merging proceeds. For categorical data there are no “standard indices” formalizing the first aspect. Instead, a number of ad hoc concepts have been used in cluster analysis, labelled “similarity”, “information”, “impurity” and the like. To clarify matters, we start out from a set of axioms summarizing our conception of “dispersion” for categorical attributes. Unsurprisingly, some well-known measures, including the Gini index and the entropy, turn out to qualify as measures of dispersion. We try to indicate how these measures can be used in unsupervised classification problems as well. Owing to its simple analytic form, the Gini index allows for a dispersion-decomposition formula that can be made the starting point for a CART-like cluster tree. Trees are favoured because of (i) factor selection and (ii) communicability.
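
As a rough illustration of the two measures named in the abstract, the sketch below computes the Gini index, 1 - sum_j p_j^2, and the entropy, -sum_j p_j log p_j, from the relative category frequencies p_j of a single categorical attribute. It also checks a standard within/between split of the pooled Gini index over a partition into clusters. The function names, the toy data and the particular form of the split are assumptions made for this example; the chapter's own axioms and decomposition formula are not shown on this page and may differ.

```python
from collections import Counter
from math import log


def proportions(values):
    """Relative frequency of each category in a non-empty sequence."""
    n = len(values)
    return {cat: k / n for cat, k in Counter(values).items()}


def gini(values):
    """Gini index: 1 - sum_j p_j^2 (0 when all values coincide)."""
    return 1.0 - sum(p * p for p in proportions(values).values())


def entropy(values):
    """Shannon entropy with natural logarithm: -sum_j p_j log p_j."""
    return -sum(p * log(p) for p in proportions(values).values())


def gini_decomposition(clusters):
    """Return (total, within, between) for a list of non-empty clusters.

    Assumed form of the split (a textbook identity, not taken from the chapter):
      total   = Gini index of the pooled data
      within  = sum_k w_k * Gini(cluster k), with w_k = n_k / n
      between = sum_k w_k * sum_j (p_kj - p_j)^2
    """
    pooled = [v for cluster in clusters for v in cluster]
    n = len(pooled)
    p_all = proportions(pooled)
    within = sum(len(c) / n * gini(c) for c in clusters)
    between = 0.0
    for c in clusters:
        p_c = proportions(c)
        between += len(c) / n * sum(
            (p_c.get(cat, 0.0) - p_all[cat]) ** 2 for cat in p_all
        )
    return gini(pooled), within, between


if __name__ == "__main__":
    # Hypothetical toy data: one categorical attribute, two clusters.
    clusters = [["a", "a", "b"], ["b", "c", "c"]]
    total, within, between = gini_decomposition(clusters)
    print(total, within + between)
```

In this split the within term is what a tree-style procedure would try to make small when choosing a partition, which mirrors the role impurity reduction plays in CART; whether the chapter's cluster tree uses exactly this criterion is not stated here.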

Author information

Authors and Affiliations

ERCIS, Leonardo-Campus 3, 48149, Münster, Germany
Ulrich Müller-Funk

Editor information

Editors and Affiliations

Institute of Computer Science and Institute of Business Economics and Information Systems, University of Hildesheim, Marienburgerplatz 22, 31141, Hildesheim, Germany
Christine Preisach
Lehrstuhl für Mustererkennung und Bildverarbeitung, Universität Freiburg, Gebäude 052, 79110, Freiburg i. Br, Germany
Hans Burkhardt
Institute of Computer Science and Institute of Business Economics and Information Systems, Marienburgerplatz 22, 31141, Hildesheim, Germany
Lars Schmidt-Thieme
Fakultät für Wirtschaftswissenschaften, Lehrstuhl für Betriebswirtschaftslehre, insbes. Marketing, Universitätsstraße 25, 33615, Bielefeld, Germany
Reinhold Decker

About this paper

Cite this paper

Müller-Funk, U. (2008). Measures of Dispersion and Cluster-Trees for Categorical Data. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_20

DOI: https://doi.org/10.1007/978-3-540-78246-9_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78239-1
Online ISBN: 978-3-540-78246-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
