Measures of Dispersion and Cluster-Trees for Categorical Data

  • Conference paper
Data Analysis, Machine Learning and Applications

Abstract

A clustering algorithm is, in essence, characterized by two features: (1) the way in which heterogeneity within and between clusters is measured (the objective function), and (2) the steps by which splitting or merging proceeds. For categorical data there are no “standard indices” formalizing the first aspect. Instead, a number of ad hoc concepts have been used in cluster analysis, labelled “similarity”, “information”, “impurity” and the like. To clarify matters, we start from a set of axioms summarizing our conception of “dispersion” for categorical attributes. Unsurprisingly, some well-known measures, including the Gini index and the entropy, turn out to qualify as measures of dispersion. We indicate how these measures can be used in unsupervised classification problems as well. Owing to its simple analytic form, the Gini index allows for a dispersion-decomposition formula that can serve as the starting point for a CART-like cluster tree. Trees are favoured because of (i) factor selection and (ii) communicability.
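The abstract does not reproduce the axioms or the decomposition formula, so the following is only a minimal sketch under stated assumptions. It uses the textbook definitions of the Gini index, 1 - Σ p_j², and the entropy, -Σ p_j log p_j, for a single categorical attribute, together with the standard within/between identity for the Gini index. The function names (`proportions`, `gini`, `entropy`, `gini_decomposition`) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def proportions(values):
    """Relative frequency of each category in a list of labels."""
    counts = Counter(values)
    n = len(values)
    return {cat: k / n for cat, k in counts.items()}


def gini(values):
    """Gini index 1 - sum_j p_j^2; zero when only one category occurs."""
    return 1.0 - sum(p * p for p in proportions(values).values())


def entropy(values):
    """Shannon entropy -sum_j p_j log p_j (natural log); zero for one category."""
    return -sum(p * log(p) for p in proportions(values).values())


def gini_decomposition(clusters):
    """Split the pooled Gini index into a within and a between part.

    Assumption (standard identity, not quoted from the paper):
    with weights w_k = n_k / n,
        total   = gini(pooled data)
        within  = sum_k w_k * gini(cluster k)
        between = sum_k w_k * sum_j (p_kj - p_j)^2
    and total = within + between.
    """
    pooled = [v for cluster in clusters for v in cluster]
    n = len(pooled)
    p_all = proportions(pooled)
    total = gini(pooled)
    within = sum(len(c) / n * gini(c) for c in clusters)
    between = 0.0
    for c in clusters:
        p_c = proportions(c)
        between += len(c) / n * sum(
            (p_c.get(cat, 0.0) - p) ** 2 for cat, p in p_all.items()
        )
    return total, within, between


# Toy example: two clusters of one categorical attribute (made-up data).
clusters = [["a", "a", "a", "b"], ["b", "b", "c", "c"]]
total, within, between = gini_decomposition(clusters)
print(total, within, between)
```

In a CART-like setting, a candidate split could be scored by how much of the total dispersion it moves into the between part. Whether the paper uses exactly this criterion cannot be told from the abstract.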

References

  • ANDRITSOS, P., TSAPARAS, P., MILLER, R.J. and SEVCIK, K.C. (2004): LIMBO: Scalable clustering of categorical data. In: E. Bertino, S. Christodoulakis, D. Plexousakis, V. Christophides, M. Koubarakis, K. Böhm and E. Ferrari (Eds.): Advances in Database Technology - EDBT 2004. Springer, Berlin, 123-146.

  • BARBARA, D., LI, Y. and COUTO, J. (2002): COOLCAT: An entropy-based algorithm for categorical clustering. Proceedings of the Eleventh International Conference on Information and Knowledge Management, 582-589.

  • BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A. and STONE, C.J. (1984): Classification and Regression Trees. CRC Press, Florida.

  • FAHRMEIR, L., HAMERLE, A. and TUTZ, G. (1996): Multivariate statistische Methoden. de Gruyter, Berlin.

  • RENYI, A. (1971): Wahrscheinlichkeitsrechnung. Mit einem Anhang über Informationstheorie. VEB Deutscher Verlag der Wissenschaften, Berlin.

  • TEBOULLE, M., BERKHIN, P., DHILLON, I., GUAN, Y. and KOGAN, J. (2006): Clustering with entropy-like k-means algorithms. In: J. Kogan, C. Nicholas and M. Teboulle (Eds.): Grouping Multidimensional Data: Recent Advances in Clustering. Springer Verlag, New York, 127-160.

  • TONG, Y.L. (1980): Probability inequalities in multivariate distributions. In: Z.W. Birnbaum and E. Lukacs (Eds.): Probability and Mathematical Statistics. Academic Press, New York.

  • WITTING, H. and MÜLLER-FUNK, U. (1995): Mathematische Statistik II - Asymptotische Statistik: Parametrische Modelle und nicht-parametrische Funktionale. Teubner, Stuttgart.

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Müller-Funk, U. (2008). Measures of Dispersion and Cluster-Trees for Categorical Data. In: Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R. (eds) Data Analysis, Machine Learning and Applications. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78246-9_20
