
Clustering Large Categorical Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Abstract

Clustering methods often come down to optimizing a numeric criterion defined from a distance or a dissimilarity measure. In many cases this problem can be shown to be equivalent to estimating the parameters of a probabilistic model under the classification likelihood approach. For instance, the inertia criterion optimized by the k-means algorithm corresponds to the hypothesis of a population arising from a Gaussian mixture. In this paper, we propose a mixture model adapted to categorical data. Using the classification likelihood approach, we develop the Classification EM algorithm (CEM) to estimate the parameters of this mixture model. With our probabilistic model, the data do not need to be transformed, and the estimated parameters directly describe the characteristics of the clusters. This probabilistic approach provides an interpretation of the criterion optimized by the k-modes algorithm, an extension of k-means to categorical attributes, and allows us to study the behavior of that algorithm.
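
To make the approach in the abstract concrete, the following is a minimal sketch (not the authors' code) of a CEM-style procedure for a latent class model, i.e. a mixture of independent multinomial distributions over integer-coded categorical attributes. The function name cem_categorical, the smoothing constant alpha, the iteration cap n_iter, and the random initialization are illustrative assumptions, not part of the paper's formulation.

import numpy as np

def cem_categorical(X, n_clusters, n_iter=50, alpha=1e-6, seed=0):
    # X: (n, d) array of integer-coded categorical attributes,
    #    with X[i, j] in {0, ..., n_cat[j] - 1}
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cat = [int(X[:, j].max()) + 1 for j in range(d)]
    z = rng.integers(n_clusters, size=n)            # random initial partition

    for _ in range(n_iter):
        # M-step: mixing proportions and per-cluster category probabilities
        # estimated from the current hard partition
        pi = np.array([(z == k).mean() for k in range(n_clusters)])
        pi = np.clip(pi, alpha, None)
        probs = []                                  # probs[j] has shape (K, n_cat[j])
        for j in range(d):
            p = np.empty((n_clusters, n_cat[j]))
            for k in range(n_clusters):
                counts = np.bincount(X[z == k, j], minlength=n_cat[j])
                p[k] = (counts + alpha) / (counts.sum() + alpha * n_cat[j])
            probs.append(p)

        # E-step + C-step: assign each object to the cluster maximizing
        # log pi_k + sum_j log p_kj(x_ij), i.e. the classification log-likelihood
        log_post = np.tile(np.log(pi), (n, 1))
        for j in range(d):
            log_post += np.log(probs[j])[:, X[:, j]].T
        new_z = log_post.argmax(axis=1)
        if np.array_equal(new_z, z):                # partition is stable: stop
            break
        z = new_z

    return z, pi, probs

# Toy usage: four objects described by three categorical attributes
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [2, 0, 2],
              [2, 0, 0]])
labels, proportions, category_probs = cem_categorical(X, n_clusters=2)

Under equal mixing proportions and a suitably constrained parameterization of the category probabilities, maximizing this classification log-likelihood reduces to assigning each object to the cluster whose mode agrees with it on the largest number of attributes, which is essentially the simple matching criterion used by k-modes; this is the kind of correspondence the paper makes precise.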

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jollois, F.-X., Nadif, M. (2002). Clustering Large Categorical Data. In: Chen, M.-S., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science (LNAI), vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_25

  • DOI: https://doi.org/10.1007/3-540-47887-6_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43704-8

  • Online ISBN: 978-3-540-47887-4
