
Clustering Large Categorical Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Abstract

Clustering methods often come down to optimizing a numeric criterion defined from a distance or a dissimilarity measure. In many cases this problem can be shown to be equivalent to estimating the parameters of a probabilistic model under the classification likelihood approach. For instance, the inertia criterion optimized by the k-means algorithm corresponds to the hypothesis of a population arising from a Gaussian mixture. In this paper, we propose a mixture model adapted to categorical data. Using the classification likelihood approach, we develop the Classification EM algorithm (CEM) to estimate the parameters of this mixture model. With our probabilistic model, the data do not need to be transformed, and the estimated parameters directly describe the characteristics of the clusters. This probabilistic approach provides an interpretation of the criterion optimized by the k-modes algorithm, an extension of k-means to categorical attributes, and allows us to study the behavior of that algorithm.
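
To make the approach in the abstract concrete, the following is a minimal sketch (not the authors' code) of a CEM-style procedure for a latent class model, i.e. a mixture of independent multinomial distributions over integer-coded categorical attributes. The function name cem_categorical, the smoothing constant alpha, the iteration cap n_iter, and the random initialization are illustrative assumptions, not part of the paper's formulation.

import numpy as np

def cem_categorical(X, n_clusters, n_iter=50, alpha=1e-6, seed=0):
    # X: (n, d) array of integer-coded categorical attributes,
    #    with X[i, j] in {0, ..., n_cat[j] - 1}
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_cat = [int(X[:, j].max()) + 1 for j in range(d)]
    z = rng.integers(n_clusters, size=n)            # random initial partition

    for _ in range(n_iter):
        # M-step: mixing proportions and per-cluster category probabilities
        # estimated from the current hard partition
        pi = np.array([(z == k).mean() for k in range(n_clusters)])
        pi = np.clip(pi, alpha, None)
        probs = []                                  # probs[j] has shape (K, n_cat[j])
        for j in range(d):
            p = np.empty((n_clusters, n_cat[j]))
            for k in range(n_clusters):
                counts = np.bincount(X[z == k, j], minlength=n_cat[j])
                p[k] = (counts + alpha) / (counts.sum() + alpha * n_cat[j])
            probs.append(p)

        # E-step + C-step: assign each object to the cluster maximizing
        # log pi_k + sum_j log p_kj(x_ij), i.e. the classification log-likelihood
        log_post = np.tile(np.log(pi), (n, 1))
        for j in range(d):
            log_post += np.log(probs[j])[:, X[:, j]].T
        new_z = log_post.argmax(axis=1)
        if np.array_equal(new_z, z):                # partition is stable: stop
            break
        z = new_z

    return z, pi, probs

# Toy usage: four objects described by three categorical attributes
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [2, 0, 2],
              [2, 0, 0]])
labels, proportions, category_probs = cem_categorical(X, n_clusters=2)

Under equal mixing proportions and a suitably constrained parameterization of the category probabilities, maximizing this classification log-likelihood reduces to assigning each object to the cluster whose mode agrees with it on the largest number of attributes, which is essentially the simple matching criterion used by k-modes; this is the kind of correspondence the paper makes precise.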

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jollois, F.-X., Nadif, M. (2002). Clustering Large Categorical Data. In: Chen, M.-S., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science (LNAI), vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_25

  • DOI: https://doi.org/10.1007/3-540-47887-6_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43704-8

  • Online ISBN: 978-3-540-47887-4
