Abstract
In this paper we present a linearization of the Lerman clustering index for determining the number of clusters in a data set. Our goal was to apply the linearized index to large data sets containing both numerical and categorical values. The original index, which was based on the set of pairs of objects, had complexity O(n^2). In this work its complexity is reduced to O(n), so it can be applied to the large data sets frequently encountered in Data Mining applications. The clustering algorithm used is an extension of the k-means algorithm to domains with mixed numerical and categorical values (Huang (1998)). The quality of the index is empirically evaluated on several data sets, both artificial and real.
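As a rough illustration of the clustering step only, the sketch below shows the kind of mixed dissimilarity used by the k-means extension of Huang (1998): a squared Euclidean term on the numerical attributes plus a weighted count of mismatches on the categorical attributes. The function names, the argument layout, and the default weight `gamma` are assumptions made for this example and do not come from the paper. The linearized Lerman index itself is not reproduced here.

```python
from typing import Sequence


def mixed_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,  # assumed weight balancing the two parts
) -> float:
    """Dissimilarity between an object and a cluster prototype."""
    # Numerical part: squared Euclidean distance.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: number of attributes whose values differ.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(x_num, x_cat, prototypes, gamma: float = 1.0) -> int:
    """Index of the closest prototype; prototypes is a list of (num, cat) pairs."""
    costs = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return costs.index(min(costs))


if __name__ == "__main__":
    prototypes = [([0.0, 2.0], ["a", "c"]), ([5.0, 5.0], ["b", "b"])]
    print(nearest_prototype([1.0, 2.0], ["a", "b"], prototypes, gamma=0.5))
```

Each object is compared with the k prototypes rather than with every other object, so one assignment pass costs time proportional to n times k. This is why a validity index that is also linear in n fits this setting.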
References
HUANG, Z. (1998): Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery 2, 283–304.
LERMAN, I.C. (1973): Étude distributionnelle de statistiques de proximité entre structures finies de même type; application à la classification automatique. Cahiers du Bureau Universitaire de Recherche Opérationnelle, 19, Paris.
LERMAN, I.C. (1981): Classification et Analyse Ordinale des Données. Dunod, Paris.
LERMAN, I.C. (1983): Sur la signification des Classes Issues d’une Classification Automatique de Données. In: J. Felsenstein (Ed.): NATO ASI Series, Vol. G1, Numerical Taxonomy. Springer-Verlag.
MILLIGAN, G.W. and COOPER, M.C. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
MOLLIÈRE, J.L. (1986): What’s the real number of clusters. In: W. Gaul and M. Schader (Eds.): Classification as a Tool of Research. North-Holland.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lerman, I., da Costa, J.P., Silva, H. (2002). Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion. In: Jajuga, K., Sokołowski, A., Bock, HH. (eds) Classification, Clustering, and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-56181-8_16
DOI: https://doi.org/10.1007/978-3-642-56181-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43691-1
Online ISBN: 978-3-642-56181-8
eBook Packages: Springer Book Archive