Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion

  • Israel Lerman
  • Joaquim Pinto da Costa
  • Helena Silva
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


In this paper we present a linearization of the Lerman clustering index for determining the number of clusters in a data set. Our goal was to apply the linearized index to large data sets containing both numerical and categorical values. The initial index, which was based on the set of pairs of objects, had complexity O(n^2). In this work its complexity is reduced to O(n), so it can be applied to the large data sets frequently encountered in Data Mining applications. The clustering algorithm used is an extension of the k-means algorithm to domains with mixed numerical and categorical values (Huang (1998)). The quality of the index is empirically evaluated on several data sets, both artificial and real.
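As an illustration of the kind of mixed-attribute k-means extension the abstract refers to, the following is a minimal sketch of a combined dissimilarity in the spirit of Huang (1998): squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical ones. The function names, the weight `gamma`, and the data layout are assumptions made for this example; they are not taken from the paper, and the sketch does not implement the linearized Lerman index itself.

```python
from typing import List, Sequence, Tuple

# An object (or a prototype) is modelled as (numerical values, categorical values).
Mixed = Tuple[Sequence[float], Sequence[str]]


def mixed_dissimilarity(obj: Mixed, proto: Mixed, gamma: float = 1.0) -> float:
    """Squared Euclidean distance on numerical attributes plus
    gamma times the number of categorical mismatches (assumed form)."""
    obj_num, obj_cat = obj
    proto_num, proto_cat = proto
    numeric_part = sum((a - b) ** 2 for a, b in zip(obj_num, proto_num))
    categorical_part = sum(1 for a, b in zip(obj_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def assign_to_prototypes(
    objects: List[Mixed], prototypes: List[Mixed], gamma: float = 1.0
) -> List[int]:
    """Return, for each object, the index of its nearest prototype.
    One pass over the data: cost grows linearly with the number of objects."""
    return [
        min(
            range(len(prototypes)),
            key=lambda k: mixed_dissimilarity(obj, prototypes[k], gamma),
        )
        for obj in objects
    ]
```

Each object is compared only with the k prototypes, never with the other objects, so an assignment pass avoids the pairwise O(n^2) cost mentioned above.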


Numerical Attribute; Categorical Attribute; Cluster Index; Data Mining Application; Initial Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



  1. HUANG, Z. (1998): Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery 2, 283–304.
  2. LERMAN, I.C. (1973): Étude distributionnelle de statistiques de proximité entre structures finies de même type; application à la classification automatique. Cahiers du Bureau Universitaire de Recherche Opérationnelle, 19, Paris.
  3. LERMAN, I.C. (1981): Classification et Analyse Ordinale des Données. Dunod, Paris.
  4. LERMAN, I.C. (1983): Sur la signification des Classes Issues d’une Classification Automatique de Données. In: J. Felsenstein (Ed.): NATO ASI Series, Vol. G1, Numerical Taxonomy. Springer-Verlag.
  5. MILLIGAN, G.W. and COOPER, M.C. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
  6. MOLLIÈRE, J.L. (1986): What’s the real number of clusters? In: W. Gaul and M. Schader (Eds.): Classification as a Tool of Research. North Holland.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Israel Lerman (1)
  • Joaquim Pinto da Costa (2)
  • Helena Silva (3)
  1. IRISA, University of Rennes I, France
  2. DMA/FCUP, Faculdade de Ciências, Universidade do Porto, Portugal
  3. ISEP, Universidade do Porto, Portugal
