Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion

  • Conference paper
Classification, Clustering, and Data Analysis

Abstract

In this paper we present a linearization of the Lerman clustering index for determining the number of clusters in a data set. Our goal was to apply the linearized index to large data sets containing both numerical and categorical values. The initial index, which was based on the set of pairs of objects, had a complexity of O(n²). In this work its complexity is reduced to O(n), so that it can be applied to the large data sets frequently encountered in Data Mining applications. The clustering algorithm used is an extension of the k-means algorithm to domains with mixed numerical and categorical values (Huang (1998)). The quality of the index is empirically evaluated on some data sets, both artificial and real.
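To make the clustering step concrete, here is a minimal sketch of the kind of mixed dissimilarity used by Huang's (1998) k-means extension: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical ones. The function names, the `MixedObject` type and the weight `gamma` are assumptions chosen for illustration, not the authors' code, and the linearized Lerman index itself is not shown because the abstract does not specify it.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: one object = (numeric attributes, categorical attributes).
MixedObject = Tuple[Sequence[float], Sequence[str]]


def mixed_dissimilarity(obj: MixedObject, proto: MixedObject, gamma: float = 1.0) -> float:
    """Squared Euclidean distance on the numeric part plus gamma times the
    number of categorical mismatches. gamma is an assumed weighting parameter."""
    obj_num, obj_cat = obj
    proto_num, proto_cat = proto
    numeric_part = sum((a - b) ** 2 for a, b in zip(obj_num, proto_num))
    categorical_part = sum(1 for a, b in zip(obj_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def assign_to_prototypes(
    objects: Sequence[MixedObject],
    prototypes: Sequence[MixedObject],
    gamma: float = 1.0,
) -> List[int]:
    """Assign each object to its nearest prototype.
    One pass over the objects: cost grows linearly in their number."""
    return [
        min(
            range(len(prototypes)),
            key=lambda j: mixed_dissimilarity(obj, prototypes[j], gamma),
        )
        for obj in objects
    ]


# Toy data, made up for this example.
objects = [([1.0, 2.0], ["a", "b"]), ([9.0, 9.0], ["c", "c"])]
prototypes = [([0.0, 2.0], ["a", "c"]), ([10.0, 9.0], ["c", "c"])]
labels = assign_to_prototypes(objects, prototypes, gamma=0.5)
```

The assignment step visits each object once and compares it with a fixed number of prototypes, so it scales linearly with the number of objects, whereas any computation over all pairs of objects grows quadratically.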


References

  • HUANG, Z. (1998): Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery 2, 283–304.

  • LERMAN, I.C. (1973): Étude distributionnelle de statistiques de proximité entre structures finies de même type; application à la classification automatique. Cahiers du Bureau Universitaire de Recherche Opérationnelle, 19, Paris.

  • LERMAN, I.C. (1981): Classification et Analyse Ordinale des Données. Dunod, Paris.

  • LERMAN, I.C. (1983): Sur la signification des Classes Issues d’une Classification Automatique de Données. In: J. Felsenstein (Ed.): NATO ASI Series, Vol. G1, Numerical Taxonomy. Springer-Verlag.

  • MILLIGAN, G.W. and COOPER, M.C. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.

  • MOLLIÈRE, J.L. (1986): What’s the real number of clusters. In: W. Gaul and M. Schader (Eds.): Classification as a Tool of Research. North Holland.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lerman, I., da Costa, J.P., Silva, H. (2002). Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion. In: Jajuga, K., Sokołowski, A., Bock, HH. (eds) Classification, Clustering, and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-56181-8_16

  • DOI: https://doi.org/10.1007/978-3-642-56181-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43691-1

  • Online ISBN: 978-3-642-56181-8

  • eBook Packages: Springer Book Archive
