Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion

Lerman, Israel; da Costa, Joaquim Pinto; Silva, Helena

doi:10.1007/978-3-642-56181-8_16

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization

Abstract

In this paper we present a linearization of the Lerman clustering index for determining the number of clusters in a data set. Our goal is to apply the linearized index to large data sets containing both numerical and categorical values. The original index, which is based on the set of pairs of objects, has complexity O(n²), where n is the number of objects. In this work its complexity is reduced to O(n), so that it can be applied to the large data sets frequently encountered in Data Mining applications. The clustering algorithm used is an extension of the k-means algorithm to domains with mixed numerical and categorical values (Huang (1998)). The quality of the index is evaluated empirically on several data sets, both artificial and real.
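
As an illustration of the kind of mixed-type clustering step the abstract refers to, the sketch below shows the dissimilarity commonly associated with Huang's (1998) extension of k-means: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical ones. The function names, the `gamma` default, and the data layout are assumptions made for this example; the chapter's own implementation is not shown in this preview, and the linearized index itself is not reproduced here.

```python
from typing import List, Sequence, Tuple

# A prototype is a (numerical part, categorical part) pair. Hypothetical layout.
Prototype = Tuple[Sequence[float], Sequence[str]]


def mixed_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Squared Euclidean distance on numerical attributes plus
    gamma times the number of mismatching categorical attributes."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def assign_to_nearest(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    prototypes: List[Prototype],
    gamma: float = 1.0,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity.

    Each object is compared with k prototypes only, so one assignment
    pass over n objects costs O(n * k) rather than O(n^2)."""
    costs = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(costs)), key=costs.__getitem__)


if __name__ == "__main__":
    # Two hypothetical prototypes with one numerical and one categorical attribute.
    prototypes = [([0.0], ["x"]), ([10.0], ["y"])]
    print(assign_to_nearest([9.0], ["y"], prototypes))
```

The weight `gamma` balances the two attribute types; a value of 0 reduces the measure to ordinary squared Euclidean distance on the numerical attributes.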

References

HUANG, Z. (1998): Extensions to the K-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 2, 283–304.
LERMAN, I.C. (1973): Étude distributionnelle de statistiques de proximité entre structures finies de même type; application à la classification automatique. Cahiers du Bureau Universitaire de Recherche Opérationnelle, 19, Paris.
LERMAN, I.C. (1981): Classification et Analyse Ordinale des Données. Dunod, Paris.
LERMAN, I.C. (1983): Sur la signification des Classes Issues d’une Classification Automatique de Données. In: J. Felsenstein (Ed.): Numerical Taxonomy. NATO ASI Series, Vol. G1. Springer-Verlag.
MILLIGAN, G.W. and COOPER, M.C. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
MOLLIÈRE, J.L. (1986): What’s the real number of clusters? In: W. Gaul and M. Schader (Eds.): Classification as a Tool of Research. North-Holland.

Author information

Authors and Affiliations

IRISA, University of Rennes I, France
Israel Lerman
DMA/FCUP, Faculdade de Ciências, Universidade do Porto, Portugal
Joaquim Pinto da Costa
ISEP, Universidade do Porto, Portugal
Helena Silva

Editor information

Editors and Affiliations

Wroclaw University of Economics, ul. Komandorska 118/120, 53-345, Wroclaw, Poland
Krzysztof Jajuga
Department of Statistics, Cracow University of Economics, ul. Rakowicka 27, 31-510, Cracow, Poland
Andrzej Sokołowski
Institute of Statistics, Technical University of Aachen, Wuellnerstrasse 3, 52056, Aachen, Germany
Hans-Hermann Bock

Cite this paper

Lerman, I., da Costa, J.P., Silva, H. (2002). Validation of Very Large Data Sets Clustering by Means of a Nonparametric Linear Criterion. In: Jajuga, K., Sokołowski, A., Bock, HH. (eds) Classification, Clustering, and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-56181-8_16

DOI: https://doi.org/10.1007/978-3-642-56181-8_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43691-1
Online ISBN: 978-3-642-56181-8
eBook Packages: Springer Book Archive
