
A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data

  • Conference paper
Data Science and Classification

Abstract

Symbolic Data Analysis (SDA) aims to describe and analyze complex and structured data extracted, for example, from large databases. Such data, which can be expressed as concepts, are modeled by symbolic objects described by multivalued variables. In this paper we present a new distance, based on the Wasserstein metric, for clustering a set of data described by distributions with finite continuous support, called “histograms” in SDA. The proposed distance allows us to define a measure of inertia of the data with respect to a barycenter that satisfies the Huygens theorem of decomposition of inertia. We propose to use this measure for an agglomerative hierarchical clustering of histogram data based on the Ward criterion. An application to real data validates the procedure.
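
The idea in the abstract can be illustrated with a minimal sketch. It assumes the plain L2 Wasserstein distance in its quantile-function form, with mass spread uniformly inside each bin and strictly positive bin weights; the exact distance proposed in the paper is not reproduced here. All function names (`quantile_function`, `wasserstein2_squared`, `huygens_terms`) and the sample histograms are hypothetical, chosen only for illustration. The barycenter is taken to be the histogram whose quantile function is the mean of the members' quantile functions, which is what makes the inertia decompose as total = within + between.

```python
import numpy as np

def quantile_function(edges, weights, t):
    """Quantile function of a histogram evaluated at levels t.

    Assumes mass is uniform inside each bin and all weights are > 0,
    so the quantile function is piecewise linear in t.
    """
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(weights) / weights.sum()))
    return np.interp(t, cdf, edges)

def quantile_matrix(histograms, n_grid=1000):
    """One row per histogram: its quantile function on a midpoint grid."""
    t = (np.arange(n_grid) + 0.5) / n_grid
    return np.vstack([quantile_function(e, w, t) for e, w in histograms])

def wasserstein2_squared(h1, h2, n_grid=1000):
    """Approximate squared L2 Wasserstein distance between two histograms."""
    q = quantile_matrix([h1, h2], n_grid)
    return float(np.mean((q[0] - q[1]) ** 2))

def inertia(q, center):
    """Sum of squared distances from each row of q to a center."""
    return float(np.sum(np.mean((q - center) ** 2, axis=1)))

def huygens_terms(q, labels):
    """Return (total, within, between) inertia for a given partition."""
    labels = np.asarray(labels)
    g = q.mean(axis=0)                      # global barycenter
    total = inertia(q, g)
    within, between = 0.0, 0.0
    for k in np.unique(labels):
        qk = q[labels == k]
        gk = qk.mean(axis=0)                # cluster barycenter
        within += inertia(qk, gk)
        between += len(qk) * float(np.mean((gk - g) ** 2))
    return total, within, between

# Hypothetical data: each histogram is (bin edges, bin weights).
histograms = [
    ([0.0, 1.0, 2.0], [0.5, 0.5]),
    ([0.5, 1.5, 3.0], [0.3, 0.7]),
    ([4.0, 5.0, 6.0], [0.6, 0.4]),
    ([4.5, 5.5, 7.0], [0.2, 0.8]),
]
q = quantile_matrix(histograms)
total, within, between = huygens_terms(q, labels=[0, 0, 1, 1])
print(total, within + between)
```

Under these assumptions, the Ward criterion merges at each step the two clusters whose fusion increases the within-cluster inertia the least; for clusters of sizes n1 and n2 with barycenters g1 and g2, that increase is n1·n2/(n1+n2) times the squared distance between g1 and g2.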




Copyright information

© 2006 Springer-Verlag Berlin · Heidelberg

About this paper

Cite this paper

Irpino, A., Verde, R. (2006). A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data. In: Batagelj, V., Bock, HH., Ferligoj, A., Žiberna, A. (eds) Data Science and Classification. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-34416-0_20
