Linear Storage and Potentially Constant Time Hierarchical Clustering Using the Baire Metric and Random Spanning Paths

  • Conference paper
Analysis of Large and Complex Data

Abstract

We study how random projections can be used with large data sets in order (1) to cluster the data using a fast binning approach, characterized as the direct induction of a hierarchy through the Baire metric; and (2) to select, based on the clusters found, subsets of the original data for further analysis. In this work, we focus on random projection as used for processing high-dimensional data. A random projection, which outputs a random permutation of the observation set, provides a random spanning path. We show how a spanning path relates to contiguity- or adjacency-constrained clustering. We study performance properties of hierarchical clustering constructed from random spanning paths, and we introduce a novel visualization of the results.
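
The ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that are not stated on this page: the projection direction is drawn from a uniform distribution, projected values are rescaled to [0, 1], the hierarchy is read from decimal digits, and the Baire distance is taken as 2^-k, where k is the length of the longest common digit prefix. All function names are hypothetical.

```python
import numpy as np


def project_to_unit_interval(X, rng):
    """Project each row of X onto one random direction, rescaled to [0, 1]."""
    direction = rng.uniform(size=X.shape[1])  # assumption: uniform random weights
    p = X @ direction
    span = p.max() - p.min()
    if span == 0.0:
        return np.zeros(len(p))
    return (p - p.min()) / span


def digit_codes(p, depth):
    """Integer made of the first `depth` decimal digits of each value in p."""
    scale = 10 ** depth
    # A value of exactly 1.0 is placed in the last bin.
    return np.minimum((p * scale).astype(np.int64), scale - 1)


def baire_levels(p, depth):
    """Cluster labels for prefix lengths 1..depth; each level refines the last."""
    codes = digit_codes(p, depth)
    return [codes // 10 ** (depth - d) for d in range(1, depth + 1)]


def baire_distance(code_x, code_y, depth):
    """2**-k, where k is the number of leading digits the two codes share.

    The base 2 is an assumed convention; codes identical to `depth` digits
    are indistinguishable at this precision and get 2**-depth.
    """
    k = 0
    for d in range(1, depth + 1):
        divisor = 10 ** (depth - d)
        if code_x // divisor != code_y // divisor:
            break
        k = d
    return 2.0 ** -k


rng = np.random.default_rng(0)
X = rng.random((1000, 50))            # synthetic stand-in for real data
p = project_to_unit_interval(X, rng)
path = np.argsort(p, kind="stable")   # a permutation of the observations
levels = baire_levels(p, depth=3)     # labels at 1, 2 and 3 shared digits
```

Here `path` plays the role of the random spanning path: consecutive entries are neighbours in the projection. Because the digit codes are monotone in the projected value, every cluster at every level is a contiguous run along `path`, which is how binning of this kind connects to contiguity-constrained clustering. Each level is computed in one pass over the data, with no pairwise distances. As a check on the distance, `baire_distance(1234, 1278, 4)` returns 0.25, since the two codes share the two leading digits 1 and 2.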

Acknowledgements

We are grateful to Paul Morris for initial discussions related to this work.

Author information

Corresponding author

Correspondence to Fionn Murtagh.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Murtagh, F., Contreras, P. (2016). Linear Storage and Potentially Constant Time Hierarchical Clustering Using the Baire Metric and Random Spanning Paths. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_4
