Abstract
We study how random projections can be used with large data sets (1) to cluster the data using a fast binning approach, which directly induces a hierarchy through use of the Baire metric; and (2) to select, based on the clusters found, subsets of the original data for further analysis. In this work, we focus on random projection as used for processing high-dimensional data. A random projection, which outputs a random permutation of the observation set, provides a random spanning path. We show how a spanning path relates to contiguity- or adjacency-constrained clustering. We study performance properties of hierarchical clustering constructed from random spanning paths, and we introduce a novel visualization of the results.
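To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single Gaussian random axis, a rescaling of the projected values to [0, 1), and binning by shared leading decimal digits as a stand-in for the Baire (longest common prefix) hierarchy. The function names `random_projection_order` and `baire_hierarchy` are hypothetical and introduced only for illustration.

```python
import random


def random_projection_order(points, seed=0):
    """Project points onto one random axis (assumed Gaussian) and sort them.

    Returns the permutation of observation indices (the spanning path)
    together with the projected values.
    """
    rng = random.Random(seed)
    axis = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    proj = [sum(a * x for a, x in zip(axis, p)) for p in points]
    order = sorted(range(len(points)), key=lambda i: proj[i])
    return order, proj


def baire_hierarchy(values, max_digits=3):
    """Bin values by their first 1..max_digits decimal digits after rescaling.

    Each level is a dict mapping a digit prefix (as an integer) to the list
    of observation indices sharing that prefix. Bins at a finer level are
    nested inside bins at the coarser level, which gives a hierarchy.
    """
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    top = 10 ** max_digits
    # Integer code holding the first max_digits digits of each rescaled value;
    # the maximum value is clamped into the last bin.
    codes = [min(int((v - lo) / span * top), top - 1) for v in values]
    levels = []
    for d in range(1, max_digits + 1):
        bins = {}
        for i, c in enumerate(codes):
            bins.setdefault(c // 10 ** (max_digits - d), []).append(i)
        levels.append(bins)
    return levels


# Small illustrative data set (made-up values, not from the chapter).
points = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7], [0.5, 0.5, 0.5], [0.11, 0.21, 0.31]]
order, proj = random_projection_order(points, seed=1)
levels = baire_hierarchy(proj, max_digits=3)
print("spanning path (observation indices):", order)
for depth, bins in enumerate(levels, start=1):
    print("prefix length", depth, "->", bins)
```

In this sketch the binning is a single pass over the projected values, with no pairwise distance computations; consecutive observations along the sorted path tend to fall in the same bins, which is the sense in which the path imposes a contiguity constraint.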
Acknowledgements
We are grateful to Paul Morris for initial discussions related to this work.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Murtagh, F., Contreras, P. (2016). Linear Storage and Potentially Constant Time Hierarchical Clustering Using the Baire Metric and Random Spanning Paths. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_4
DOI: https://doi.org/10.1007/978-3-319-25226-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics (R0)