Intra-feature Random Forest Clustering

Cohen, Michael

doi:10.1007/978-3-319-72926-8_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10710)

Included in the following conference series:

International Workshop on Machine Learning, Optimization, and Big Data

Abstract

Clustering algorithms are commonly used to find structure in data without being told explicitly what to look for. One key desideratum of a clustering algorithm is that the clusters it identifies from a given set of features generalize well to features that have not been measured. Yeung et al. (2001) introduce a Figure of Merit closely aligned with this desideratum, which they use to evaluate clustering algorithms. Broadly, the Figure of Merit measures the within-cluster variance of features of the data that were not available to the clustering algorithm. Using this metric, Yeung et al. (2001) found no clustering algorithm that reliably outperformed k-means on a suite of real-world datasets. This paper presents a novel clustering algorithm, intra-feature random forest clustering (IRFC), that does outperform k-means on a variety of real-world datasets by this metric. IRFC begins by training an ensemble of decision trees of limited depth to predict randomly selected features from the remaining features. It then aggregates the partitions implied by these trees and outputs the number of clusters specified by an input parameter.
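To make the evaluation idea concrete, the following Python sketch scores a clustering by how tightly a held-out feature groups within each cluster. It assumes the root-mean-square form of the Figure of Merit usually attributed to Yeung et al. (2001); the abstract does not state the exact normalization, so treat that as an assumption. The names `figure_of_merit`, `aggregate_figure_of_merit`, and the `cluster_fn` callback are hypothetical and not taken from the paper. Lower values mean the clusters generalize better to the unseen feature.

```python
from math import sqrt


def figure_of_merit(held_out, labels):
    """Root-mean-square deviation of a held-out feature from its cluster means.

    held_out: values of one feature that the clustering algorithm did not see.
    labels:   cluster label assigned to each observation (same length).
    Assumed form: sqrt(within-cluster sum of squares / number of observations).
    """
    if len(held_out) != len(labels):
        raise ValueError("held_out and labels must have the same length")
    if not held_out:
        raise ValueError("at least one observation is required")

    # Group the held-out values by cluster label.
    clusters = {}
    for value, label in zip(held_out, labels):
        clusters.setdefault(label, []).append(value)

    # Accumulate squared deviations from each cluster's own mean.
    within_ss = 0.0
    for values in clusters.values():
        mean = sum(values) / len(values)
        within_ss += sum((v - mean) ** 2 for v in values)

    return sqrt(within_ss / len(held_out))


def aggregate_figure_of_merit(rows, cluster_fn, k):
    """Leave each feature out in turn, cluster on the rest, and sum the scores.

    rows:       list of observations, each a list of numeric feature values.
    cluster_fn: hypothetical callback (visible_rows, k) -> list of labels;
                any clustering algorithm can be plugged in here.
    """
    n_features = len(rows[0])
    total = 0.0
    for e in range(n_features):
        # Hide feature e from the clustering algorithm.
        visible = [[x for j, x in enumerate(row) if j != e] for row in rows]
        labels = cluster_fn(visible, k)
        total += figure_of_merit([row[e] for row in rows], labels)
    return total


if __name__ == "__main__":
    held_out = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    matching = [0, 0, 0, 1, 1, 1]    # clusters that track the hidden feature
    unrelated = [0, 1, 0, 1, 0, 1]   # clusters that ignore it
    print(figure_of_merit(held_out, matching))
    print(figure_of_merit(held_out, unrelated))
```

A clustering whose groups line up with the hidden feature scores lower than one that does not, which is the sense in which the metric rewards clusters that generalize to unmeasured features.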

References

Albaum, S.P., Hahne, H., Otto, A., Haußmann, U., Becher, D., Poetsch, A., Goesmann, A., Nattkemper, T.W.: A guide through the computational analysis of isotope-labeled mass spectrometry-based quantitative proteomics data: an application study. Proteome Sci. 9(1), 1 (2011)
Becker, R.A., Caceres, R., Hanson, K., Loh, J.M., Urbanek, S., Varshavsky, A., Volinsky, C.: A tale of one city: using cellular network data for urban planning. IEEE Pervasive Comput. 10(4), 18–26 (2011)
Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustered data. In: Pacific Symposium on Biocomputing, vol. 7, pp. 6–17, December 2001
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Breiman, L.: Random Forests Manual v4.0. Technical report, UC Berkeley (2003). ftp://ftp.stat.berkeley.edu/pub/users/breiman/Using_random_forestsv4.0.pdf
Chicco, G., Napoli, R., Piglione, F.: Application of clustering algorithms and self organising maps to classify electricity customers. In: Power Tech Conference Proceedings, 2003 IEEE Bologna, vol. 1, 7 pp. IEEE, June 2003
Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 3(7), 1 (2002)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, no. 34, pp. 226–231, August 1996
Harrigan, K.R.: An application of clustering for strategic group analysis. Strateg. Manag. J. 6(1), 55–73 (1985)
Hilas, C.S., Mastorocostas, P.A.: An application of supervised and unsupervised learning approaches to telecommunications fraud detection. Knowl. Based Syst. 21(7), 721–726 (2008)
Iliadis, L.S.: A decision support system applying an integrated fuzzy model for long-term forest fire risk estimation. Environ. Model Softw. 20(5), 613–621 (2005)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)
Krzanowski, W.J., Lai, Y.T.: A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44, 23–34 (1988)
Li, A., Walling, J., Ahn, S., Kotliarov, Y., Su, Q., Quezado, M., Oberholtzer, J.C., Park, J., Zenklusen, J.C., Fine, H.A.: Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 69(5), 2091–2099 (2009)
Masulli, F., Schenone, A.: A fuzzy clustering based segmentation system as support to diagnosis in medical imaging. Artif. Intell. Med. 16(2), 129–147 (1999)
Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 52(1–2), 91–118 (2003)
Park, B.: Hybrid neuro-fuzzy application in short-term freeway traffic volume forecasting. Transp. Res. Rec. J. Transp. Res. Board 1802, 190–196 (2002)
Pavlidis, N.G., Tasoulis, D.K., Vrahatis, M.N.: Financial forecasting through unsupervised clustering and evolutionary trained neural networks. In: The 2003 Congress on Evolutionary Computation (CEC 2003), vol. 4, pp. 2314–2321. IEEE, December 2003
Pham, D.T.: Applications of unsupervised clustering algorithms to aircraft identification using high range resolution radar. In: Proceedings of the IEEE 1998 National Aerospace and Electronics Conference (NAECON 1998), pp. 228–235. IEEE, July 1998
Singh, C., Kim, Y.: An efficient technique for reliability analysis of power systems including time dependent sources. IEEE Trans. Power Syst. 3(3), 1090–1096 (1988)
Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63(2), 411–423 (2001)
Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. Int. J. Pattern Recognit. Artif. Intell. 25(03), 337–372 (2011)
Wang, C.H.: Apply robust segmentation to the service industry using kernel induced fuzzy clustering techniques. Expert Syst. Appl. 37(12), 8395–8400 (2010)
Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)
Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17(4), 309–318 (2001)

Author information

Authors and Affiliations

Galvanize, San Francisco, CA, USA
Michael Cohen

Corresponding author

Correspondence to Michael Cohen.

Editor information

Editors and Affiliations

University of Catania, Catania, Italy
Giuseppe Nicosia
University of Florida, Gainesville, FL, USA
Panos Pardalos
University of Catania, Catania, Italy
Giovanni Giuffrida
Harvard University, Cambridge, MA, USA
Renato Umeton

About this paper

Cite this paper

Cohen, M. (2018). Intra-feature Random Forest Clustering. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_4

DOI: https://doi.org/10.1007/978-3-319-72926-8_4
Published: 21 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72925-1
Online ISBN: 978-3-319-72926-8
eBook Packages: Computer Science, Computer Science (R0)
