Intra-feature Random Forest Clustering

  • Conference paper
Machine Learning, Optimization, and Big Data (MOD 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10710)

Abstract

Clustering algorithms are commonly used to find structure in data without being told explicitly what to look for. One key desideratum of a clustering algorithm is that the clusters it identifies from some set of features generalize well to features that have not been measured. Yeung et al. (2001) introduce a Figure of Merit closely aligned with this desideratum, which they use to evaluate clustering algorithms. Broadly, the Figure of Merit measures the within-cluster variance of features of the data that were not available to the clustering algorithm. Using this metric, Yeung et al. found no clustering algorithm that reliably outperformed k-means on a suite of real-world datasets (Yeung et al. 2001). This paper presents a novel clustering algorithm, intra-feature random forest clustering (IRFC), that does outperform k-means on a variety of real-world datasets according to this metric. IRFC begins by training an ensemble of decision trees of limited depth to predict randomly selected features given the remaining features. It then aggregates the partitions implied by these trees and outputs the number of clusters specified by an input parameter.
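
The Figure of Merit described above can be illustrated with a minimal sketch. It assumes the leave-one-feature-out form usually attributed to Yeung et al. (2001): each feature is hidden in turn, the data are clustered on the remaining features, and the root-mean-square deviation of the hidden feature from its cluster means is accumulated. The function names, the `cluster_fn` callback, and the choice to sum over features without further normalization are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def figure_of_merit(held_out, labels):
    """RMS deviation of one held-out feature from its cluster means.

    held_out: one value per sample for a feature the clustering did not see.
    labels:   one cluster label per sample.
    """
    if not held_out or len(held_out) != len(labels):
        raise ValueError("held_out and labels must be non-empty and equal in length")

    # Group the held-out values by the cluster each sample was assigned to.
    groups = defaultdict(list)
    for value, label in zip(held_out, labels):
        groups[label].append(value)

    # Sum squared deviations from each cluster's own mean.
    squared_error = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        squared_error += sum((v - mean) ** 2 for v in values)

    return sqrt(squared_error / len(held_out))


def aggregate_figure_of_merit(rows, cluster_fn):
    """Hide each feature in turn, cluster on the rest, and sum the scores.

    rows:       list of samples, each a list of feature values.
    cluster_fn: hypothetical callback mapping a list of samples to a list
                of cluster labels (for example, a k-means wrapper).
    """
    total = 0.0
    for e in range(len(rows[0])):
        visible = [row[:e] + row[e + 1:] for row in rows]  # feature e removed
        held_out = [row[e] for row in rows]
        total += figure_of_merit(held_out, cluster_fn(visible))
    return total


if __name__ == "__main__":
    data = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 5.0]]

    # Toy stand-in for a clustering algorithm: threshold the first visible feature.
    def threshold_clusterer(samples):
        return [0 if s[0] < 2.5 else 1 for s in samples]

    print(aggregate_figure_of_merit(data, threshold_clusterer))
```

A lower score means the clusters predict the unseen feature better, which is the sense in which the abstract compares IRFC with k-means.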

References

  • Albaum, S.P., Hahne, H., Otto, A., Haußmann, U., Becher, D., Poetsch, A., Goesmann, A., Nattkemper, T.W.: A guide through the computational analysis of isotope-labeled mass spectrometry-based quantitative proteomics data: an application study. Proteome Sci. 9(1), 1 (2011)

  • Becker, R.A., Caceres, R., Hanson, K., Loh, J.M., Urbanek, S., Varshavsky, A., Volinsky, C.: A tale of one city: using cellular network data for urban planning. IEEE Pervasive Comput. 10(4), 18–26 (2011)

  • Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustered data. In: Pacific Symposium on Biocomputing, vol. 7, pp. 6–17, December 2001

  • Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  • Breiman, L.: Random Forests Manual v4.0. Technical report, UC Berkeley (2003). ftp://ftp.stat.berkeley.edu/pub/users/breiman/Using_random_forestsv4.0.pdf

  • Chicco, G., Napoli, R., Piglione, F.: Application of clustering algorithms and self organising maps to classify electricity customers. In: Power Tech Conference Proceedings, 2003 IEEE Bologna, vol. 1, 7 pp. IEEE, June 2003

  • Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 3(7), 1 (2002)

  • Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, no. 34, pp. 226–231, August 1996

  • Harrigan, K.R.: An application of clustering for strategic group analysis. Strateg. Manag. J. 6(1), 55–73 (1985)

  • Hilas, C.S., Mastorocostas, P.A.: An application of supervised and unsupervised learning approaches to telecommunications fraud detection. Knowl. Based Syst. 21(7), 721–726 (2008)

  • Iliadis, L.S.: A decision support system applying an integrated fuzzy model for long-term forest fire risk estimation. Environ. Model Softw. 20(5), 613–621 (2005)

  • Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. Wiley, Hoboken (2009)

  • Krzanowski, W.J., Lai, Y.T.: A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44, 23–34 (1988)

  • Li, A., Walling, J., Ahn, S., Kotliarov, Y., Su, Q., Quezado, M., Oberholtzer, J.C., Park, J., Zenklusen, J.C., Fine, H.A.: Unsupervised analysis of transcriptomic profiles reveals six glioma subtypes. Cancer Res. 69(5), 2091–2099 (2009)

  • Masulli, F., Schenone, A.: A fuzzy clustering based segmentation system as support to diagnosis in medical imaging. Artif. Intell. Med. 16(2), 129–147 (1999)

  • Monti, S., Tamayo, P., Mesirov, J., Golub, T.: Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 52(1–2), 91–118 (2003)

  • Park, B.: Hybrid neuro-fuzzy application in short-term freeway traffic volume forecasting. Transp. Res. Rec. J. Transp. Res. Board 1802, 190–196 (2002)

  • Pavlidis, N.G., Tasoulis, D.K., Vrahatis, M.N.: Financial forecasting through unsupervised clustering and evolutionary trained neural networks. In: The 2003 Congress on Evolutionary Computation (CEC 2003), vol. 4, pp. 2314–2321. IEEE, December 2003

  • Pham, D.T.: Applications of unsupervised clustering algorithms to aircraft identification using high range resolution radar. In: Proceedings of the IEEE 1998 National Aerospace and Electronics Conference (NAECON 1998), pp. 228–235. IEEE, July 1998

  • Singh, C., Kim, Y.: An efficient technique for reliability analysis of power systems including time dependent sources. IEEE Trans. Power Syst. 3(3), 1090–1096 (1988)

  • Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63(2), 411–423 (2001)

  • Vega-Pons, S., Ruiz-Shulcloper, J.: A survey of clustering ensemble algorithms. Int. J. Pattern Recognit. Artif. Intell. 25(03), 337–372 (2011)

  • Wang, C.H.: Apply robust segmentation to the service industry using kernel induced fuzzy clustering techniques. Expert Syst. Appl. 37(12), 8395–8400 (2010)

  • Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)

  • Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17(4), 309–318 (2001)

Author information

Corresponding author

Correspondence to Michael Cohen.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Cohen, M. (2018). Intra-feature Random Forest Clustering. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds) Machine Learning, Optimization, and Big Data. MOD 2017. Lecture Notes in Computer Science, vol 10710. Springer, Cham. https://doi.org/10.1007/978-3-319-72926-8_4

  • DOI: https://doi.org/10.1007/978-3-319-72926-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72925-1

  • Online ISBN: 978-3-319-72926-8

  • eBook Packages: Computer Science (R0)
