
Computational Statistics, Volume 34, Issue 1, pp 301–321

Assessing variable importance in clustering: a new method based on unsupervised binary decision trees

  • Ghattas Badih
  • Michel Pierre
  • Boyer Laurent
Original Paper

Abstract

We consider different approaches for assessing variable importance in clustering. We focus on clustering using binary decision trees (CUBT), which is a non-parametric top-down hierarchical clustering method designed for both continuous and nominal data. We suggest a measure of variable importance for this method similar to the one used in Breiman’s classification and regression trees. This score is useful to rank the variables in a dataset, to determine which variables are the most important or to detect the irrelevant ones. We analyze both stability and efficiency of this score on different data simulation models in the presence of noise, and compare it to other classical variable importance measures. Our experiments show that variable importance based on CUBT is much more efficient than other approaches in a large variety of situations.
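The idea of a CART-like importance score for a clustering tree can be illustrated with a small sketch. This is not the authors' CUBT implementation: it only grows a shallow binary tree that splits on the variable and threshold giving the largest reduction in within-node deviance (taken here as the sum of squared deviations from the node mean over all variables), and credits each reduction to the splitting variable. Surrogate splits, pruning and joining are left out, and the names `deviance`, `best_split`, `importance`, the parameters `max_depth` and `min_size`, and the simulated data are all assumptions made for illustration.

```python
import random


def deviance(rows):
    """Within-node deviance: squared deviations from the node mean, summed over variables."""
    if not rows:
        return 0.0
    n, p = len(rows), len(rows[0])
    total = 0.0
    for j in range(p):
        mean_j = sum(r[j] for r in rows) / n
        total += sum((r[j] - mean_j) ** 2 for r in rows)
    return total


def best_split(rows, min_size):
    """Return (gain, variable, threshold) for the split with the largest deviance reduction."""
    parent = deviance(rows)
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for r in rows if r[j] <= t]
            right = [r for r in rows if r[j] > t]
            if len(left) < min_size or len(right) < min_size:
                continue
            gain = parent - deviance(left) - deviance(right)
            if gain > best[0]:
                best = (gain, j, t)
    return best


def importance(rows, max_depth=2, min_size=5):
    """Sum the deviance reductions per splitting variable, normalised to sum to 1."""
    scores = [0.0] * len(rows[0])

    def grow(node, depth):
        if depth == max_depth or len(node) < 2 * min_size:
            return
        gain, j, t = best_split(node, min_size)
        if j is None:
            return
        scores[j] += gain  # credit the reduction to the splitting variable
        grow([r for r in node if r[j] <= t], depth + 1)
        grow([r for r in node if r[j] > t], depth + 1)

    grow(rows, 0)
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


# Simulated data (illustrative): two groups separated on variable 0,
# while variables 1 and 2 are pure noise.
rng = random.Random(0)
data = [
    [rng.gauss(0, 1) + (6 if i % 2 else 0), rng.gauss(0, 1), rng.gauss(0, 1)]
    for i in range(60)
]
scores = importance(data)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
print("importance:", [round(s, 3) for s in scores])
print("ranking:", ranking)
```

In this toy setting the variable that carries the group structure should dominate the score, while the noise variables receive only the small reductions obtained deeper in the tree.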

Keywords

Unsupervised learning · CUBT · Deviance · Variable importance · Variables ranking

Notes

Acknowledgements

We thank Claude Deniau and Pascal Auquier for their valuable comments. This work was partially supported by the Project ECOS SUD U14E02.

References

  1. Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv Neural Inf Process Syst 14:585–591
  2. Bock RD (1972) Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika 37:29–51
  3. Breiman L (1996) Heuristics of instability and stabilization in model selection. Ann Stat 24:6
  4. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
  5. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth and Brooks, London
  6. Chen X, Xu X, Huang JZ, Ye Y (2013) Tw-\(k\)-means: automated two-level variable weighting clustering algorithm for multiview data. IEEE Trans Knowl Data Eng 25(4):932–944
  7. Fisher R (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
  8. Fraiman R, Ghattas B, Svarc M (2013) Interpretable clustering using unsupervised binary trees. Adv Data Anal Classif 7:125–145
  9. Ghattas B (1999) Importance des variables dans les méthodes cart. Modulad 24:29–39
  10. Ghattas B, Michel P, Boyer L (2017) Clustering nominal data using unsupervised binary decision trees: comparisons with the state of the art methods. Pattern Recognit 67:177–185
  11. Guyon I, Weston J, Barnhill S, Vapnik VN (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422
  12. Liaw A, Wiener M (2002) Classification and regression by randomforest. R News 2(3):12–22
  13. Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17:491–502
  14. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Neyman J, Le Cam LM (eds) Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
  15. Muraki E (1992) A generalized partial credit model: application of an em algorithm. Appl Psychol Measur 16:159–176
  16. R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
  17. Rakotomamonjy A (2003) Variable selection using SVM-based criteria. J Mach Learn Res 3:1357–1370
  18. Reif M (2014) mcIRT: IRT models for multiple choice items. Technical report, R package version 0.41
  19. Rizopoulos D (2006) ltm: an R package for latent variable modelling and item response theory analyses. J Stat Softw 17(5):1–25
  20. Weston J, Elisseeff A, Schoelkopf B, Tipping M (2003) Use of the zero norm with linear models and kernel methods. J Mach Learn Res 3:1439–1461
  21. Williams G, Huang JZ, Chen X, Wang Q, Xiao L (2015) wskm: weighted k-means clustering. Technical report, R package version 1.4.28
  22. Zhu L, Miao L, Zhang D (2012) Iterative Laplacian score for feature selection. Pattern Recognit 321:80–87

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. I2M UMR 7373, Aix Marseille Université, CNRS, Centrale Marseille, Marseille, France
  2. SPMC EA3279, Aix Marseille Université, Marseille, France
