Assessing variable importance in clustering: a new method based on unsupervised binary decision trees
We consider different approaches for assessing variable importance in clustering. We focus on clustering using binary decision trees (CUBT), a non-parametric top-down hierarchical clustering method designed for both continuous and nominal data. We suggest a measure of variable importance for this method similar to the one used in Breiman's classification and regression trees. This score can be used to rank the variables in a dataset, to determine which variables are most important, and to detect irrelevant ones. We analyze both the stability and the efficiency of this score on different data simulation models in the presence of noise, and compare it to other classical variable importance measures. Our experiments show that variable importance based on CUBT is considerably more efficient than other approaches across a large variety of situations.
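The abstract describes a score modeled on CART's impurity-based variable importance: each variable accumulates the sample-weighted impurity (deviance) decrease of the nodes that split on it. CUBT itself has no standard library implementation, so the following is only a minimal sketch of that CART-style computation; the node structure and numeric values are illustrative assumptions, not the authors' method.

```python
# Hedged sketch of CART-style variable importance, the supervised analogue
# of the CUBT score described above. Tree nodes and values are illustrative.

def variable_importance(nodes, n_total):
    """Sum, per splitting variable, the sample-weighted impurity decrease
    achieved at each internal node; normalize so the scores sum to 1."""
    scores = {}
    for node in nodes:
        n, imp = node["n"], node["impurity"]
        nl, il = node["n_left"], node["imp_left"]
        nr, ir = node["n_right"], node["imp_right"]
        # Weighted decrease: parent impurity minus the mixture of children.
        decrease = (n / n_total) * (imp - (nl / n) * il - (nr / n) * ir)
        scores[node["feature"]] = scores.get(node["feature"], 0.0) + decrease
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()}

# Toy two-split tree: the root splits on X1, its left child on X2.
nodes = [
    {"feature": "X1", "n": 100, "impurity": 0.50,
     "n_left": 60, "imp_left": 0.30, "n_right": 40, "imp_right": 0.10},
    {"feature": "X2", "n": 60, "impurity": 0.30,
     "n_left": 30, "imp_left": 0.05, "n_right": 30, "imp_right": 0.05},
]
imp = variable_importance(nodes, n_total=100)
ranking = sorted(imp, key=imp.get, reverse=True)  # most important first
```

Ranking variables by this normalized score is exactly how such measures are used in practice: variables with a near-zero share are candidates for removal as irrelevant.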
Keywords: Unsupervised learning · CUBT · Deviance · Variable importance · Variable ranking
We thank Claude Deniau and Pascal Auquier for their valuable comments. This work was partially supported by the Project ECOS SUD U14E02.