Variable Selection for Classification and Regression in Large p, Small n Problems

  • Wei-Yin Loh
Conference paper
Part of the Lecture Notes in Statistics book series (LNS, volume 205)


Classification and regression problems in which the number of predictor variables exceeds the number of observations are increasingly common as data-collection technology advances. Because some of these variables may have little or no influence on the response, methods that can identify the unimportant variables are needed. Two methods proposed for this purpose are EARTH and Random Forest (RF). This article presents an alternative method, derived from the GUIDE classification and regression tree algorithm, that uses recursive partitioning to rank the variables by importance. Simulation experiments show that the new method improves the prediction accuracy of several nonparametric regression models more than RF and EARTH do. The results also indicate that correctly identifying all the important variables is not essential in every situation; conditions under which this holds are obtained for the linear model. The article concludes with an application of the new method to identify rare molecules in a large genomic data set.
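The artificial-variable idea named in the keywords below can be sketched as follows. This is an illustrative toy, not the article's method: it substitutes a generic random-forest importance score for GUIDE's importance scores, and the simulated data, sample sizes, and threshold rule are all assumptions made here for demonstration. Permuted copies of the predictors serve as artificial variables that are, by construction, unrelated to the response; a real predictor is kept only if it scores higher than every artificial one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated "large p, small n"-style data: only the first 3 of 50
# predictors influence the response (illustrative setup only).
n, p = 60, 50
X = rng.standard_normal((n, p))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.standard_normal(n)

# Artificial variables: each column of X independently permuted,
# preserving marginal distributions but destroying any link to y.
X_art = rng.permuted(X, axis=0)
X_aug = np.hstack([X, X_art])

# Score all columns with a tree-ensemble importance measure
# (a stand-in here for GUIDE's importance scores).
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_aug, y)
imp = rf.feature_importances_

# Keep real variables whose score exceeds the largest score
# attained by any artificial variable.
threshold = imp[p:].max()
selected = np.flatnonzero(imp[:p] > threshold)
print(selected)
```

Under this setup the strongly influential predictors should clear the artificial-variable threshold while most pure-noise predictors do not, which is the behavior the thresholding idea is meant to exploit.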


Keywords: Variable selection · Random forest · Importance score · Variable selection method · Artificial variable



This research was partially supported by the U.S. Army Research Office under grants W911NF-05-1-0047 and W911NF-09-1-0205. The author is grateful to K. Doksum, S. Tang, and K. Tsui for helpful discussions and to S. Tang for the computer code for EARTH.


References

  1. Breiman L (2001) Random forests. Mach Learn 45:5–32
  2. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
  3. Chernoff H, Lo S-H, Zheng T (2009) Discovering influential variables: a method of partitions. Ann Appl Stat 3:1335–1369
  4. Doksum K, Tang S, Tsui K-W (2008) Nonparametric variable selection: the EARTH algorithm. J Am Stat Assoc 103:1609–1620
  5. Friedman J (1991) Multivariate adaptive regression splines (with discussion). Ann Stat 19:1–141
  6. Loh W-Y (2002) Regression trees with unbiased variable selection and interaction detection. Stat Sin 12:361–386
  7. Loh W-Y (2009) Improving the precision of classification trees. Ann Appl Stat 3:1710–1737
  8. Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biometrics Bull 2:110–114
  9. Seber GAF, Lee AJ (2003) Linear regression analysis, 2nd edn. Wiley, New York
  10. Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinf 8:25
  11. Tuv E, Borisov A, Torkkola K (2006) Feature selection using ensemble based ranking against artificial contrasts. In: IJCNN '06, International joint conference on neural networks, Vancouver, Canada

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. University of Wisconsin, Madison, USA
