Random forest with acceptance–rejection trees

  • Peter Calhoun
  • Melodie J. Hallett
  • Xiaogang Su
  • Guy Cafri
  • Richard A. Levine
  • Juanjuan Fan (corresponding author)
Original Paper

Abstract

In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants, including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance, and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed best, with ER second best. For regression problems, RF and SSS performed best, followed by AR and then ER. However, each algorithm was the most accurate on at least one dataset. We investigate scenarios in which the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.
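
To make the core idea concrete, the sketch below illustrates one way an acceptance–rejection split search could work: a candidate split is drawn completely at random, and it is accepted only if it clears a minimum quality bar; otherwise it is rejected and redrawn. This is a minimal sketch, not the paper's implementation; the Gini-based quality criterion and the `threshold` and `max_tries` parameters are illustrative assumptions, since the paper's exact acceptance rule is not reproduced in this abstract.

```python
import numpy as np

def gini_reduction(y, mask):
    """Decrease in Gini impurity from splitting integer labels y by boolean mask."""
    def gini(v):
        if v.size == 0:
            return 0.0
        p = np.bincount(v) / v.size
        return 1.0 - np.sum(p ** 2)
    n, n_left = y.size, int(mask.sum())
    return (gini(y)
            - (n_left / n) * gini(y[mask])
            - ((n - n_left) / n) * gini(y[~mask]))

def accept_reject_split(X, y, threshold=0.01, max_tries=25, rng=None):
    """Draw completely random splits; accept the first that clears `threshold`.

    NOTE: `threshold`, `max_tries`, and the Gini criterion are illustrative
    assumptions, not values from the paper. Returns (feature index, cut point)
    for an accepted split, or None to signal that the node becomes a leaf.
    """
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(max_tries):
        j = rng.integers(X.shape[1])              # completely random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        cut = rng.uniform(lo, hi)                 # completely random cut point
        mask = X[:, j] <= cut
        if 0 < mask.sum() < y.size and gini_reduction(y, mask) >= threshold:
            return j, cut                         # accepted: quality bar cleared
    return None                                   # all proposals rejected
```

Under this scheme, split selection stays close to uniform over features (which is what removes the bias toward variables with many possible splits), while the acceptance step filters out uninformative splits, providing the quality control the abstract refers to.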

Keywords

Classification and regression trees · Supervised learning · Prediction · Variable selection bias · Ensemble methods

Notes

Acknowledgements

This research was supported in part by NSF Grant 163310.

Supplementary material

Supplementary material 1: 180_2019_929_MOESM1_ESM.pdf (PDF 123 KB)

References

  1. Allwein E, Schapire R, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
  2. Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588
  3. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
  4. Breiman L (2001) Random forests. Mach Learn 45:5–32
  5. Breiman L (2004) Consistency for a simple model of random forests. Technical report, University of California at Berkeley
  6. Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning, pp 161–168
  7. Caruana R, Karampatziakis N, Yessenalina A (2008) An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th international conference on machine learning, pp 96–103
  8. Chambers J, Cleveland W, Kleiner B, Tukey P (1983) Graphical methods for data analysis. Wadsworth, Belmont
  9. Cutler D, Edwards T Jr, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random forests for classification in ecology. Ecology 88:2783–2792
  10. Davis R, Anderson Z (1989) Exponential survival trees. Stat Med 8:947–962
  11. Derrig R, Francis L (2008) Distinguishing the forest from the trees: a comparison of tree-based data mining methods. Variance 2:184–208
  12. Dietterich T, Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes. J Artif Intell Res 2:263–286
  13. Fan J, Su X, Levine R, Nunn M, LeBlanc M (2006) Trees for correlated survival data by goodness of split, with applications to tooth prognosis. J Am Stat Assoc 101:959–967
  14. Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
  15. Genuer R, Poggi JM, Tuleau C (2008) Random forests: some methodological insights. arXiv preprint
  16. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42
  17. Gordon L, Olshen R (1985) Tree-structured survival analysis. Cancer Treat Rep 69:1065–1069
  18. Hajjem A, Bellavance F, Larocque D (2014) Mixed effects random forest for clustered data. J Stat Comput Simul 84:1313–1328
  19. Hanley J, McNeil B (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
  20. Ho T (1995) Random decision forests. In: Proceedings of the 3rd international conference on document analysis and recognition, vol 1, pp 278–282
  21. Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844
  22. Hosmer D, Lemeshow S (1989) Applied logistic regression. Wiley, New York
  23. Hothorn T, Leisch F, Zeileis A, Hornik K (2005) The design and analysis of benchmark experiments. J Comput Graph Stat 14:675–699
  24. Ishwaran H (2015) The effect of splitting on random forests. Mach Learn 99:75–118
  25. Ishwaran H, Kogalur UB (2016) Random forests for survival, regression, and classification (RF-SRC). R package version 2.2.0
  26. Ishwaran H, Kogalur U, Gorodeski E, Minn A, Lauer M (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105:205–217
  27. König I, Malley J, Weimar C, Diener HC, Ziegler A (2007) Practical experiences on the necessity of external validation. Stat Med 26:5499–5511
  28. Leisch F, Dimitriadou E (2010) mlbench: machine learning benchmark problems. R package version 2.1-1
  29. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
  30. Malley J, Kruppa J, Dasgupta A, Malley K, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51:74–81
  31. Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed May 2018
  32. Segal M, Xiao Y (2011) Multivariate random forests. WIREs Data Min Knowl Discov 1:80–87
  33. Sela R, Simonoff J (2012) RE-EM trees: a data mining approach for longitudinal and clustered data. Mach Learn 86:169–207
  34. Shah A, Bartlett J, Carpenter J, Nicholas O, Hemingway H (2014) Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol 179:764–774
  35. Strobl C, Boulesteix A, Zeileis A, Augustin T (2007a) Unbiased split selection for classification trees based on the Gini index. Comput Stat Data Anal 52:483–501
  36. Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007b) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8:25–46
  37. Su X, Kang J, Liu L, Yang Q, Fan J, Levine R (2016) Smooth sigmoid surrogate (SSS): an alternative to greedy search in recursive partitioning. Comput Stat Data Anal (under review)
  38. Su X, Pena A, Liu L, Levine R (2018) Random forests of interaction trees for estimating individualized treatment effects in randomized trials. Stat Med 37:2547–2560
  39. Torgo L (1999) Inductive learning of tree-based regression models. Ph.D. thesis, University of Porto
  40. Yoo W, Ference B, Cote M, Schwartz A (2012) A comparison of logistic regression, logic regression, classification tree, and random forests to identify effective gene–gene and gene–environment interactions. Int J Appl Sci Technol 2:268

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Peter Calhoun (1)
  • Melodie J. Hallett (2)
  • Xiaogang Su (3)
  • Guy Cafri (4)
  • Richard A. Levine (5, 6)
  • Juanjuan Fan (6, corresponding author)

  1. Jaeb Center for Health Research, Tampa, USA
  2. Department of Sociology, San Diego State University, San Diego, USA
  3. Department of Mathematical Sciences, University of Texas, El Paso, USA
  4. Johnson & Johnson Medical Devices, San Diego, USA
  5. Analytic Studies and Institutional Research, San Diego State University, San Diego, USA
  6. Department of Mathematics and Statistics, San Diego State University, San Diego, USA
