## Abstract

In this paper, we propose a new random forest method based on completely randomized splitting rules with an acceptance–rejection criterion for quality control. We show how the proposed acceptance–rejection (AR) algorithm can outperform the standard random forest algorithm (RF) and some of its variants, including extremely randomized (ER) trees and smooth sigmoid surrogate (SSS) trees. Twenty datasets were analyzed to compare prediction performance, and a simulated dataset was used to assess variable selection bias. In terms of prediction accuracy for classification problems, the proposed AR algorithm performed best, with ER second best. For regression problems, RF and SSS performed best, followed by AR and then ER. However, each algorithm was the most accurate on at least one study. We investigate scenarios in which the AR algorithm can yield better predictive performance. In terms of variable importance, both RF and SSS demonstrated selection bias in favor of variables with many possible splits, while both ER and AR largely removed this bias.
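To make the acceptance–rejection idea concrete, the following is a minimal sketch of one AR-style split search, assuming a regression setting with variance reduction as the quality measure. The function name, the relative threshold, and the retry cap are illustrative choices, not the paper's exact specification: a completely random split (random feature, random cut point) is proposed, and it is accepted only if its impurity reduction clears a quality threshold; otherwise new proposals are drawn, falling back to the best rejected proposal.

```python
import numpy as np

def ar_random_split(X, y, threshold=0.01, max_tries=20, rng=None):
    """Propose completely random splits and accept the first whose variance
    reduction exceeds `threshold` times the parent variance; otherwise
    return the best rejected proposal. Returns (feature, cut, reduction)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    parent_var = y.var()
    best = None  # (reduction, feature, cut)
    for _ in range(max_tries):
        j = rng.integers(p)                                   # random feature
        cut = rng.uniform(X[:, j].min(), X[:, j].max())       # random cut point
        left = X[:, j] <= cut
        if left.all() or not left.any():
            continue                                          # degenerate split
        # variance reduction = parent variance - weighted child variances
        red = parent_var - (left.mean() * y[left].var()
                            + (~left).mean() * y[~left].var())
        if best is None or red > best[0]:
            best = (red, j, cut)
        if red >= threshold * parent_var:                     # acceptance criterion
            return j, cut, red
    return best[1], best[2], best[0]                          # best rejected proposal

# Toy usage: the signal lives in feature 2, so an accepted split should
# tend to land there with a sizable reduction.
gen = np.random.default_rng(0)
X = gen.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(float) + gen.normal(scale=0.1, size=200)
feature, cut, reduction = ar_random_split(X, y, rng=1)
```

By the law of total variance, the weighted child variances never exceed the parent variance, so the reduction is always nonnegative; the threshold simply filters out proposals that barely improve on the parent node.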

## Keywords

Classification and regression trees · Supervised learning · Prediction · Variable selection bias · Ensemble methods

## Notes

### Acknowledgements

This research was supported in part by NSF Grant 163310.

## References

- Allwein E, Schapire R, Singer Y (2000) Reducing multiclass to binary: a unifying approach for margin classifiers. J Mach Learn Res 1:113–141
- Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588
- Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
- Breiman L (2001) Random forests. Mach Learn 45:5–32
- Breiman L (2004) Consistency for a simple model of random forests. Technical report, University of California at Berkeley
- Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning, pp 161–168
- Caruana R, Karampatziakis N, Yessenalina A (2008) An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th international conference on machine learning, pp 96–103
- Chambers J, Cleveland W, Kleiner B, Tukey P (1983) Graphical methods for data analysis. Wadsworth, Belmont
- Cutler D, Edwards T Jr, Beard K, Cutler A, Hess K, Gibson J, Lawler J (2007) Random forests for classification in ecology. Ecology 88:2783–2792
- Davis R, Anderson Z (1989) Exponential survival trees. Stat Med 8:947–962
- Derrig R, Francis L (2008) Distinguishing the forest from the trees: a comparison of tree-based data mining methods. Variance 2:184–208
- Dietterich T, Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes. J Artif Intell Res 2:263–286
- Fan J, Su X, Levine R, Nunn M, LeBlanc M (2006) Trees for correlated survival data by goodness of split, with applications to tooth prognosis. J Am Stat Assoc 101:959–967
- Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
- Genuer R, Poggi JM, Tuleau C (2008) Random forests: some methodological insights. arXiv
- Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42
- Gordon L, Olshen R (1985) Tree-structured survival analysis. Cancer Treat Rep 69:1065–1069
- Hajjem A, Bellavance F, Larocque D (2014) Mixed effects random forest for clustered data. J Stat Comput Simul 84:1313–1328
- Hanley J, McNeil B (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36
- Ho T (1995) Random decision forests. In: Proceedings of the 3rd international conference on document analysis and recognition, vol 1, pp 278–282
- Ho T (1998) The random subspace method of constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844
- Hosmer D, Lemeshow S (1989) Applied logistic regression. Wiley, New York
- Hothorn T, Leisch F, Zeileis A, Hornik K (2005) The design and analysis of benchmark experiments. J Comput Graph Stat 14:675–699
- Ishwaran H (2015) The effect of splitting on random forests. Mach Learn 99:75–118
- Ishwaran H, Kogalur UB (2016) Random forests for survival, regression, and classification (RF-SRC). R package version 2.2.0
- Ishwaran H, Kogalur U, Gorodeski E, Minn A, Lauer M (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105:205–217
- König I, Malley J, Weimar C, Diener HC, Ziegler A (2007) Practical experiences on the necessity of external validation. Stat Med 26:5499–5511
- Leisch F, Dimitriadou E (2010) mlbench: machine learning benchmark problems. R package version 2.1-1
- Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
- Malley J, Kruppa J, Dasgupta A, Malley K, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med 51:74–81
- Newman DJ, Hettich S, Blake CL, Merz CJ (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed May 2018
- Segal M, Xiao Y (2011) Multivariate random forests. WIREs Data Min Knowl Discov 1:80–87
- Sela R, Simonoff J (2012) RE-EM trees: a data mining approach for longitudinal and clustered data. Mach Learn 86:169–207
- Shah A, Bartlett J, Carpenter J, Nicholas O, Hemingway H (2014) Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol 179:764–774
- Strobl C, Boulesteix A, Zeileis A, Augustin T (2007a) Unbiased split selection for classification trees based on the Gini index. Comput Stat Data Anal 52:483–501
- Strobl C, Boulesteix A, Zeileis A, Hothorn T (2007b) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8:25–46
- Su X, Kang J, Liu L, Yang Q, Fan J, Levine R (2016) Smooth sigmoid surrogate (SSS): an alternative to greedy search in recursive partitioning. Comput Stat Data Anal (under review)
- Su X, Pena A, Liu L, Levine R (2018) Random forests of interaction trees for estimating individualized treatment effects in randomized trials. Stat Med 37:2547–2560
- Torgo L (1999) Inductive learning of tree-based regression models. Ph.D. thesis, University of Porto
- Yoo W, Ference B, Cote M, Schwartz A (2012) A comparison of logistic regression, logic regression, classification tree, and random forests to identify effective gene–gene and gene–environment interactions. Int J Appl Sci Technol 2:268