Auto-CES: An Automatic Pruning Method Through Clustering Ensemble Selection
Ensemble learning is a machine learning approach in which multiple learners are trained to solve the same problem. Random Forest is an ensemble learning algorithm that comprises numerous decision trees and predicts a class by majority voting for classification or by averaging for regression. Prior research shows that the training time of the Random Forest algorithm increases linearly with the number of trees in the forest. This large number of decision trees can cause certain challenges: first, it enlarges the model complexity, and second, it hurts efficiency on large-scale datasets. Hence, ensemble pruning methods, such as Clustering Ensemble Selection (CES), have been devised to select a subset of decision trees from the forest. The main limitation is that prior CES models require the number of clusters as input. To address this, we devise an Automatic CES pruning model (Auto-CES) for Random Forest that finds the proper number of clusters automatically. Our proposed model obtains a subset of trees that provides the same or even better effectiveness than the original ensemble. Auto-CES has two components: clustering and selection. In the clustering step, the algorithm uses a new clustering technique to group homogeneous trees; in the selection step, it considers both the accuracy and the diversity of the trees to choose the best tree from each cluster.
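The general clustering-then-selection scheme described above can be sketched in a few lines of scikit-learn. This is a minimal illustration only, not the paper's Auto-CES: it represents each tree by its prediction vector on a held-out validation set, picks the number of clusters with a silhouette-score search (one plausible automatic criterion; the paper's own rule may differ), and keeps the most accurate tree from each cluster. All dataset sizes and parameter values below are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.model_selection import train_test_split

# Toy data and a full forest (all sizes here are illustrative).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Represent each tree by its prediction vector on the validation set;
# trees with similar vectors are largely redundant.
preds = np.array([tree.predict(X_val) for tree in forest.estimators_])

# Choose the number of clusters automatically by maximizing the
# silhouette score over a small candidate range (an assumed criterion).
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(preds)
    score = silhouette_score(preds, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

# Selection: clustering enforces diversity between kept trees, and
# per-cluster accuracy picks the best representative of each cluster.
selected = []
for c in range(best_k):
    members = np.where(best_labels == c)[0]
    accs = [accuracy_score(y_val, preds[i]) for i in members]
    selected.append(members[int(np.argmax(accs))])

# Majority vote over the pruned subset (binary labels here).
vote = (preds[selected].mean(axis=0) >= 0.5).astype(int)
print(f"pruned {len(forest.estimators_)} trees down to {len(selected)}")
```

The pruned ensemble votes with `best_k` trees instead of 50; whether it matches the full forest's accuracy depends on the data and the clustering criterion, which is exactly the trade-off the paper studies.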
Extensive experiments conducted on five datasets show that our algorithm performs the classification task more effectively than state-of-the-art rivals.
Keywords: Machine learning · Ensemble method · Random Forest · Decision tree · Clustering Ensemble Selection · Pruning of Random Forest