An Ensemble of Optimal Trees for Class Membership Probability Estimation

Khan, Zardad; Gul, Asma; Mahmoud, Osama; Miftahuddin, Miftahuddin; Perperoglou, Aris; Adler, Werner; Lausen, Berthold

doi:10.1007/978-3-319-25226-1_34

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Machine learning methods can be used to estimate the class membership probability of an observation. We propose an ensemble of trees that are optimal in terms of their predictive performance. The ensemble is formed by selecting the best trees from a large initial set of trees grown by random forest. First, a proportion of the trees is selected on the basis of their individual predictive performance on out-of-bag observations. The selected trees are then assessed for their collective performance on an independent training data set: they are added one by one, starting with the tree that has the highest individual predictive performance, and a tree is retained in the final ensemble only if it improves the predictive performance of the trees already combined. The proposed method is compared with probability estimation tree, random forest and node harvest on a number of benchmark problems, using the Brier score as the performance measure. In addition to reducing the number of trees in the ensemble, our method gives better results in most cases. The results are supported by a simulation study.
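
The two-phase selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default proportion of 0.2, the binary form of the Brier score, and the choice to combine trees by averaging their probability estimates are all assumptions made for illustration. Each tree is represented only by its out-of-bag score and its probability estimates on the independent data set, so how the trees are grown is left out.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Binary Brier score: mean squared gap between P(class 1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def average_probs(rows: List[Sequence[float]]) -> List[float]:
    """Average per-tree probability estimates, observation by observation.

    Averaging is an assumption; the abstract does not say how trees are combined.
    """
    return [sum(column) / len(rows) for column in zip(*rows)]


def select_optimal_trees(
    oob_scores: Sequence[float],
    val_probs: Sequence[Sequence[float]],
    val_labels: Sequence[int],
    proportion: float = 0.2,  # hypothetical default, not taken from the paper
) -> List[int]:
    """Return the indices of the trees kept in the final ensemble.

    oob_scores[i]: Brier score of tree i on its out-of-bag observations
                   (lower is better).
    val_probs[i]:  tree i's estimates of P(class 1) on the independent data set.
    val_labels:    0/1 labels of the independent data set.
    """
    # Phase 1: keep the best proportion of trees by individual OOB performance.
    n_keep = max(1, int(proportion * len(oob_scores)))
    ranked = sorted(range(len(oob_scores)), key=lambda i: oob_scores[i])[:n_keep]

    # Phase 2: add trees one by one, best first, and keep a tree only if it
    # lowers the Brier score of the combined ensemble on the independent data.
    selected = [ranked[0]]
    best = brier_score(val_probs[ranked[0]], val_labels)
    for i in ranked[1:]:
        candidate = selected + [i]
        combined = average_probs([val_probs[j] for j in candidate])
        score = brier_score(combined, val_labels)
        if score < best:
            selected, best = candidate, score
    return selected
```

In this sketch a tree that fails to improve the combined score is discarded for good, which is one reading of "adding the trees one by one"; other tie-breaking or re-evaluation rules are possible.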

Author information

Authors and Affiliations

Department of Mathematical Sciences, University of Essex, Colchester, UK
Zardad Khan, Asma Gul, Osama Mahmoud, Miftahuddin Miftahuddin, Aris Perperoglou & Berthold Lausen
Department of Statistics, Abdul Wali Khan University, Mardan, Pakistan
Zardad Khan
Department of Statistics, Shaheed Benazir Bhutto Women University Peshawar, Khyber Pukhtoonkhwa, Pakistan
Asma Gul
Department of Applied Statistics, Helwan University, Cairo, Egypt
Osama Mahmoud
Department of Biometry and Epidemiology, University of Erlangen-Nuremberg, Erlangen, Germany
Werner Adler

Corresponding author

Correspondence to Zardad Khan.

Editor information

Editors and Affiliations

Jacobs University Bremen, Bremen, Germany
Adalbert F.X. Wilhelm
Institute of Medical Systems Biology, Universität Ulm, Ulm, Baden-Württemberg, Germany
Hans A. Kestler

About this paper

Cite this paper

Khan, Z. et al. (2016). An Ensemble of Optimal Trees for Class Membership Probability Estimation. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_34

DOI: https://doi.org/10.1007/978-3-319-25226-1_34
Published: 04 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25224-7
Online ISBN: 978-3-319-25226-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
