Machine Learning, Volume 101, Issue 1–3, pp 325–343

Two-level quantile regression forests for bias correction in range prediction

  • Thanh-Tung Nguyen
  • Joshua Z. Huang
  • Thuy Thi Nguyen


Quantile regression forests (QRF), a tree-based ensemble method for estimating conditional quantiles, have been shown to perform well in terms of prediction accuracy, especially for range prediction. However, the model may be biased and may perform poorly on high dimensional data (thousands of features). In this paper, we propose a new bias correction method, called bcQRF, that applies bias correction to QRF for range prediction. In bcQRF, a new feature weighting subspace sampling method is used to build the first-level QRF model. The residuals of the first-level QRF model are then used as the response feature to train the second-level QRF model for bias correction. The two-level models are combined to compute bias-corrected predictions. Extensive experiments on both synthetic and real-world data sets have demonstrated that the bcQRF method significantly reduced prediction errors and outperformed most existing regression random forests. The new method performed especially well on high dimensional data.
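To make the two-level idea concrete, the following is a minimal sketch under stated assumptions: scikit-learn's RandomForestRegressor stands in for a quantile regression forest, a second forest is fit to the first-level out-of-bag residuals, and the predicted residual is added back to the first-level prediction. The paper's feature-weighting subspace sampling and the quantile (range) output are not reproduced here; all names and parameters below are illustrative and are not the authors' implementation.

```python
# Illustrative sketch of two-level residual bias correction (not the paper's bcQRF).
# RandomForestRegressor is used as a stand-in for a quantile regression forest.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data for demonstration only.
X, y = make_regression(n_samples=1000, n_features=50, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level 1: fit a forest to the response and take its out-of-bag predictions,
# so the residuals are not computed on fitted (in-bag) values.
rf1 = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf1.fit(X_train, y_train)
residuals = y_train - rf1.oob_prediction_

# Level 2: fit a second forest with the level-1 residuals as the response feature.
rf2 = RandomForestRegressor(n_estimators=500, random_state=0)
rf2.fit(X_train, residuals)

# Bias-corrected prediction = level-1 prediction + predicted residual.
y_hat = rf1.predict(X_test) + rf2.predict(X_test)
print("MSE (uncorrected):   ", np.mean((y_test - rf1.predict(X_test)) ** 2))
print("MSE (bias-corrected):", np.mean((y_test - y_hat) ** 2))
```

A range (interval) prediction in the spirit of the paper would replace the point forecasts above with conditional quantile estimates from a quantile regression forest and apply the same residual-based correction to the quantile outputs.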


Keywords: Bias correction · Random forests · Quantile regression forests · High dimensional data · Data mining



This work is supported by the Shenzhen New Industry Development Fund under Grant No. JC201005270342A and the project "Some Advanced Statistical Learning Techniques for Computer Vision" funded by the National Foundation for Science and Technology Development, Vietnam, under grant number 102.01-2011.17.



Copyright information

© The Author(s) 2014

Authors and Affiliations

  • Thanh-Tung Nguyen 1, 2, 3
  • Joshua Z. Huang 1, 4
  • Thuy Thi Nguyen 5

  1. Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  2. School of Computer Science and Engineering, Water Resources University, Hanoi, Vietnam
  3. University of Chinese Academy of Sciences, Beijing, China
  4. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
  5. Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Vietnam
