An empirical comparison of random forest-based and other learning-to-rank algorithms

Abstract

Random forest (RF)-based pointwise learning-to-rank (LtR) algorithms use surrogate loss functions to minimize the ranking error. Although they perform competitively with other state-of-the-art LtR algorithms, they have not been investigated as thoroughly in the literature as other frameworks such as boosting and neural networks. In the first part of this study, we aim to better understand and improve RF-based pointwise LtR algorithms. Using such an algorithm currently requires choosing among a number of options, such as (1) a classification versus a regression setting, (2) absolute relevance judgements versus mapped labels, (3) the number of features from which the split point for the data is chosen, and (4) a weighted versus an unweighted average of the predictions of multiple base learners (i.e., trees). We conduct a thorough study of these four aspects as well as of a pairwise objective function for RF-based rank-learners. Experimental results on several benchmark LtR datasets demonstrate that performance can be significantly improved by exploring these aspects. In the second part of this paper, guided by our investigation of RF-based rank-learners, we conduct an extensive comparison between these and state-of-the-art rank-learning algorithms. This comparison reveals several insightful findings about LtR algorithms, including that RF-based LtR algorithms are among the most robust techniques across datasets with diverse properties.
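
As an illustration of option (4) above, the following minimal Python sketch contrasts an unweighted and a weighted average of per-tree predictions when ranking the documents of one query. The function names and the example weights are hypothetical; how tree weights are actually derived is not specified in this excerpt, so they are simply passed in.

```python
from typing import List, Optional, Sequence


def forest_score(tree_scores: Sequence[float],
                 weights: Optional[Sequence[float]] = None) -> float:
    """Combine the per-tree relevance predictions for one document.

    With weights=None this is the plain (unweighted) mean; otherwise it is
    a weighted mean. The weights are an assumption of this sketch.
    """
    if weights is None:
        return sum(tree_scores) / len(tree_scores)
    return sum(w * s for w, s in zip(weights, tree_scores)) / sum(weights)


def rank_documents(per_doc_tree_scores: Sequence[Sequence[float]],
                   weights: Optional[Sequence[float]] = None) -> List[int]:
    """Return document indices sorted by decreasing combined score."""
    scores = [forest_score(s, weights) for s in per_doc_tree_scores]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Three documents, each scored by three trees (made-up numbers).
docs = [[0.0, 0.0, 3.0], [1.0, 1.0, 1.5], [2.0, 2.0, 0.0]]
unweighted_order = rank_documents(docs)
weighted_order = rank_documents(docs, weights=[0.1, 0.1, 0.8])
```

In this toy input the two schemes produce opposite orderings, which is why the choice is worth evaluating empirically.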


Notes

  1. www.google.com.

  2. Manning et al. [1] nicely explain these models.

  3. http://research.microsoft.com/en-us/um/beijing/projects/letor/.

  4. http://research.microsoft.com/en-us/projects/mslr/.

  5. For details of these metrics, see Järvelin and Kekäläinen [31], Chapelle et al. [32] and Ibrahim and Murshed [4].

  6. This algorithm is also used by Ibrahim and Carman [15].

  7. While several functions can serve as the splitting criterion, Robnik-Šikonja [35] shows that this choice causes little, if any, variation in performance.

  8. While most tree implementations in the literature use a depth-first (i.e., recursive) exploration of the nodes, the implementation shown here uses a breadth-first exploration, mainly because we think it is a more systematic way of exploring the nodes. For an entropy-based objective function, the node exploration strategy does not affect the tree structure, i.e., the data partitions [15].

  9. For all of the experiments of this section, for the two larger datasets (MSLR-WEB10K and Yahoo), figures in bold italic and in bold denote that the best performance is significant with p value less than 0.01 and 0.05, respectively. For the smaller datasets, an average over five independent runs is reported (each run being the result of fivefold cross-validation), and the winning value is given in italic font.

  10. Recall that the usual practice has been to use the average relevance of the instances as the score.

  11. Since the properties of the HP2004 and NP2004 datasets are similar to those of TD2004, we do not conduct further experiments.

  12. Although the features are not disclosed for the Yahoo dataset, a particular feature index corresponding to BM25 is mentioned.

  13. This is intuitive: since navigational queries have very few relevant documents, setting a higher value for k helps the base ranker(s) put those few relevant documents into the training set of the rank-learning phase.

  14. Ibrahim and Carman [15] report results of only RF-rand, RF-point with classification and RF-list.

  15. We perform a pairwise significance test on the comparatively larger (in terms of number of queries) MQ2007 and MQ2008 datasets. Since the number of queries is small for the rest of the datasets, the significance test results may not be reliable.

  16. As explained earlier, since the HP2004 and NP2004 datasets contain navigational queries, MAP may not be a very effective choice for evaluating this type of information need [1, Sec. 8.4]. That is why we chose NDCG@10 for the overall comparison in Table 12.

  17. We did, however, run a pilot experiment comparing RF-point and RF-hybrid with \(K \in \{23, 50, 80, 130\}\) (the value 23 is used as \(\sqrt{M}\)), and did not observe any improvement in the performance of RF-hybrid over RF-point for any of the settings. We thus conclude that K is not likely to play a significant role in the relative performance of these two systems for the Yahoo dataset.

  18. The inventors of the said technique [63] acknowledge a concern about a slight degradation of accuracy.

  19. http://research.microsoft.com/en-us/um/beijing/projects/letor//Baselines/RankSVM-Primal.html.

  20. https://people.cs.umass.edu/~vdang/ranklib.html.

  21. https://code.google.com/p/jforests/.
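
To make Note 8 concrete, the following is a minimal Python sketch of breadth-first tree growth: nodes waiting to be split are kept in a FIFO queue instead of being expanded recursively. It is not the paper's implementation. The entropy-gain split search, the midpoint thresholds, the dictionary node layout and all function names are assumptions of this sketch; the leaf score follows the usual practice mentioned in Note 10 (average relevance of the instances), and `k` plays the role of the number of candidate features per split.

```python
import math
import random
from collections import Counter, deque


def entropy(labels):
    """Shannon entropy (in bits) of a list of relevance labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(X, y, rows, features):
    """Return (gain, feature, threshold) with the highest entropy gain, or None."""
    parent = entropy([y[i] for i in rows])
    best = None
    for f in features:
        values = sorted({X[i][f] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between consecutive distinct values
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            gain = parent - children
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def grow_tree_bfs(X, y, k, rng=None):
    """Grow one tree breadth-first, sampling k candidate features per node."""
    rng = rng or random.Random(0)
    root = {"rows": list(range(len(X)))}
    queue = deque([root])  # FIFO queue gives level-by-level exploration
    while queue:
        node = queue.popleft()
        rows = node.pop("rows")
        labels = [y[i] for i in rows]
        node["score"] = sum(labels) / len(labels)  # average relevance (Note 10)
        features = rng.sample(range(len(X[0])), k)
        split = best_split(X, y, rows, features) if len(rows) > 1 else None
        if split is None or split[0] <= 0:
            continue  # no useful split: the node stays a leaf
        _, f, t = split
        node["feature"], node["threshold"] = f, t
        node["left"] = {"rows": [i for i in rows if X[i][f] <= t]}
        node["right"] = {"rows": [i for i in rows if X[i][f] > t]}
        queue.extend([node["left"], node["right"]])
    return root


def predict(node, x):
    """Route a feature vector to a leaf and return that leaf's score."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["score"]


# Toy data: feature 0 separates the labels, feature 1 is constant.
X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
y = [0, 0, 2, 2]
tree = grow_tree_bfs(X, y, k=2)
```

Replacing `queue.popleft()` with `queue.pop()` would turn the same loop into a depth-first exploration.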

References

  1. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge

  2. Li H (2011) Learning to rank for information retrieval and natural language processing. Synth Lect Hum Lang Technol 4(1):1–113

  3. Liu TY (2011) Learning to rank for information retrieval. Springer, Berlin

  4. Ibrahim M, Murshed M (2016) From tf-idf to learning-to-rank: an overview. In: Handbook of research on innovations in information retrieval, analysis, and management. IGI Global, USA, pp 62–109

  5. Karatzoglou A, Baltrunas L, Shi Y (2013) Learning to rank for recommender systems. In: Proceedings of the 7th ACM conference on recommender systems, ACM, pp 493–494

  6. Ouyang Y, Li W, Li S, Lu Q (2011) Applying regression models to query-focused multi-document summarization. Inf Process Manag 47(2):227–237

  7. Santos RL, Macdonald C, Ounis I (2013) Learning to rank query suggestions for adhoc and diversity search. Inf Retr 16(4):429–451

  8. Li Z, Tang J, Mei T (2019) Deep collaborative embedding for social image understanding. IEEE Trans Pattern Anal Mach Intell 41(9):2070–2083

  9. Li Z, Tang J, He X (2018) Robust structured nonnegative matrix factorization for image representation. IEEE Trans Neural Netw Learn Syst 29(5):1947–1960

  10. Dang V, Bendersky M, Croft WB (2013) Two-stage learning to rank for information retrieval. In: Advances in information retrieval. Springer, pp 423–434

  11. Macdonald C, Santos RL, Ounis I (2013) The whens and hows of learning to rank for web search. Inf Retr 16(5):584–628

  12. Aslam JA, Kanoulas E, Pavlu V, Savev S, Yilmaz E (2009) Document selection methodologies for efficient and effective learning-to-rank. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval. ACM, pp 468–475

  13. Pavlu V (2008) Large scale IR evaluation. ProQuest LLC, Ann Arbor

  14. Qin T, Liu TY, Xu J, Li H (2010) LETOR: a benchmark collection for research on learning to rank for information retrieval. Inf Retr 13(4):346–374

  15. Ibrahim M, Carman M (2016) Comparing pointwise and listwise objective functions for random-forest-based learning-to-rank. ACM TOIS 34(4):20

  16. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  17. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7

  18. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP (2007) A comparison of decision tree ensemble creation techniques. IEEE Trans Pattern Anal Mach Intell 29(1):173–180

  19. Caruana R, Niculescu-Mizil A (2006) An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 161–168

  20. Criminisi A (2011) Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph Vis 7(2–3):81–227

  21. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of classifiers to solve real world classification problems? J Mach Learn Res 15(1):3133–3181

  22. Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13:1063–1095

  23. Cui Z, Chen W, He Y, Chen Y (2015) Optimal action extraction for random forests and boosted trees. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 179–188

  24. Díaz-Uriarte R, De Andres SA (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7(1):3

  25. Dong Y, Zhang Y, Yue J, Hu Z (2016) Comparison of random forest, random ferns and support vector machine for eye state classification. Multimed Tools Appl 75(19):11763–11783

  26. Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recogn Lett 27(4):294–300

  27. Chapelle O, Chang Y (2011) Yahoo! learning to rank challenge overview. J Mach Learn Res Proc Track 14:1–24

  28. Geurts P, Louppe G (2011) Learning to rank with extremely randomized trees. In: JMLR: workshop and conference proceedings, vol 14

  29. Mohan A, Chen Z, Weinberger KQ (2011) Web-search ranking with initialized gradient boosted regression trees. J Mach Learn Res Proc Track 14:77–89

  30. Han X, Lei S (2018) Feature selection and model comparison on Microsoft learning-to-rank data sets. arXiv preprint arXiv:1803.05127

  31. Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 41–48

  32. Chapelle O, Metzler D, Zhang Y, Grinspan P (2009) Expected reciprocal rank for graded relevance. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM). ACM, pp 621–630

  33. Li P, Wu Q, Burges C (2007) McRank: learning to rank using classification and gradient boosting. Adv Neural Inf Process Syst 20:897–904

  34. Cossock D, Zhang T (2006) Subset ranking using regression. Learning Theory, pp 605–619

  35. Robnik-Šikonja M (2004) Improving random forests. In: Machine learning: ECML 2004. Springer, pp 359–370

  36. Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42

  37. Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15(1):1625–1651

  38. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  39. Zaidi N, Webb G, Carman M, Petitjean F (2015) Deep broad learning - big models for big data. arXiv preprint arXiv:1509.01346

  40. Winham SJ, Freimuth RR, Biernacka JM (2013) A weighted random forests approach to improve predictive performance. Stat Anal Data Min ASA Data Sci J 6(6):496–505

  41. Bernard S, Heutte L, Adam S (2009) On the selection of decision trees in random forests. In: International joint conference on neural networks, 2009. IJCNN 2009. IEEE, pp 302–307

  42. Li HB, Wang W, Ding HW, Dong J (2010) Trees weighting random forest method for classifying high-dimensional noisy data. In: 2010 IEEE 7th international conference on e-business engineering (ICEBE). IEEE, pp 160–163

  43. Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest? In: MLDM. Springer, pp 154–168

  44. Freund Y, Iyer R, Schapire RE, Singer Y (2003) An efficient boosting algorithm for combining preferences. J Mach Learn Res 4:933–969

  45. Tax N, Bockting S, Hiemstra D (2015) A cross-benchmark comparison of 87 learning to rank methods. Inf Process Manag 51(6):757–772

  46. Joachims T (2002) Optimizing search engines using clickthrough data. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 133–142

  47. Xu J, Li H (2007) AdaRank: a boosting algorithm for information retrieval. In: Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 391–398

  48. Wu Q, Burges CJ, Svore KM, Gao J (2010) Adapting boosting for information retrieval measures. Inf Retr 13(3):254–270

  49. Metzler D, Croft WB (2007) Linear feature-based models for information retrieval. Inf Retr 10(3):257–274

  50. Ibrahim M (2019) Sampling non-relevant documents of training sets for learning-to-rank algorithms. Int J Mach Learn Comput 10(2) (to appear)

  51. Ibrahim M (2019) Reducing correlation of random forest-based learning-to-rank algorithms using subsample size. Comput Intell 35(2):1–25

  52. Ibrahim M, Carman M (2014) Improving scalability and performance of random forest based learning-to-rank algorithms by aggressive subsampling. In: Proceedings of the 12th Australasian data mining conference, pp 91–99

  53. He B, Macdonald C, Ounis I (2008) Retrieval sensitivity under training using different measures. In: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 67–74

  54. Robertson S (2008) On the optimisation of evaluation metrics. In: Keynote, SIGIR 2008 workshop learning to rank for information retrieval (LR4IR)

  55. Donmez P, Svore KM, Burges CJ (2009) On the local optimality of LambdaRank. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval. ACM, pp 460–467

  56. Yilmaz E, Robertson S (2010) On the choice of effectiveness measures for learning to rank. Inf Retr 13(3):271–290

  57. Hanbury A, Lupu M (2013) Toward a model of domain-specific search. In: Proceedings of the 10th conference on open research areas in information retrieval, Le Centre De Hautes Etudes Internationales D’Informatique Documentaire, pp 33–36

  58. Hawking D (2004) Challenges in enterprise search. In: Proceedings of the 15th Australasian database conference, vol 27. Australian Computer Society, Inc., pp 15–24

  59. McCallum A, Nigam K, Rennie J, Seymore K (1999) A machine learning approach to building domain-specific search engines. In: IJCAI, vol 99. Citeseer, pp 662–667

  60. Owens L, Brown M, Poore K, Nicolson N (2008) The Forrester Wave: enterprise search, Q2 2008. For information and knowledge management professionals

  61. Yan X, Lau RY, Song D, Li X, Ma J (2011) Toward a semantic granularity model for domain-specific information retrieval. ACM TOIS 29(3):15

  62. Szummer M, Yilmaz E (2011) Semi-supervised learning to rank with preference regularization. In: Proceedings of the 20th ACM international conference on information and knowledge management (CIKM). ACM, pp 269–278

  63. Tyree S, Weinberger KQ, Agrawal K, Paykin J (2011) Parallel boosted regression trees for web search ranking. In: Proceedings of the 20th international conference on world wide web. ACM, pp 387–396

  64. Freund Y, Schapire RE (1995) A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational learning theory. Springer, pp 23–37

  65. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232

  66. Quoc C, Le V (2007) Learning to rank with nonsmooth cost functions. Proc Adv Neural Inf Process Syst 19:193–200

  67. Ganjisaffar Y, Caruana R, Lopes CV (2011) Bagging gradient-boosted trees for high precision, low variance ranking models. In: Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 85–94

  68. Ganjisaffar Y, Debeauvais T, Javanmardi S, Caruana R, Lopes CV (2011) Distributed tuning of machine learning algorithms using MapReduce clusters. In: Proceedings of the third workshop on large scale data mining: theory and applications. ACM, p 2

Author information

Corresponding author

Correspondence to Muhammad Ibrahim.

Appendix: Implementations and parameter settings of baseline algorithms

The baseline algorithms and their parameter settings are described below.

RankSVM [46] is an SVM-inspired pairwise LtR algorithm that has been used as a baseline in a large number of works on LtR. In our experiments, we use a publicly available implementation (see Note 19).

RankBoost [44] is an AdaBoost-inspired [64] pairwise algorithm which, instead of the standard exponential loss, uses a pairwise loss. We use the RankBoost implementation from RankLib (see Note 20), a popular package of LtR algorithms. We set the number of trees of the ensemble to 500.

AdaRank [47] also uses the AdaBoost framework but, unlike RankBoost, it adopts a listwise approach. We use the implementation in RankLib with the number of base learners set to 500 and NDCG@10 as the optimization metric for learning.
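
For readers unfamiliar with the optimization metric named above, the following Python sketch computes NDCG@k for a single query. It assumes the commonly used exponential-gain, log2-discount formulation; the exact variant used in the experiments is the one defined in the references given in Note 5, and the function names here are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k documents, in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=10):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering.

    `relevances` holds the graded labels of the documents in the order the
    ranker returned them. Queries without relevant documents score 0.0
    (one common convention, assumed here).
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([2, 1, 0])` is 1.0 because the documents are already in the ideal order, while swapping a relevant document below a non-relevant one lowers the value.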

The CoorAsc [49] algorithm uses the coordinate ascent method in a listwise manner. We use its RankLib implementation.

Mart [33] is a popular pointwise LtR algorithm based on gradient-boosted regression tree ensembles [65]. We use its RankLib implementation with the following changes to the default parameter settings: number of trees = 500, number of leaves for each tree = 7 (according to [17, Ch. 10], any value between 4 and 8 is likely to work well).

One of the most popular LtR algorithms is LambdaMart [48]. It blends the ingenious idea of the approximated gradient of its predecessor, LambdaRank [66], with the gradient boosting framework [65]. In the Yahoo LtR Challenge [27], a variant of LambdaMart topped the winning list. We use an open-source implementation of it mentioned in [67] (see Note 21). The parameter settings we maintain are as follows: number of trees = 500, number of leaves for each tree = 31 (Ganjisaffar et al. [68] report that a value close to this has been found to work well for MSLR-WEB10K). On smaller datasets, in order to mitigate overfitting, we set the number of leaves to 7, as for Mart. The rest of the parameters are kept unchanged. We note that during training, all six baselines make use of validation sets.
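
The settings described in this appendix can be collected in one place, for instance as the Python mapping below. This is only a summary sketch: the dictionary keys and the helper name are hypothetical, only values stated in the text are filled in, and treating Yahoo as a "larger" dataset for the LambdaMart leaf count is an assumption based on Note 9 rather than something the appendix states.

```python
# Baseline settings as stated in the appendix; unspecified values are omitted.
BASELINES = {
    "RankSVM":    {"approach": "pairwise", "implementation": "public RankSVM code (Note 19)"},
    "RankBoost":  {"approach": "pairwise", "implementation": "RankLib", "trees": 500},
    "AdaRank":    {"approach": "listwise", "implementation": "RankLib",
                   "base_learners": 500, "train_metric": "NDCG@10"},
    "CoorAsc":    {"approach": "listwise", "implementation": "RankLib"},
    "Mart":       {"approach": "pointwise", "implementation": "RankLib",
                   "trees": 500, "leaves": 7},
    "LambdaMart": {"implementation": "jforests (Note 21)", "trees": 500},
}

# Assumption: the two datasets called "larger" in Note 9.
LARGE_DATASETS = {"MSLR-WEB10K", "Yahoo"}


def lambdamart_leaves(dataset: str) -> int:
    """Leaves per tree for LambdaMart: 31 on larger datasets, 7 otherwise."""
    return 31 if dataset in LARGE_DATASETS else 7
```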

About this article

Cite this article

Ibrahim, M. An empirical comparison of random forest-based and other learning-to-rank algorithms. Pattern Anal Applic 23, 1133–1155 (2020). https://doi.org/10.1007/s10044-019-00856-6

Keywords

  • Learning-to-rank
  • Random forest
  • Decision tree
  • Parameter settings