Undersampling Techniques to Re-balance Training Data for Large Scale Learning-to-Rank

  • Conference paper
Information Retrieval Technology (AIRS 2014)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8870)

Abstract

Learning-to-rank (LtR) algorithms for information retrieval use the supervised learning framework to learn a ranking function from a training set consisting of query-document pairs. In this study we investigate the imbalanced nature of LtR training sets, which generally contain very few relevant documents compared to the number of irrelevant documents. The need to include as many relevant documents as possible in the training set is well known, but we ask how many irrelevant documents are needed in order to learn a good ranking function. We employ both random and deterministic undersampling techniques to reduce the number of irrelevant documents. Minimizing the training set size reduces the training time, which is an important factor in large scale LtR. Extensive experiments on the LETOR benchmark datasets reveal that the performance of an LtR algorithm trained on a much smaller training set remains similar to that obtained with the original training set. This study therefore suggests that for large scale LtR tasks, we can leverage undersampling techniques to reduce training time with negligible effect on performance.
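The random undersampling strategy described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, data layout, and the per-query sampling policy are assumptions. All relevant documents are kept, while only a fraction of each query's irrelevant documents is retained.

```python
import random

def undersample_irrelevant(examples, keep_ratio, seed=0):
    """Randomly undersample irrelevant documents per query.

    examples: list of (query_id, features, label) tuples, where
    label > 0 marks a relevant document. keep_ratio is the fraction
    of irrelevant documents to retain for each query.
    """
    rng = random.Random(seed)

    # Group the training examples by query.
    by_query = {}
    for ex in examples:
        by_query.setdefault(ex[0], []).append(ex)

    reduced = []
    for qid, docs in by_query.items():
        relevant = [d for d in docs if d[2] > 0]
        irrelevant = [d for d in docs if d[2] == 0]
        # Keep all relevant documents; sample a fraction of the rest
        # (at least one, so every query retains a negative example).
        k = max(1, int(len(irrelevant) * keep_ratio))
        reduced.extend(relevant)
        reduced.extend(rng.sample(irrelevant, min(k, len(irrelevant))))
    return reduced
```

A deterministic variant would replace the random sample with a fixed selection rule, e.g. keeping the irrelevant documents whose feature vectors score highest under some heuristic, but the resulting reduction in training set size (and hence training time) is the same.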





Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Ibrahim, M., Carman, M. (2014). Undersampling Techniques to Re-balance Training Data for Large Scale Learning-to-Rank. In: Jaafar, A., et al. Information Retrieval Technology. AIRS 2014. Lecture Notes in Computer Science, vol 8870. Springer, Cham. https://doi.org/10.1007/978-3-319-12844-3_38

  • DOI: https://doi.org/10.1007/978-3-319-12844-3_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12843-6

  • Online ISBN: 978-3-319-12844-3

  • eBook Packages: Computer Science, Computer Science (R0)
