Assessing ranking metrics in top-N recommendation

Abstract

The evaluation of recommender systems remains an area with unsolved questions at several levels. Choosing an appropriate evaluation metric is one such key issue. Ranking accuracy is generally identified as a prerequisite for a recommendation to be useful, and ranking metrics have therefore been adapted from the Information Retrieval field to the top-N recommendation task. In this article, we undertake a principled analysis of the robustness and the discriminative power of different ranking metrics for the offline evaluation of recommender systems, drawing on previous studies in Information Retrieval. We measure robustness to different sources of incompleteness arising from the sparsity and popularity biases in recommendation. Among other results, we find that precision provides high robustness, while normalized discounted cumulative gain (nDCG) offers the best discriminative power. When dealing with cold users, we also find that the geometric mean is more robust than the arithmetic mean as an aggregation function over users.
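
For readers unfamiliar with the quantities named above, the following minimal Python sketch illustrates precision@N, binary-relevance nDCG@N, and arithmetic versus geometric mean aggregation over users. It is not the evaluation code used in the study (the Notes below point to trec_eval and rec_eval for that purpose); the function names, the smoothing constant, and the toy data are illustrative assumptions.

    # Minimal illustration (not the article's evaluation code): precision@N,
    # binary-relevance nDCG@N, and arithmetic vs. geometric mean aggregation.
    # All names, the smoothing constant eps, and the toy data are assumptions.
    import math
    from statistics import mean

    def precision_at_n(ranking, relevant, n=10):
        # Fraction of the top-n recommended items that are relevant.
        return sum(1 for item in ranking[:n] if item in relevant) / n

    def ndcg_at_n(ranking, relevant, n=10):
        # Discounted cumulative gain with a log2 rank discount, normalised
        # by the DCG of an ideal ranking of the user's relevant items.
        dcg = sum(1.0 / math.log2(rank + 2)
                  for rank, item in enumerate(ranking[:n]) if item in relevant)
        idcg = sum(1.0 / math.log2(rank + 2)
                   for rank in range(min(len(relevant), n)))
        return dcg / idcg if idcg > 0 else 0.0

    def aggregate(per_user_scores, geometric=False, eps=1e-5):
        # Arithmetic mean, or a smoothed geometric mean (GMAP-style), which
        # gives relatively more weight to poorly served (e.g. cold) users.
        if geometric:
            return math.exp(mean(math.log(s + eps) for s in per_user_scores)) - eps
        return mean(per_user_scores)

    # Toy example: two users with top-5 recommendation lists and held-out test items.
    rankings = {"u1": ["a", "b", "c", "d", "e"], "u2": ["f", "g", "h", "i", "j"]}
    test_items = {"u1": {"a", "d"}, "u2": {"j"}}
    scores = [ndcg_at_n(rankings[u], test_items[u], n=5) for u in rankings]
    print("AM:", aggregate(scores), "GM:", aggregate(scores, geometric=True))

With this toy data, the arithmetic mean is pulled upwards by the well-served user, while the geometric mean stays closer to the score of the poorly served one; this is the behaviour that makes the geometric mean a more robust aggregate when cold users dominate the evaluation.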

Notes

  1. https://github.com/usnistgov/trec_eval.

  2. https://github.com/dvalcarce/rec_eval.

  3. Actually, we limit the plot to the first 20 p-values.

  4. https://grouplens.org/datasets/movielens.

  5. http://snap.stanford.edu/data/web-BeerAdvocate.html.

Author information

Corresponding author

Correspondence to Javier Parapar.

Additional information

This work is supported by the Spanish Ministry of Science, Innovation and Universities and the ERDF (Projects TIN2016-80630-P and RTI2018-093336-B-C22) and by the Regional Government of Galicia and the ERDF (accreditation ED431G/01 and ED431B 2019/03). The authors also acknowledge the very helpful feedback from the anonymous reviewers.

About this article

Cite this article

Valcarce, D., Bellogín, A., Parapar, J. et al. Assessing ranking metrics in top-N recommendation. Inf Retrieval J 23, 411–448 (2020). https://doi.org/10.1007/s10791-020-09377-x

Keywords

  • Recommender systems
  • Top-N recommendation
  • Evaluation
  • Ranking metrics
  • Robustness
  • Discriminative power