Modelling Randomness in Relevance Judgments and Evaluation Measures

  • Marco Ferrante
  • Nicola Ferro
  • Silvia Pontarollo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10772)


We propose a general stochastic approach that defines relevance as a set of binomial random variables, where the expectation p of each variable indicates the amount of relevance associated with each relevance grade. This is a first step towards modelling evaluation measures as transformations of random variables, turning them into random evaluation measures. A consequence of this new approach is that it removes the distinction between binary and multi-graded measures and, at the same time, deals with incomplete information, providing a single unified framework for all these different aspects. We experiment on TREC collections to show how these new random measures correlate with existing ones and which desirable properties they have, such as robustness to pool downsampling and discriminative power.
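The core idea can be illustrated with a minimal sketch: if each retrieved document is relevant with some probability p (a stand-in for the expectation of the binomial relevance variable the abstract describes), then any evaluation measure computed over those random labels becomes a random variable itself. The sketch below uses precision at k, whose linearity makes its expectation available in closed form, and checks it against a Monte Carlo estimate. The function names, the choice of precision@k, and the probability values are illustrative assumptions, not taken from the paper.

```python
import random
from statistics import mean

def expected_precision(probs, k):
    """Exact expectation of precision@k when document i is relevant
    with probability probs[i], drawn independently. Precision is
    linear in the relevance labels, so the expectation is simply
    the average success probability over the top k documents."""
    return sum(probs[:k]) / k

def sampled_precision(probs, k, n_samples=20000, seed=42):
    """Monte Carlo view of the same random measure: repeatedly draw
    binary relevance labels and average precision@k over the draws."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        labels = [1 if rng.random() < p else 0 for p in probs[:k]]
        draws.append(sum(labels) / k)
    return mean(draws)

# Hypothetical success probabilities for a ranked list of three
# documents (e.g. derived from graded judgments; values are made up):
probs = [1.0, 0.5, 0.0]
print(expected_precision(probs, 3))  # exact expectation: 0.5
print(sampled_precision(probs, 3))   # Monte Carlo estimate near 0.5
```

Because precision is linear, the expected measure here is just precision computed on the probabilities themselves; for non-linear measures one would fall back on sampling or on a distributional analysis, which is where treating measures as transformations of random variables becomes useful.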



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Marco Ferrante (1)
  • Nicola Ferro (2)
  • Silvia Pontarollo (1)
  1. Department of Mathematics, University of Padua, Padua, Italy
  2. Department of Information Engineering, University of Padua, Padua, Italy