How to Run an Evaluation Task

With a Primary Focus on Ad Hoc Information Retrieval
  • Tetsuya Sakai
Part of The Information Retrieval Series book series (INRE, volume 41)


This chapter provides a general guideline for researchers who are planning to run a shared evaluation task for the first time, with a primary focus on simple ad hoc Information Retrieval (IR). That is, it is assumed that we have a static target document collection and a set of test topics (i.e., search requests), and that participating systems are required to produce a ranked list of documents for each topic. The chapter provides a step-by-step description of what a task organiser team is expected to do. Section 1 discusses how to define the evaluation task; Sect. 2 covers how to publicise it and why doing so is important. Section 3 describes how to design and build test collections, as well as how inter-assessor agreement can be quantified. Section 4 explains how the results submitted by participants can be evaluated; examples of tools for computing evaluation measures and conducting statistical significance tests are provided. Finally, Sect. 5 discusses how the fruits of running the task should be shared with the research community, how progress should be monitored, and how the task design may be improved for the next round. N.B.: A prerequisite to running a successful task is that you have a good team of organisers who can collaborate effectively. Each team member should be well-motivated and committed to running the task. They should respond to emails in a timely manner and should be able to meet deadlines. Organisers should be well-organised!
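As a rough illustration of what quantifying inter-assessor agreement can look like, the sketch below computes Cohen's kappa for two assessors who judged the same documents. This is only one possible agreement statistic and the abstract does not say which ones the chapter uses; the function name `cohens_kappa` and the toy relevance labels are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors labelling the same items (nominal labels).

    Hypothetical helper for illustration; not taken from the chapter.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items on which both assessors agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two assessors' marginal label rates,
    # summed over all labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both assessors used a single identical label throughout, so kappa
        # is undefined (0/0). Returning 1.0 is an assumption of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy binary relevance judgments (1 = relevant, 0 = non-relevant); made up.
assessor_1 = [1, 1, 0, 0, 1, 0, 1, 0]
assessor_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(assessor_1, assessor_2)
print(f"kappa = {kappa:.2f}")
```

Here the assessors agree on 6 of 8 documents (observed agreement 0.75), while chance agreement is 0.5, so kappa comes out at 0.5. Kappa corrects raw agreement for chance, which matters when one label dominates the judgments.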





Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Waseda University, Tokyo, Japan
