The Evolution of Cranfield

Part of the book series: The Information Retrieval Series (INRE, volume 41)

Abstract

Evaluating search system effectiveness is a foundational hallmark of information retrieval research. Doing so requires infrastructure appropriate for the task at hand, which generally follows the Cranfield paradigm: test collections and associated evaluation measures. A primary purpose of Information Retrieval (IR) evaluation campaigns such as the Text REtrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF) is to build this infrastructure. The first TREC collections targeted the same task as the original Cranfield tests and used measures that were familiar to test collection users of the time. But as evaluation tasks have multiplied and diversified, test collection construction techniques and evaluation measure definitions have also been forced to evolve. This chapter examines how the Cranfield paradigm has been adapted to meet the changing requirements for search systems, enabling it to continue to support a vibrant research community.
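
To make the paradigm concrete, the sketch below scores a ranked run against a test collection's relevance judgments (qrels) using mean average precision, one of the measures long associated with Cranfield-style batch evaluation. This is an illustration added here, not code from the chapter; it assumes TREC-format qrels and run files, and the file paths in the example are placeholders.

```python
# Minimal sketch of Cranfield-style evaluation: score a ranked run
# against a test collection's relevance judgments (qrels).
# Assumes TREC-format input files; the file names below are placeholders.
from collections import defaultdict

def load_qrels(path):
    """qrels lines: topic iteration docid relevance"""
    rels = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _, docid, rel = line.split()
            if int(rel) > 0:          # treat any positive grade as relevant
                rels[topic].add(docid)
    return rels

def load_run(path):
    """run lines: topic Q0 docid rank score tag"""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _, docid, rank, score, _ = line.split()
            run[topic].append((float(score), docid))
    # rank documents for each topic by descending retrieval score
    return {t: [d for _, d in sorted(docs, reverse=True)]
            for t, docs in run.items()}

def average_precision(ranked, relevant):
    """AP: mean of precision at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

if __name__ == "__main__":
    qrels = load_qrels("qrels.txt")   # placeholder paths
    run = load_run("run.txt")
    aps = [average_precision(run.get(t, []), rels) for t, rels in qrels.items()]
    print(f"MAP over {len(aps)} topics: {sum(aps) / len(aps):.4f}")
```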

Author information

Correspondence to Ellen M. Voorhees.

Copyright information

© 2019 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply

About this chapter

Cite this chapter

Voorhees, E.M. (2019). The Evolution of Cranfield. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_2

  • DOI: https://doi.org/10.1007/978-3-030-22948-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22947-4

  • Online ISBN: 978-3-030-22948-1

  • eBook Packages: Computer Science, Computer Science (R0)
