The Evolution of Cranfield

Part of the book series: The Information Retrieval Series (INRE, volume 41)

Abstract

Evaluating search system effectiveness is a foundational hallmark of information retrieval research. Doing so requires infrastructure appropriate for the task at hand, which generally follows the Cranfield paradigm: test collections and associated evaluation measures. A primary purpose of Information Retrieval (IR) evaluation campaigns such as the Text REtrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF) is to build this infrastructure. The first TREC collections targeted the same task as the original Cranfield tests and used measures that were familiar to test collection users of the time. But as evaluation tasks have multiplied and diversified, test collection construction techniques and evaluation measure definitions have also been forced to evolve. This chapter examines how the Cranfield paradigm has been adapted to meet the changing requirements for search systems, enabling it to continue to support a vibrant research community.
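
To make the paradigm concrete, the sketch below scores a ranked run against a test collection's relevance judgments (qrels) using mean average precision, one of the measures long associated with Cranfield-style batch evaluation. This is an illustration added here, not code from the chapter; it assumes TREC-format qrels and run files, and the file paths in the example are placeholders.

```python
# Minimal sketch of Cranfield-style evaluation: score a ranked run
# against a test collection's relevance judgments (qrels).
# Assumes TREC-format input files; the file names below are placeholders.
from collections import defaultdict

def load_qrels(path):
    """qrels lines: topic iteration docid relevance"""
    rels = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _, docid, rel = line.split()
            if int(rel) > 0:          # treat any positive grade as relevant
                rels[topic].add(docid)
    return rels

def load_run(path):
    """run lines: topic Q0 docid rank score tag"""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            topic, _, docid, rank, score, _ = line.split()
            run[topic].append((float(score), docid))
    # rank documents for each topic by descending retrieval score
    return {t: [d for _, d in sorted(docs, reverse=True)]
            for t, docs in run.items()}

def average_precision(ranked, relevant):
    """AP: mean of precision at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

if __name__ == "__main__":
    qrels = load_qrels("qrels.txt")   # placeholder paths
    run = load_run("run.txt")
    aps = [average_precision(run.get(t, []), rels) for t, rels in qrels.items()]
    print(f"MAP over {len(aps)} topics: {sum(aps) / len(aps):.4f}")
```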

Author information

Correspondence to Ellen M. Voorhees.

Copyright information

© 2019 This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply

About this chapter

Cite this chapter

Voorhees, E.M. (2019). The Evolution of Cranfield. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_2

  • DOI: https://doi.org/10.1007/978-3-030-22948-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22947-4

  • Online ISBN: 978-3-030-22948-1

  • eBook Packages: Computer Science, Computer Science (R0)
