Evaluation for Multilingual Information Retrieval Systems

Abstract

This chapter discusses IR system evaluation with particular reference to the multilingual context, and presents the most commonly used measures and models. The main focus is on system performance from the viewpoint of retrieval effectiveness. However, we also discuss evaluation from a user-oriented perspective and address questions such as how to assess whether the system satisfies the requirements of its users. The objective is to give the reader a working knowledge of how to set about MLIR/CLIR system evaluation. In addition, we report on some of the ways in which evaluation experiments and evaluation campaigns have helped to achieve greater understanding of the issues involved in MLIR/CLIR system development and have contributed to advancing the state of the art.

Notes

  1. Of the order of |documents| × |topics|; for a collection of 100,000 documents and 50 topics, for example, this already amounts to 5,000,000 relevance judgments.

  2. The recent introduction of alternative methods, such as the Amazon Mechanical Turk crowdsourcing service, can help considerably to reduce the effort required to create test collections; see Section 5.2.5.

  3. http://trec.nist.gov/

  4. http://research.nii.ac.jp/ntcir/

  5. http://www.clef-campaign.org/

  6. http://www.isical.ac.in/~clia/index.html

  7. The term ‘ad hoc’ (often written ‘ad-hoc’) reflects the arbitrary subject of the search and its short duration; other typical examples of this type of search are web searches.

  8. More details on Known Item Retrieval are given in Section 5.4.

  9. The document retrieval track in CLEF typically offers tasks for monolingual retrieval, where the topics and the target collection are in the same language; so-called bilingual retrieval, where the topics are in a different language from the target collection; and multilingual retrieval, where topics in one language are used to find relevant documents in a target collection in a number of different languages.

  10. Two multilingual tasks were offered in CLEF 2003; participants could test their system using queries in one language (from a choice of 12 languages) against multilingual collections in either four or eight languages.

  11. http://www.benchathlon.net/

  12. http://www-nlpir.nist.gov/projects/trecvid/

  13. http://www.imageval.org/

  14. A common way to measure the effectiveness of a CLIR system is to compare its performance against a monolingual baseline, as sketched below.
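    For example, cross-language effectiveness is often reported as a percentage of the corresponding monolingual score; a minimal sketch with hypothetical MAP values and variable names of our own choosing:

        # Hypothetical mean average precision (MAP) scores over the same topic set.
        monolingual_map = 0.45     # baseline: topics and documents in the same language
        cross_language_map = 0.38  # topics translated from another language

        relative_effectiveness = cross_language_map / monolingual_map
        print(f"CLIR run reaches {relative_effectiveness:.1%} of the monolingual baseline")
        # -> CLIR run reaches 84.4% of the monolingual baseline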

  15. This term was coined to describe tasks that were outsourced to a large group of people rather than being performed by an individual on-site (Howe 2006).

  16. All these measures are defined only for cases where there is at least one relevant document; it is meaningless to assess ranking performance when no document is relevant to the query. A minimal guard for this case is sketched below.
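    A minimal sketch of such a guard in Python, using average precision as the example measure (the function and the document identifiers are illustrative, not from the chapter):

        def average_precision(ranked_doc_ids, relevant_doc_ids):
            """Average precision for one query; undefined (None) if nothing is relevant."""
            relevant = set(relevant_doc_ids)
            if not relevant:
                # Rank-based measures are undefined for queries without relevant documents.
                return None
            hits = 0
            precision_sum = 0.0
            for rank, doc_id in enumerate(ranked_doc_ids, start=1):
                if doc_id in relevant:
                    hits += 1
                    precision_sum += hits / rank
            # Normalise by all relevant documents, not only those that were retrieved.
            return precision_sum / len(relevant)

        print(average_precision(["d3", "d7", "d1"], {"d1", "d9"}))  # ~0.167: one of two relevant documents, found at rank 3
        print(average_precision(["d3", "d7", "d1"], set()))         # None: the measure is not defined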

  17. The following paragraphs draw extensively from the description in Schäuble (1997). We thank Peter Schäuble for his kind permission to base our description on his work.

  18. Assumed to be non-empty.

  19. In practice, D_non(q_i) is rarely determined by an exhaustive relevance assessment process. After an approximation of D_rel(q_i) is obtained using the pooling method, it is instead assumed that D_non(q_i) = D \ D_rel(q_i), where D is the full document collection.
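    Expressed as code, the assumption is plain set subtraction; a small sketch with made-up identifiers:

        # D: the full document collection; the pooled judgments approximate D_rel(q_i).
        collection = {"d1", "d2", "d3", "d4", "d5"}
        pooled_relevant = {"d2", "d5"}

        # Everything outside the (approximated) relevant set is assumed non-relevant:
        # D_non(q_i) = D \ D_rel(q_i)
        assumed_nonrelevant = collection - pooled_relevant   # {"d1", "d3", "d4"}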

  20. Note that, in theory, this is not an issue if systems produce exhaustive result lists that rank every item in the document collection: in that case, under the assumption that every query has at least one relevant document, the relevant documents will inevitably appear somewhere in the ranking, leading to an average precision that may be very small but is greater than zero.

  21. Please note that P(Data|H0) is not the same as P(H0|Data), i.e., the probability that H0 is true given the data. As a consequence, if P(Data|H0) is high, we can only conclude that there is not enough evidence for a difference, not that the two runs behave the same.
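    To make the distinction concrete, the sketch below applies a paired t-test (one common choice; the per-topic scores and names are ours for illustration) to the average precision values of two hypothetical runs:

        from scipy.stats import ttest_rel

        # Hypothetical per-topic average precision scores for two runs over the same eight topics.
        run_a = [0.32, 0.45, 0.11, 0.78, 0.25, 0.40, 0.66, 0.09]
        run_b = [0.30, 0.50, 0.15, 0.70, 0.28, 0.38, 0.71, 0.12]

        result = ttest_rel(run_a, run_b)
        p_value = result.pvalue

        # p_value estimates P(Data|H0): how likely differences at least this large would be
        # if the two runs were equivalent. It is not P(H0|Data); a large p_value therefore
        # only means there is not enough evidence for a difference.
        if p_value < 0.05:
            print(f"Difference is significant at the 5% level (p = {p_value:.3f})")
        else:
            print(f"No significant difference detected (p = {p_value:.3f})")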

  22. See Borlund (2003) for examples of the application of simulated work task situations.

  23. These scenarios were used in the MultiMatch project, which developed a multilingual multimedia search engine prototype for the access and personalised presentation of cultural heritage information (http://www.multimatch.eu/).

  24. A Latin square is an n × n table filled with n different symbols in such a way that each symbol occurs exactly once in each row and exactly once in each column.
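    A cyclic construction gives the simplest example; the short sketch below (our own illustration) builds such a square, of the kind used to rotate topics and systems across participants in interactive experiments:

        def latin_square(n):
            """Return an n x n Latin square obtained by cyclically shifting the symbols 0..n-1."""
            return [[(row + col) % n for col in range(n)] for row in range(n)]

        for row in latin_square(4):
            print(row)
        # [0, 1, 2, 3]
        # [1, 2, 3, 0]
        # [2, 3, 0, 1]
        # [3, 0, 1, 2]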

  25. The trec_eval package, developed by Chris Buckley of Sabir Research and used in both TREC and CLEF evaluations, is available free of charge: http://trec.nist.gov/trec_eval/

References

  • Alonso O, Rose DE, Stewart B (2008) Crowdsourcing for relevance evaluation. ACM SIGIR Forum. 42(2): 9–15

  • Aslam JA, Yilmaz E, Pavlu V (2005) A geometric interpretation of R-precision and its correlation with average precision. In: Proc. 28th ACM SIGIR conference on research and development in information retrieval (SIGIR 2005). ACM Press: 573–574

  • Bailey P, Craswell N, Soboroff I, Thomas P, de Vries AP, Yilmaz E (2008) Relevance assessment: Are judges exchangeable and does it matter? In: Proc. 31st ACM SIGIR conference on research and development in information retrieval. ACM Press: 667–674

  • Borlund P (2003) The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. J Inf. Res. 8(3)

  • Borlund P (2009) User-centred evaluation of information retrieval systems. Chapter 2. In: Göker A, Davies J (eds.) Information Retrieval: Searching in the 21st Century. John Wiley & Sons: 21–37

  • Braschler M (2004a) CLEF 2003 – Overview of results. In: Comparative evaluation of multilingual information access systems. 4th workshop of the Cross-Language Evaluation Forum, CLEF 2003, Springer LNCS 3237: 44–63

  • Braschler M (2004b) Robust multilingual information retrieval. Doctoral Thesis, Institut interfacultaire d’informatique, Université de Neuchâtel

  • Braschler M, Gonzalo J (2009) Best practices in system and user-oriented multilingual information access. TrebleCLEF Project: http://www.trebleclef.eu/

  • Braschler M, Peters C (eds.) (2004) Cross-language evaluation forum. J. Inf. Retr. 7(1–2)

  • Buckley C, Voorhees E (2004) Retrieval evaluation with incomplete information. In Proc. 27th ACM SIGIR conference on research and development in information retrieval (SIGIR 2004). ACM Press: 25–32

  • Carterette B, Pavlu V, Kanoulas E, Aslam JA, Allan J (2008) Evaluation over thousands of queries. In: Proc. 31st ACM SIGIR conference on research and development in information retrieval (SIGIR 2008). ACM Press: 651–658

  • Cleverdon CW (1967) The Cranfield tests on index language devices. In: Aslib Proc. 19(6): 173–192

  • Cleverdon CW (1970) The effect of variations in relevance assessments in comparative experimental tests of index languages. Technical Report No. 3, Cranfield Institute of Technology, Cranfield, UK, 1970

  • Clough P, Sanderson M, Müller H (2004) The CLEF cross language image retrieval track (ImageCLEF) 2004. In: Image and Video Retrieval (CIVR 2004), Springer, LNCS 3115: 243–251

  • Clough P, Gonzalo J, Karlgren J (2010a) Creating re-useable log files for interactive CLIR, In SIGIR 2010 Workshop on the Simulation of Interaction (SimInt), Geneva, Switzerland, 23 July 2010

  • Clough P, Müller H, Sanderson M (2010b) Seven years of image retrieval evaluation. In: Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) (2010) ImageCLEF experimental evaluation in visual information retrieval. The Information Retrieval Series. Springer 32: 3–18

  • Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Proc. 21st ACM SIGIR conference on research and development in information retrieval (SIGIR 1998). ACM Press: 282–289

  • Croft B, Metzler D, Strohman T (2009) Evaluating search engines. In: Search engines: Information retrieval in practice 1st ed., Addison Wesley: 269–307

  • Dunlop M (2000) Reflections on MIRA: Interactive evaluation in information retrieval. J. Am. Soc. for Inf. Sci. 51(14): 1269–1274

  • Geva S, Kamps J, Peters C, Sakai T, Trotman A, Voorhees E (eds.) (2009) Proc. SIGIR 2009 workshop on the future of IR evaluation. SIGIR 2009, Boston USA. http://staff.science.uva.nl/~kamps/publications/2009/geva:futu09.pdf

  • Gonzalo J, Oard DW (2003). The CLEF 2002 interactive track. In: Advances in cross-language information retrieval. 3rd workshop of the Cross-Language Evaluation Forum, CLEF 2002. Springer LNCS 2785: 372–382

  • Hansen P (1998) Evaluation of IR user interface – Implications for user interface design. Human IT. http://www.hb.se/bhs/it

  • Harman DK (ed.) (1992) Evaluation issues in information retrieval. J. Inform. Process. & Manag. 28(4)

  • Harman DK (2005) The TREC test collections. In: Voorhees EM, Harman DK (eds.) TREC: experiment and evaluation in information retrieval, MIT Press, 2005

  • Howe J (2006). The rise of crowdsourcing. Wired, June 2006. http://www.wired.com/magazine/

  • Hull D (1993) Using statistical testing in the evaluation of retrieval experiments. In: Proc. 16th ACM SIGIR conference on research and development in information retrieval (SIGIR 1993). ACM Press: 329–338

  • Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proc. 23rd ACM SIGIR conference on research and development in information retrieval (SIGIR 2000). ACM Press: 41–48

  • Jansen BJ, Spink A (2006) How are we searching the World Wide Web? A comparison of nine search engine transaction logs. J. Inform. Process. & Manag. 42(1): 248–263

  • Kando N, Kuriyama K, Nozue T, Eguchi K, Kato H, Hidaka S (1999) Overview of IR tasks at the first NTCIR workshop. In Proc. First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition: 11–44

  • Karlgren J, Gonzalo J (2010) Interactive image retrieval. In Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) ImageCLEF. Experimental evaluation in visual information retrieval. The Information Retrieval Series. Springer 32: 117–139

  • Kelly D. (2009) Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval. 228 p.

  • Lesk ME, Salton G. (1969) Relevance assessments and retrieval system evaluation. J Inf. Storage and Retr. 4: 343–359

  • Leung C, Ip H (2000) Benchmarking for content-based visual information search. In Fourth International Conference on Visual Information Systems (VISUAL’2000), LNCS 1929, Springer-Verlag: 442–456

  • Mandl T (2009) Easy tasks dominate information retrieval evaluation results. In Lecture Notes in Informatics. Datenbanksysteme in Business, Technologie und Web (BTW): 107–116

  • Mandl T, Womser-Hacker C, Di Nunzio GM, Ferro N (2008) How robust are multilingual information retrieval systems? Proc. SAC 2008 - ACM Symposium on Applied Computing: 1132–1136

  • Mizzaro S (1998) Relevance: The whole history. J. of Am. Soc. for Inf. Sci. 48(9): 810–832

  • Müller H, Müller W, Squire DM, Marchand-Maillet S, Pun T (2001) Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recognit. Lett. 22(5): 593–601

  • Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) (2010) ImageCLEF. Experimental evaluation in visual information retrieval. The Information Retrieval Series, Springer 32: 495 p.

  • Peters C (ed.) (2000) First results of the CLEF 2000 cross-language text retrieval system evaluation campaign. Working notes for the CLEF 2000 workshop, ERCIM-00-W01: 142

  • Petrelli D (2007) On the role of user-centred evaluation in the advancement of interactive information retrieval. J. Inform. Process & Manag. 44(1): 22–38

  • van Rijsbergen CJ (1979) Evaluation. In: Information Retrieval 2nd ed., Butterworths

  • Robertson S (2006) On GMAP: and other transformations. Proc. 15th ACM international conference on information and knowledge management (CIKM ’06). ACM Press: 78–83

  • Robertson S (2008) On the history of evaluation in IR. J. of Inf. Sci. 34(4): 439–456

  • Salton G (1968) Automatic information organization and retrieval. McGraw-Hill

  • Sanderson M, Braschler M (2009) Best practices for test collection creation and information retrieval system evaluation, TrebleCLEF Project: http://www.trebleclef.eu

  • Sanderson M, Zobel J (2005) Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proc. 28th ACM SIGIR conference on research and development in information retrieval (SIGIR 2005). ACM Press: 162–169

  • Sanderson M, Paramita M, Clough P, Kanoulas E (2010) Do user preferences and evaluation measures line up? In: Proc. 33rd ACM SIGIR conference on research and development in information retrieval (SIGIR 2010). ACM Press: 555–562

  • Saracevic T (1995) Evaluation of evaluation in information retrieval. In: 18th ACM SIGIR conference on research and development in information retrieval (SIGIR 1995), ACM Press: 138–146

  • Schäuble P (1997) Multimedia information retrieval: Content-based information retrieval from large text and audio databases, Kluwer Academic Publishers

  • Schamber L (1994) Relevance and information behaviour. Annual Review of Information Science and Technology, 29: 3–48

  • Smith JR (1998) Image retrieval evaluation. In: Proc. IEEE Workshop on Content-based Access of Image and Video Libraries: 112–113

  • Spärck Jones K (1981) Information Retrieval Experiment. Butterworth-Heinemann Ltd. 352 p.

  • Spärck Jones K, van Rijsbergen CJ (1975) Report on the need for and provision of an “ideal” information retrieval test collection. British Library Research and Development report 5266. Cambridge: Computer Laboratory, University of Cambridge

  • Su LT (1992) Evaluation measures for interactive information retrieval. J. Inform. Process & Manag. 28(4): 503–516

  • Tague-Sutcliffe J (1992) The pragmatics of information retrieval experimentation, revisited. J. Inform. Process & Manag. 28(4): 467–490

  • Tague-Sutcliffe J (ed.) (1996) Evaluation of information retrieval systems, J. Am. Soc. for Inf. Sci. 47(1)

  • Tomlinson S (2010) Sampling precision to depth 10000 at CLEF 2009. In: Multilingual information access evaluation Part I: Text retrieval experiments: 10th workshop of the Cross-Language Evaluation Forum, CLEF 2009, Springer, LNCS 6241: 78–85

  • Voorhees EM (2000) Variations in relevance judgments and the measurement of retrieval effectiveness. J Inform. Process. & Manag. 36(5): 697–716

  • Voorhees EM (2002) The philosophy of information retrieval evaluation. In: Evaluation of cross-language information retrieval systems: 2nd workshop of the Cross-Language Evaluation Forum, CLEF 2001, Springer, LNCS 2406: 355–370

  • Voorhees EM (2006) Overview of TREC 2006. Fifteenth text retrieval conference (TREC 2006) Proc. NIST Special Publication SP500-272. http://trec.nist.gov/pubs/trec15/

  • Voorhees EM, Buckley C (2002) The effect of topic set size on retrieval experiment error. In: Proc. 25th ACM SIGIR conference on research and development in information retrieval (SIGIR 2002). ACM Press: 316–323

  • Voorhees EM, Garofolo JS (2005) Retrieving noisy text. In: Voorhees EM, Harman DK (eds.) TREC: experiment and evaluation in information retrieval. MIT Press: 183–197

  • Voorhees EM, Harman DK (eds.) (2005) TREC: experiment and evaluation in information retrieval. The MIT Press, Cambridge, MA

  • Womser-Hacker C (2002) Multilingual topic generation within the CLEF 2001 experiments. In: Evaluation of cross-language information retrieval systems: 2nd workshop of the Cross-Language Evaluation Forum, CLEF 2001, Springer, LNCS 2406: 255–262

  • Zobel J (1998) How reliable are the results of large-scale information retrieval experiments? In: Proc. 21st ACM SIGIR conference on research and development in information retrieval (SIGIR 1998). ACM Press: 307–314

Author information

Corresponding author

Correspondence to Carol Peters.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

Cite this chapter

Peters, C., Braschler, M., Clough, P. (2012). Evaluation for Multilingual Information Retrieval Systems. In: Multilingual Information Retrieval. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23008-0_5

  • DOI: https://doi.org/10.1007/978-3-642-23008-0_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23007-3

  • Online ISBN: 978-3-642-23008-0
