Abstract
This chapter discusses IR system evaluation with particular reference to the multilingual context, and presents the most commonly used measures and models. The main focus is on system performance from the viewpoint of retrieval effectiveness. However, we also discuss evaluation from a user-oriented perspective and address questions such as how to assess whether the system satisfies the requirements of its users. The objective is to give the reader a working knowledge of how to set about MLIR/CLIR system evaluation. In addition, we report on some of the ways in which evaluation experiments and evaluation campaigns have helped to achieve greater understanding of the issues involved in MLIR/CLIR system development and have contributed to advancing the state of the art.
Notes
1. Of the order |documents| × |topics|.
2. The recent introduction of alternative methods, such as the Amazon Mechanical Turk crowdsourcing service, can help considerably to reduce the effort required to create test collections; see Section 5.2.5.
3.
4.
5.
6.
7. The term ‘ad hoc’ (often written ‘ad-hoc’) reflects the arbitrary subject of the search and its short duration; other typical examples of this type of search are web searches.
8. More details on Known Item Retrieval are given in Section 5.4.
9. The document retrieval track in CLEF typically offers tasks for monolingual retrieval, where the topics and the target collection are in the same language; so-called bilingual retrieval, where the topics are in a different language from the target collection; and multilingual retrieval, where topics in one language are used to find relevant documents in a target collection in a number of different languages.
10. Two multilingual tasks were offered in CLEF 2003; participants could test their system using queries in one language (from a choice of 12 languages) against multilingual collections in either four or eight languages.
11.
12.
13.
14. A common way to measure the effectiveness of a CLIR system is to compare performance against a monolingual baseline.
15. This term was coined to describe tasks that were outsourced to a large group of people rather than being performed by an individual on-site (Howe 2006).
16. All these measures are only defined when there is at least one relevant document; it is meaningless to assess ranking performance when no document is relevant to the query.
17. The following paragraphs draw extensively from the description in Schäuble (1997). We thank Peter Schäuble for his kind permission to base our description on his work.
18. Assumed to be non-empty.
19. In practice, D_non(q_i) is rarely determined by an exhaustive relevance assessment process. After an approximation of D_rel(q_i) is obtained using the pooling method, it is instead assumed that D_non(q_i) = D \ D_rel(q_i).
20. Note that theoretically this is not an issue if systems produce exhaustive result lists that rank every item in the document collection; in such cases, under the assumption that every query has at least one relevant document, the relevant documents will inevitably turn up at some point in the ranking, leading to an average precision which may be very small, but is greater than zero.
21. Please note that P(Data|H0) is not the same as P(H0|Data), i.e., the probability that H0 is true given the data. Consequently, if P(Data|H0) is high, we can only conclude that there is not enough evidence for a difference, not that the two runs behave the same.
22. See Borlund (2003) for examples of the application of simulated work task situations.
23. These scenarios were used in the MultiMatch project, which developed a multilingual multimedia search engine prototype for access to, and personalised presentation of, cultural heritage information (http://www.multimatch.eu/).
24. A Latin square is an n × n table filled with n different symbols in such a way that each symbol occurs exactly once in each row and exactly once in each column.
25. The trec_eval package, developed by Chris Buckley of Sabir Research and used in both TREC and CLEF evaluations, is available free of charge: http://trec.nist.gov/trec_eval/
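Notes 16 and 20 above concern average precision: it is undefined when a query has no relevant documents, and relevant documents missing from a truncated ranking simply contribute zero. A minimal sketch of the measure, assuming binary relevance judgments (the function and document names are illustrative, not part of the chapter):

```python
def average_precision(ranking, relevant):
    """Average precision for one query: mean of precision@k over the
    ranks k at which a relevant document appears. Undefined (raised
    here) when the query has no relevant documents (cf. note 16)."""
    if not relevant:
        raise ValueError("AP is undefined when no document is relevant")
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    # Divide by |relevant|, not by the number of hits: relevant
    # documents absent from the ranking count as zero (cf. note 20).
    return sum(precisions) / len(relevant)

# Example: relevant docs d1 and d4 are ranked at positions 1 and 3,
# so AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision(["d1", "d2", "d4", "d3"], {"d1", "d4"})
```

With an exhaustive ranking every relevant document appears somewhere, so AP stays strictly positive, as note 20 observes.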
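Note 21's distinction between P(Data|H0) and P(H0|Data) can be made concrete with a paired randomization (permutation) test over per-topic scores of two runs; the chapter does not prescribe this particular test, so the sketch below is only one common choice (names are illustrative):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired randomization test on per-topic scores of two runs.
    Under H0 the run labels are exchangeable per topic, so we flip
    each topic's score pair at random and count how often the
    absolute mean difference is at least as large as observed.
    The returned fraction estimates P(Data | H0)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            extreme += 1
    return extreme / trials
```

A high returned value means only that the data are unsurprising under H0, i.e. there is not enough evidence for a difference; as note 21 stresses, it does not show that the two runs behave the same.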
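The Latin square of note 24, used to balance order effects in interactive experiments, has a standard cyclic construction; a minimal sketch (the function name is illustrative):

```python
def latin_square(n):
    """Cyclic n x n Latin square: row i is the symbols 0..n-1 rotated
    left by i positions, so each symbol occurs exactly once in each
    row and exactly once in each column."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]
```

In a user study, rows can index subjects, columns the task order, and symbols the system or topic assignment, so that every system is seen in every position equally often.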
References
Alonso O, Rose DE, Stewart B (2008) Crowdsourcing for relevance evaluation. ACM SIGIR Forum. 42(2): 9–15
Aslam JA, Yilmaz E, Pavlu V (2005) A geometric interpretation of R-precision and its correlation with average precision. In: Proc. 28th ACM SIGIR conference on research and development in information retrieval (SIGIR 2005). ACM Press: 573–574
Bailey P, Craswell N, Soboroff I, Thomas P, de Vries AP, Yilmaz E (2008) Relevance assessment: Are judges exchangeable and does it matter? In: Proc. 31st ACM SIGIR conference on research and development in information retrieval. ACM Press: 667–674
Borlund P (2003) The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Inf. Res. 8(3)
Borlund P (2009) User-centred evaluation of information retrieval systems. Chapter 2. In: Göker A, Davies J (eds.) Information Retrieval: Searching in the 21st Century. John Wiley & Sons: 21–37
Braschler M (2004a) CLEF 2003 – Overview of results. In: Comparative evaluation of multilingual information access systems. 4th workshop of the Cross-Language Evaluation Forum, CLEF 2003, Springer LNCS 3237: 44–63
Braschler M (2004b) Robust multilingual information retrieval. Doctoral Thesis, Institut interfacultaire d’informatique, Université de Neuchâtel
Braschler M, Gonzalo J (2009) Best practices in system and user-oriented multilingual information access. TrebleCLEF Project: http://www.trebleclef.eu/
Braschler M, Peters C (eds.) (2004) Cross-language evaluation forum. J. Inf. Retr. 7(1–2)
Buckley C, Voorhees E (2004) Retrieval evaluation with incomplete information. In Proc. 27th ACM SIGIR conference on research and development in information retrieval (SIGIR 2004). ACM Press: 25–32
Carterette B, Pavlu V, Kanoulas E, Aslam JA, Allan J (2008) Evaluation over thousands of queries. In: Proc. 31st ACM SIGIR conference on research and development in information retrieval (SIGIR 2008). ACM Press: 651–658
Cleverdon CW (1967) The Cranfield tests on index language devices. In: Aslib Proc. 19(6): 173–192
Cleverdon CW (1970) The effect of variations in relevance assessments in comparative experimental tests of index languages. Technical Report No. 3, Cranfield Institute of Technology, Cranfield, UK, 1970
Clough P, Sanderson M, Müller H (2004) The CLEF cross language image retrieval track (ImageCLEF) 2004. In: Image and Video Retrieval (CIVR 2004), Springer, LNCS 3115: 243–251
Clough P, Gonzalo J, Karlgren J (2010a) Creating re-useable log files for interactive CLIR, In SIGIR 2010 Workshop on the Simulation of Interaction (SimInt), Geneva, Switzerland, 23 July 2010
Clough P, Müller H, Sanderson M (2010b) Seven years of image retrieval evaluation. In: Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) (2010) ImageCLEF experimental evaluation in visual information retrieval. The Information Retrieval Series. Springer 32: 3–18
Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Proc. 21st ACM SIGIR conference on research and development in information retrieval (SIGIR 1998). ACM Press: 282–289
Croft B, Metzler D, Strohman T (2009) Evaluating search engines. In: Search engines: Information retrieval in practice 1st ed., Addison Wesley: 269–307
Dunlop M (2000) Reflections on MIRA: Interactive evaluation in information retrieval. J. Am. Soc. for Inf. Sci. 51(14): 1269–1274
Geva S, Kamps J, Peters C, Sakai T, Trotman A, Voorhees E (eds.) (2009) Proc. SIGIR 2009 workshop on the future of IR evaluation. SIGIR 2009, Boston USA. http://staff.science.uva.nl/~kamps/publications/2009/geva:futu09.pdf
Gonzalo J, Oard DW (2003). The CLEF 2002 interactive track. In: Advances in cross-language information retrieval. 3rd workshop of the Cross-Language Evaluation Forum, CLEF 2002. Springer LNCS 2785: 372–382
Hansen P (1998) Evaluation of IR user interface – Implications for user interface design. Human IT. http://www.hb.se/bhs/it
Harman DK (ed.) (1992) Evaluation issues in information retrieval. J. Inform. Process. & Manag. 28(4)
Harman DK (2005) The TREC test collections. In: Voorhees EM, Harman DK (eds.) TREC: experiment and evaluation in information retrieval, MIT Press, 2005
Howe J (2006). The rise of crowdsourcing. Wired, June 2006. http://www.wired.com/magazine/
Hull D (1993) Using statistical testing in the evaluation of retrieval experiments. In: Proc. 16th ACM SIGIR conference on research and development in information retrieval (SIGIR 1993). ACM Press: 329–338
Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proc. 23rd ACM SIGIR conference on research and development in information retrieval (SIGIR 2000). ACM Press: 41–48
Jansen BJ, Spink A (2006) How are we searching the World Wide Web? A comparison of nine search engine transaction logs. J. Inform. Process. & Manag. 42(1): 248–263
Kando N, Kuriyama K, Nozue T, Eguchi K, Kato H, Hidaka S (1999) Overview of IR tasks at the first NTCIR workshop. In Proc. First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition: 11–44
Karlgren J, Gonzalo J (2010) Interactive image retrieval. In Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) ImageCLEF. Experimental evaluation in visual information retrieval. The Information Retrieval Series. Springer 32: 117–139
Kelly D. (2009) Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval. 228 p.
Lesk ME, Salton G. (1969) Relevance assessments and retrieval system evaluation. J Inf. Storage and Retr. 4: 343–359
Leung C, Ip H (2000) Benchmarking for content-based visual information search. In Fourth International Conference on Visual Information Systems (VISUAL’2000), LNCS 1929, Springer-Verlag: 442–456
Mandl T (2009) Easy tasks dominate information retrieval evaluation results. In Lecture Notes in Informatics. Datenbanksysteme in Business, Technologie und Web (BTW): 107–116
Mandl T, Womser-Hacker C, Di Nunzio GM, Ferro N (2008) How robust are multilingual information retrieval systems? Proc. SAC 2008 - ACM Symposium on Applied Computing: 1132–1136
Mizzaro S (1998) Relevance: The whole history. J. of Am. Soc. for Inf. Sci. 48(9): 810–832
Müller H, Müller W, Squire DM, Marchand-Maillet S, Pun T (2001) Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recognit. Lett. 22(5): 593–601
Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) (2010) ImageCLEF. Experimental evaluation in visual information retrieval. The Information Retrieval Series, Springer 32: 495 p.
Peters C (ed.) (2000) First results of the CLEF 2000 cross-language text retrieval system evaluation campaign. Working notes for the CLEF 2000 workshop, ERCIM-00-W01: 142
Petrelli D (2007) On the role of user-centred evaluation in the advancement of interactive information retrieval. J. Inform. Process & Manag. 44(1): 22–38
van Rijsbergen CJ (1979) Evaluation. In: Information Retrieval 2nd ed., Butterworths
Robertson S (2006) On GMAP: and other transformations. Proc. 15th ACM international conference on information and knowledge management (CIKM ’06). ACM Press: 78–83
Robertson S (2008) On the history of evaluation in IR. J. of Inf. Sci. 34(4): 439–456
Salton G (1968) Automatic information organization and retrieval. McGraw Hill Text
Sanderson M, Braschler M (2009) Best practices for test collection creation and information retrieval system evaluation, TrebleCLEF Project: http://www.trebleclef.eu
Sanderson M, Zobel J (2005) Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proc. 28th ACM SIGIR conference on research and development in information retrieval (SIGIR 2005). ACM Press: 162–169
Sanderson M, Paramita M, Clough P, Kanoulas E (2010) Do user preferences and evaluation measures line up? In: Proc. 33rd ACM SIGIR conference on research and development in information retrieval (SIGIR 2010). ACM Press: 555–562
Saracevic T (1995) Evaluation of evaluation in information retrieval. In: 18th ACM SIGIR conference on research and development in information retrieval (SIGIR 1995), ACM Press: 138–146
Schäuble P (1997) Multimedia information retrieval: Content-based information retrieval from large text and audio databases, Kluwer Academic Publishers
Schamber L (1994) Relevance and information behaviour. Annual Review of Information Science and Technology, 29: 3–48
Smith JR (1998) Image retrieval evaluation. In: Proc IEEE Workshop on Content-based Access of Image and Video Libraries:112–113
Spärck Jones K (1981) Information Retrieval Experiment. Butterworth-Heinemann Ltd. 352 p.
Spärck Jones K, van Rijsbergen CJ (1975) Report on the need for and provision of an “ideal” information retrieval test collection. British Library Research and Development report 5266. Cambridge: Computer Laboratory, University of Cambridge
Su LT (1992) Evaluation measures for interactive information retrieval. J. Inform. Process & Manag. 28(4): 503–516
Tague-Sutcliffe J (1992) The pragmatics of information retrieval experimentation, revisited. J. Inform. Process & Manag. 28(4): 467–490
Tague-Sutcliffe J (ed.) (1996) Evaluation of information retrieval systems, J. Am. Soc. for Inf. Sci. 47(1)
Tomlinson S (2010) Sampling precision to depth 10000 at CLEF 2009. In: Multilingual information access evaluation Part I: Text retrieval experiments: 10th workshop of the Cross-Language Evaluation Forum, CLEF 2009, Springer, LNCS 6241: 78–85
Voorhees EM (2000) Variations in relevance judgments and the measurement of retrieval effectiveness. J Inform. Process. & Manag. 36(5): 697–716
Voorhees EM (2002) The philosophy of information retrieval evaluation. In: Evaluation of cross-language information retrieval systems: 2nd workshop of the Cross-Language Evaluation Forum, CLEF 2001, Springer, LNCS 2406: 355–370
Voorhees EM (2006) Overview of TREC 2006. Fifteenth text retrieval conference (TREC 2006) Proc. NIST Special Publication SP500-272. http://trec.nist.gov/pubs/trec15/
Voorhees EM, Buckley C (2002) The effect of topic set size on retrieval experiment error. In: Proc. 25th ACM SIGIR conference on research and development in information retrieval (SIGIR 2002). ACM Press: 316–323
Voorhees EM, Garofolo JS (2005) Retrieving noisy text. In Voorhees EM, Harman DK (eds.). TREC. experimentation and evaluation in information retrieval. MIT Press: 183–197
Voorhees EM, Harman DK (eds.) (2005) TREC: experiment and evaluation in information retrieval. The MIT Press, Cambridge, MA
Womser-Hacker C (2002) Multilingual topic generation within the CLEF 2001 experiments. In: Evaluation of cross-language information retrieval systems: 2nd workshop of the Cross-Language Evaluation Forum, CLEF 2001, Springer, LNCS 2406: 255–262
Zobel J (1998) How reliable are the results of large-scale information retrieval experiments? In: Proc. 21st ACM SIGIR conference on research and development in information retrieval (SIGIR 1998). ACM Press: 307–314
© 2012 Springer-Verlag Berlin Heidelberg
Peters, C., Braschler, M., Clough, P. (2012). Evaluation for Multilingual Information Retrieval Systems. In: Multilingual Information Retrieval. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23008-0_5
Print ISBN: 978-3-642-23007-3
Online ISBN: 978-3-642-23008-0