Abstract
This chapter discusses IR system evaluation with particular reference to the multilingual context, and presents the most commonly used measures and models. The main focus is on system performance from the viewpoint of retrieval effectiveness. However, we also discuss evaluation from a user-oriented perspective and address questions such as how to assess whether the system satisfies the requirements of its users. The objective is to give the reader a working knowledge of how to set about MLIR/CLIR system evaluation. In addition, we report on some of the ways in which evaluation experiments and evaluation campaigns have helped to achieve greater understanding of the issues involved in MLIR/CLIR system development and have contributed to advancing the state of the art.
Notes
1. Of the order |documents| × |topics|.
2. The recent introduction of alternative methods, such as the Amazon Mechanical Turk crowdsourcing service, can help considerably to reduce the effort required to create test collections; see Section 5.2.5.
3.
4.
5.
6.
7. The term ‘ad hoc’ (often written ‘ad-hoc’) reflects the arbitrary subject of the search and its short duration; other typical examples of this type of search are web searches.
8. More details on Known Item Retrieval are given in Section 5.4.
9. The document retrieval track in CLEF typically offers tasks for monolingual retrieval, where the topics and the target collection are in the same language; so-called bilingual retrieval, where the topics are in a different language from the target collection; and multilingual retrieval, where topics in one language are used to find relevant documents in a target collection in a number of different languages.
10. Two multilingual tasks were offered in CLEF 2003; participants could test their system using queries in one language (from a choice of 12 languages) against multilingual collections in either four or eight languages.
11.
12.
13.
14. A common way to measure the effectiveness of a CLIR system is to compare performance against a monolingual baseline.
15. This term was coined to describe tasks that were outsourced to a large group of people rather than being performed by an individual on-site (Howe 2006).
16. All these measures are only defined when there is at least one relevant document; it is meaningless to assess ranking performance when no document is relevant to the query.
17. The following paragraphs draw extensively from the description in Schäuble (1997). We thank Peter Schäuble for his kind permission to base our description on his work.
18. Assumed to be non-empty.
19. In practice, D_non(q_i) is rarely determined by an exhaustive relevance assessment process. After an approximation of D_rel(q_i) is obtained using the pooling method, it is instead assumed that D_non(q_i) = D \ D_rel(q_i).
20. Note that theoretically this is not an issue if systems produce exhaustive result lists that rank every item in the document collection; in such cases, under the assumption that every query has at least one relevant document, the relevant documents will inevitably turn up at some point in the ranking, leading to an average precision which may be very small, but is greater than zero.
21. Please note that P(Data|H0) is not the same as P(H0|Data), i.e., the probability that H0 is true given the data. Consequently, if P(Data|H0) is high, we can only conclude that there is not enough evidence for a difference, not that the two runs behave the same.
22. See Borlund (2003) for examples of the application of simulated work task situations.
23. These scenarios were used in the MultiMatch project, which developed a multilingual multimedia search engine prototype for access to, and personalised presentation of, cultural heritage information (http://www.multimatch.eu/).
24. A Latin square is an n × n table filled with n different symbols in such a way that each symbol occurs exactly once in each row and exactly once in each column.
25. The trec_eval package, developed by Chris Buckley of Sabir Research and used in both TREC and CLEF evaluations, is available free of charge: http://trec.nist.gov/trec_eval/
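Notes 16 and 20 above concern average precision: it is undefined when a query has no relevant documents, and relevant documents missing from a truncated ranking simply contribute zero. A minimal sketch of the measure, assuming binary relevance judgments (the function and document names are illustrative, not part of the chapter):

```python
def average_precision(ranking, relevant):
    """Average precision for one query: mean of precision@k over the
    ranks k at which a relevant document appears. Undefined (raised
    here) when the query has no relevant documents (cf. note 16)."""
    if not relevant:
        raise ValueError("AP is undefined when no document is relevant")
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    # Divide by |relevant|, not by the number of hits: relevant
    # documents absent from the ranking count as zero (cf. note 20).
    return sum(precisions) / len(relevant)

# Example: relevant docs d1 and d4 are ranked at positions 1 and 3,
# so AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision(["d1", "d2", "d4", "d3"], {"d1", "d4"})
```

With an exhaustive ranking every relevant document appears somewhere, so AP stays strictly positive, as note 20 observes.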
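Note 21's distinction between P(Data|H0) and P(H0|Data) can be made concrete with a paired randomization (permutation) test over per-topic scores of two runs; the chapter does not prescribe this particular test, so the sketch below is only one common choice (names are illustrative):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired randomization test on per-topic scores of two runs.
    Under H0 the run labels are exchangeable per topic, so we flip
    each topic's score pair at random and count how often the
    absolute mean difference is at least as large as observed.
    The returned fraction estimates P(Data | H0)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            extreme += 1
    return extreme / trials
```

A high returned value means only that the data are unsurprising under H0, i.e. there is not enough evidence for a difference; as note 21 stresses, it does not show that the two runs behave the same.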
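The Latin square of note 24, used to balance order effects in interactive experiments, has a standard cyclic construction; a minimal sketch (the function name is illustrative):

```python
def latin_square(n):
    """Cyclic n x n Latin square: row i is the symbols 0..n-1 rotated
    left by i positions, so each symbol occurs exactly once in each
    row and exactly once in each column."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]
```

In a user study, rows can index subjects, columns the task order, and symbols the system or topic assignment, so that every system is seen in every position equally often.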
References
Alonso O, Rose DE, Stewart B (2008) Crowdsourcing for relevance evaluation. ACM SIGIR Forum. 42(2): 9–15
Aslam JA, Yilmaz E, Pavlu V (2005) A geometric interpretation of R-precision and its correlation with average precision. In: Proc. 28th ACM SIGIR conference on research and development in information retrieval (SIGIR 2005). ACM Press: 573–574
Bailey P, Craswell N, Soboroff I, Thomas P, de Vries AP, Yilmaz E (2008) Relevance assessment: Are judges exchangeable and does it matter? In: Proc. 31st ACM SIGIR conference on research and development in information retrieval. ACM Press: 667–674
Borlund P (2003) The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Inf. Res. 8(3)
Borlund P (2009) User-centred evaluation of information retrieval systems. Chapter 2. In: Göker A, Davies J (eds.) Information Retrieval: Searching in the 21st Century. John Wiley & Sons: 21–37
Braschler M (2004a) CLEF 2003 – Overview of results. In: Comparative evaluation of multilingual information access systems. 4th workshop of the Cross-Language Evaluation Forum, CLEF 2003, Springer LNCS 3237: 44–63
Braschler M (2004b) Robust multilingual information retrieval. Doctoral Thesis, Institut interfacultaire d’informatique, Université de Neuchâtel
Braschler M, Gonzalo J (2009) Best practices in system and user-oriented multilingual information access. TrebleCLEF Project: http://www.trebleclef.eu/
Braschler M, Peters C (eds.) (2004) Cross-language evaluation forum. J. Inf. Retr. 7(1–2)
Buckley C, Voorhees E (2004) Retrieval evaluation with incomplete information. In Proc. 27th ACM SIGIR conference on research and development in information retrieval (SIGIR 2004). ACM Press: 25–32
Carterette B, Pavlu V, Kanoulas E, Aslam JA, Allan J (2008) Evaluation over thousands of queries. In: Proc. 31st ACM SIGIR conference on research and development in information retrieval (SIGIR 2008). ACM Press: 651–658
Cleverdon CW (1967) The Cranfield tests on index language devices. In: Aslib Proc. 19(6): 173–192
Cleverdon CW (1970) The effect of variations in relevance assessments in comparative experimental tests of index languages. Technical Report No. 3, Cranfield Institute of Technology, Cranfield, UK, 1970
Clough P, Sanderson M, Müller H (2004) The CLEF cross language image retrieval track (ImageCLEF) 2004. In: Image and Video Retrieval (CIVR 2004), Springer, LNCS 3115: 243–251
Clough P, Gonzalo J, Karlgren J (2010a) Creating re-useable log files for interactive CLIR, In SIGIR 2010 Workshop on the Simulation of Interaction (SimInt), Geneva, Switzerland, 23 July 2010
Clough P, Müller H, Sanderson M (2010b) Seven years of image retrieval evaluation. In: Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) (2010) ImageCLEF experimental evaluation in visual information retrieval. The Information Retrieval Series. Springer 32: 3–18
Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Proc. 21st ACM SIGIR conference on research and development in information retrieval (SIGIR 1998). ACM Press: 282–289
Croft B, Metzler D, Strohman T (2009) Evaluating search engines. In: Search engines: Information retrieval in practice 1st ed., Addison Wesley: 269–307
Dunlop M (2000) Reflections on MIRA: Interactive evaluation in information retrieval. J. Am. Soc. for Inf. Sci. 51(14): 1269–1274
Geva S, Kamps J, Peters C, Sakai T, Trotman A, Voorhees E (eds.) (2009) Proc. SIGIR 2009 workshop on the future of IR evaluation. SIGIR 2009, Boston USA. http://staff.science.uva.nl/~kamps/publications/2009/geva:futu09.pdf
Gonzalo J, Oard DW (2003). The CLEF 2002 interactive track. In: Advances in cross-language information retrieval. 3rd workshop of the Cross-Language Evaluation Forum, CLEF 2002. Springer LNCS 2785: 372–382
Hansen P (1998) Evaluation of IR user interface – Implications for user interface design. Human IT. http://www.hb.se/bhs/it
Harman DK (ed.) (1992) Evaluation issues in information retrieval. J. Inform. Process. & Manag. 28(4)
Harman DK (2005) The TREC test collections. In: Voorhees EM, Harman DK (eds.) TREC: experiment and evaluation in information retrieval, MIT Press, 2005
Howe J (2006). The rise of crowdsourcing. Wired, June 2006. http://www.wired.com/magazine/
Hull D (1993) Using statistical testing in the evaluation of retrieval experiments. In: Proc. 16th ACM SIGIR conference on research and development in information retrieval (SIGIR 1993). ACM Press: 329–338
Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: Proc. 23rd ACM SIGIR conference on research and development in information retrieval (SIGIR 2000). ACM Press: 41–48
Jansen BJ, Spink A (2006) How are we searching the World Wide Web? A comparison of nine search engine transaction logs. J. Inform. Process. & Manag. 42(1): 248–263
Kando N, Kuriyama K, Nozue T, Eguchi K, Kato H, Hidaka S (1999) Overview of IR tasks at the first NTCIR workshop. In Proc. First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition: 11–44
Karlgren J, Gonzalo J (2010) Interactive image retrieval. In Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) ImageCLEF. Experimental evaluation in visual information retrieval. The Information Retrieval Series. Springer 32: 117–139
Kelly D. (2009) Methods for evaluating interactive information retrieval systems with users. Foundations and Trends in Information Retrieval. 228 p.
Lesk ME, Salton G. (1969) Relevance assessments and retrieval system evaluation. J Inf. Storage and Retr. 4: 343–359
Leung C, Ip H (2000) Benchmarking for content-based visual information search. In Fourth International Conference on Visual Information Systems (VISUAL’2000), LNCS 1929, Springer-Verlag: 442–456
Mandl T (2009) Easy tasks dominate information retrieval evaluation results. In Lecture Notes in Informatics. Datenbanksysteme in Business, Technologie und Web (BTW): 107–116
Mandl T, Womser-Hacker C, Di Nunzio GM, Ferro N (2008) How robust are multilingual information retrieval systems? Proc. SAC 2008 - ACM Symposium on Applied Computing: 1132–1136
Mizzaro S (1998) Relevance: The whole history. J. of Am. Soc. for Inf. Sci. 48(9): 810–832
Müller H, Müller W, Squire DM, Marchand-Maillet S, Pun T (2001) Performance evaluation in content-based image retrieval: overview and proposals. Pattern Recognit. Lett. 22(5): 593–601
Müller H, Clough P, Deselaers Th, Caputo, B. (eds.) (2010) ImageCLEF. Experimental evaluation in visual information retrieval. The Information Retrieval Series, Springer 32: 495 p.
Peters C (ed.) (2000) First results of the CLEF 2000 cross-language text retrieval system evaluation campaign. Working notes for the CLEF 2000 workshop, ERCIM-00-W01: 142
Petrelli D (2007) On the role of user-centred evaluation in the advancement of interactive information retrieval. J. Inform. Process & Manag. 44(1): 22–38
van Rijsbergen CJ (1979) Evaluation. In: Information Retrieval 2nd ed., Butterworths
Robertson S (2006) On GMAP: and other transformations. Proc. 15th ACM international conference on information and knowledge management (CIKM ’06). ACM Press: 78–83
Robertson S (2008) On the history of evaluation in IR. J. of Inf. Sci. 34(4): 439–456
Salton G (1968) Automatic information organization and retrieval. McGraw Hill Text
Sanderson M, Braschler M (2009) Best practices for test collection creation and information retrieval system evaluation, TrebleCLEF Project: http://www.trebleclef.eu
Sanderson M, Zobel J (2005) Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proc. 28th ACM SIGIR conference on research and development in information retrieval (SIGIR 2005). ACM Press: 162–169
Sanderson M, Paramita M, Clough P, Kanoulas E (2010) Do user preferences and evaluation measures line up? In: Proc. 33rd ACM SIGIR conference on research and development in information retrieval (SIGIR 2010). ACM Press: 555–562
Saracevic T (1995) Evaluation of evaluation in information retrieval. In: 18th ACM SIGIR conference on research and development in information retrieval (SIGIR 1995), ACM Press: 138–146
Schäuble P (1997) Multimedia information retrieval: Content-based information retrieval from large text and audio databases, Kluwer Academic Publishers
Schamber L (1994) Relevance and information behaviour. Annual Review of Information Science and Technology, 29: 3–48
Smith JR (1998) Image retrieval evaluation. In: Proc IEEE Workshop on Content-based Access of Image and Video Libraries:112–113
Spärck Jones K (1981) Information Retrieval Experiment. Butterworth-Heinemann Ltd. 352 p.
Spärck Jones K, van Rijsbergen CJ (1975) Report on the need for and provision of an “ideal” information retrieval test collection. British Library Research and Development report 5266. Cambridge: Computer Laboratory, University of Cambridge
Su LT (1992) Evaluation measures for interactive information retrieval. J. Inform. Process & Manag. 28(4): 503–516
Tague-Sutcliffe J (1992) The pragmatics of information retrieval experimentation, revisited. J. Inform. Process & Manag. 28(4): 467–490
Tague-Sutcliffe J (ed.) (1996) Evaluation of information retrieval systems, J. Am. Soc. for Inf. Sci. 47(1)
Tomlinson S (2010) Sampling precision to depth 10000 at CLEF 2009. In: Multilingual information access evaluation Part I: Text retrieval experiments: 10th workshop of the Cross-Language Evaluation Forum, CLEF 2009, Springer, LNCS 6241: 78–85
Voorhees EM (2000) Variations in relevance judgments and the measurement of retrieval effectiveness. J Inform. Process. & Manag. 36(5): 697–716
Voorhees EM (2002) The philosophy of information retrieval evaluation. In: Evaluation of cross-language information retrieval systems: 2nd workshop of the Cross-Language Evaluation Forum, CLEF 2001, Springer, LNCS 2406: 355–370
Voorhees EM (2006) Overview of TREC 2006. Fifteenth text retrieval conference (TREC 2006) Proc. NIST Special Publication SP500-272. http://trec.nist.gov/pubs/trec15/
Voorhees EM, Buckley C (2002) The effect of topic set size on retrieval experiment error. In: Proc. 25th ACM SIGIR conference on research and development in information retrieval (SIGIR 2002). ACM Press: 316–323
Voorhees EM, Garofolo JS (2005) Retrieving noisy text. In Voorhees EM, Harman DK (eds.). TREC. experimentation and evaluation in information retrieval. MIT Press: 183–197
Voorhees EM, Harman DK (eds.) (2005) TREC: experiment and evaluation in information retrieval. The MIT Press, Cambridge, MA
Womser-Hacker C (2002) Multilingual topic generation within the CLEF 2001 experiments. In: Evaluation of cross-language information retrieval systems: 2nd workshop of the Cross-Language Evaluation Forum, CLEF 2001, Springer, LNCS 2406: 255–262
Zobel J (1998) How reliable are the results of large-scale information retrieval experiments? In: Proc. 21st ACM SIGIR conference on research and development in information retrieval (SIGIR 1998). ACM Press: 307–314
© 2012 Springer-Verlag Berlin Heidelberg
Peters, C., Braschler, M., Clough, P. (2012). Evaluation for Multilingual Information Retrieval Systems. In: Multilingual Information Retrieval. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23008-0_5
Print ISBN: 978-3-642-23007-3
Online ISBN: 978-3-642-23008-0