Abstract
In modern large information retrieval (IR) environments, the number of documents relevant to a request may easily exceed the number of documents a user is willing to examine. It is therefore desirable to rank highly relevant documents first in search results, and developing retrieval methods for this purpose requires evaluating them accordingly. However, most IR method evaluations are based on rather liberal, binary relevance assessments, so differences between sloppy and excellent IR methods may go unobserved. An alternative is to employ graded relevance assessments in evaluation. The present paper discusses graded relevance, test collections providing graded assessments, and evaluation metrics based on graded relevance assessments. We also examine the effects of using graded relevance assessments in retrieval evaluation and review some evaluation results based on graded relevance. We find that graded relevance provides new insight into IR phenomena and affects the relative merits of IR methods.
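The evaluation metrics the abstract refers to include the cumulated-gain family, notably discounted cumulated gain (DCG) and its normalized form (nDCG). As a minimal sketch, assuming a four-point relevance scale (0–3) and a logarithm base b = 2 for the rank discount, the idea can be illustrated as follows; the gain values and example ranking here are invented for illustration, not taken from the paper.

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain: the gain at rank i is added
    undiscounted for ranks i < b, and divided by log_b(i) otherwise,
    so highly relevant documents found late contribute less."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < b else g / math.log(i, b)
    return total

def ndcg(gains, ideal_gains, b=2):
    """Normalize by the DCG of the ideal (best possible) ranking,
    yielding a score in [0, 1]."""
    return dcg(gains, b) / dcg(ideal_gains, b)

# Graded gains (0-3) of a ranked result list, best first in the ideal case:
run = [3, 0, 2, 1, 0]
ideal = sorted(run, reverse=True)   # [3, 2, 1, 0, 0]
print(round(ndcg(run, ideal), 3))   # prints 0.846
```

Under binary assessments the two middle documents (grades 2 and 1) would be indistinguishable from the top one; the graded gains make the ranking quality difference visible in the score.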
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Järvelin, K. (2013). Test Collections and Evaluation Metrics Based on Graded Relevance. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_27
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2