Abstract
This paper compares 14 metrics designed for information retrieval evaluation with graded relevance, together with 10 traditional metrics based on binary relevance, in terms of reliability and resemblance of system rankings. More specifically, we use two test collections, with the runs submitted to the Chinese IR and English IR tasks in the NTCIR-3 CLIR track, to examine the metrics using the methods proposed by Buckley and Voorhees and by Voorhees and Buckley, as well as Kendall's rank correlation. Our results show that AnDCG_l and nDCG_l ((Average) Normalised Discounted Cumulative Gain at document cut-off l) are good metrics, provided that l is large. However, if one wants to avoid the parameter l altogether, or if one requires a metric that closely resembles TREC Average Precision, then Q-measure appears to be the best choice.
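To make the two headline metrics concrete, the following is a minimal Python sketch of nDCG at document cut-off l (Järvelin and Kekäläinen) and of Q-measure (Sakai) in its unparameterised blended-ratio form. The gain values, the log-base-2 discount, and all function names are illustrative assumptions; the paper's exact definitions (e.g. the discount base b in the original nDCG) may differ.

```python
import math

def ndcg_at_l(gains, ideal_gains, l):
    """nDCG at document cut-off l, using a common log2 discount.
    gains: gain of the document at each rank of the system's run.
    ideal_gains: gains of all relevant documents, best first."""
    def dcg(gs):
        # Ranks 1 and 2 receive no discount under a log2 discount.
        return sum(g / max(1.0, math.log2(r))
                   for r, g in enumerate(gs[:l], start=1))
    return dcg(gains) / dcg(ideal_gains)

def q_measure(gains, ideal_gains, R):
    """Q-measure (unparameterised form): the mean, over the R relevant
    documents, of the blended ratio (count(r) + cg(r)) / (r + cgI(r)),
    where cg is cumulative gain and cgI its ideal counterpart."""
    cg = cgi = count = total = 0.0
    for r, g in enumerate(gains, start=1):
        cg += g
        # cgI plateaus once the ideal ranked list is exhausted.
        cgi += ideal_gains[r - 1] if r <= len(ideal_gains) else 0.0
        if g > 0:                       # rank r holds a relevant document
            count += 1
            total += (count + cg) / (r + cgi)
    return total / R

# Example: gain 3 = highly relevant, ..., 0 = nonrelevant.
gains = [3, 0, 2, 0, 1]
ideal = [3, 2, 1]                       # all relevant documents, best first
print(ndcg_at_l(gains, ideal, l=5))    # nDCG at cut-off 5
print(q_measure(gains, ideal, R=3))    # Q-measure
```

Note how Q-measure, like Average Precision, averages a precision-style ratio over the relevant documents only, which is why it needs no cut-off parameter l.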
References
Buckley, C., Voorhees, E.M.: Evaluating Evaluation Measure Stability. In: ACM SIGIR 2000 Proceedings, pp. 33–40 (2000)
Chen, K.-H., et al.: Overview of CLIR Task at the Third NTCIR Workshop. In: NTCIR-3 Proceedings (2003)
Della Mea, V., Mizzaro, S.: Measuring Retrieval Effectiveness: A New Proposal and a First Experimental Validation. Journal of the American Society for Information Science and Technology 55(6), 530–543 (2004)
Järvelin, K., Kekäläinen, J.: Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)
Kekäläinen, J.: Binary and Graded Relevance in IR Evaluations – Comparison of the Effects on Ranking of IR Systems. Information Processing and Management 41, 1019–1033 (2005)
Sakai, T.: Average Gain Ratio: A Simple Retrieval Performance Measure for Evaluation with Multiple Relevance Levels. In: ACM SIGIR 2003 Proceedings, pp. 417–418 (2003)
Sakai, T.: New Performance Metrics based on Multigrade Relevance: Their Application to Question Answering. In: NTCIR-4 Proceedings (2004), http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/OPEN/NTCIR4-OPEN-SakaiTrev.pdf
Sakai, T.: Ranking the NTCIR Systems based on Multigrade Relevance. In: Myaeng, S.-H., Zhou, M., Wong, K.-F., Zhang, H.-J. (eds.) AIRS 2004. LNCS, vol. 3411, pp. 251–262. Springer, Heidelberg (2005)
Sakai, T.: A Note on the Reliability of Japanese Question Answering Evaluation. IPSJ SIG Technical Reports FI-77-7, 57–64 (2004)
Sakai, T.: The Effect of Topic Sampling in Sensitivity Comparisons of Information Retrieval Metrics. IPSJ SIG Technical Reports FI-80/NL-169 (2005) (to appear)
Soboroff, I., Voorhees, E.: Private communication (2005)
Voorhees, E.M., Buckley, C.: The Effect of Topic Set Size on Retrieval Experiment Error. In: ACM SIGIR 2002 Proceedings, pp. 316–323 (2002)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Sakai, T. (2005). The Reliability of Metrics Based on Graded Relevance. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds.) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol. 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_1
DOI: https://doi.org/10.1007/11562382_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2