Abstract
Evaluation is a central concern in information retrieval (including web search), and the choice of evaluation metric must be considered carefully. In this paper, we propose a new method for measuring the stability and discrimination power of a metric, a problem first investigated by Buckley and Voorhees. The advantage of the proposed method is that both aspects can be measured together in a systematic manner. Five metrics are tested in the study: average precision over all relevant documents, recall-level precision, normalized discounted cumulative gain, precision at the 10-document level, and reciprocal rank. Experimental results show that normalized discounted cumulative gain performs best, followed by average precision over all relevant documents, recall-level precision, and precision at the 10-document level, while reciprocal rank performs worst.
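The abstract names five metrics; the sketch below illustrates one common way of computing them for a single ranked list. It is not the authors' implementation: it assumes binary relevance for all metrics except nDCG (which uses graded relevance and the usual log2(rank + 1) discount), and the function names and parameters (`ranking`, `ideal_ranking`, `total_relevant`) are illustrative.

```python
import math

def average_precision(ranking, total_relevant):
    """Average precision over all relevant documents (binary relevance).

    `ranking` lists relevance values in rank order; `total_relevant` is the
    number of relevant documents in the collection for this topic.
    """
    hits, precisions = 0, []
    for i, rel in enumerate(ranking, start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def r_precision(ranking, total_relevant):
    """Recall-level precision: precision at rank R = total_relevant."""
    if total_relevant == 0:
        return 0.0
    return sum(1 for rel in ranking[:total_relevant] if rel > 0) / total_relevant

def ndcg(ranking, ideal_ranking, k=None):
    """Normalized discounted cumulative gain with the log2(rank + 1) discount."""
    def dcg(rels):
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))
    k = k or len(ranking)
    ideal = dcg(sorted(ideal_ranking, reverse=True)[:k])
    return dcg(ranking[:k]) / ideal if ideal > 0 else 0.0

def precision_at_10(ranking):
    """Precision at the 10-document cutoff."""
    return sum(1 for rel in ranking[:10] if rel > 0) / 10

def reciprocal_rank(ranking):
    """Reciprocal of the rank of the first relevant document."""
    for i, rel in enumerate(ranking, start=1):
        if rel > 0:
            return 1 / i
    return 0.0
```

A study of metric stability and discrimination power would typically compute such per-topic scores for many systems and topics and then compare how consistently the metrics rank the systems; the details of that comparison are specific to the paper's proposed method.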
References
Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: Proceedings of ACM SIGIR Conference, Athens, Greece, pp. 33–40 (July 2000)
Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)
Lin, W., Hauptmann, A.: Revisiting the effect of topic set size on retrieval error. In: Proceedings of ACM SIGIR Conference, Salvador, Brazil, pp. 637–638 (August 2005)
Robertson, S., Kanoulas, E.: On per-topic variance in IR evaluation. In: Proceedings of ACM SIGIR Conference, Portland, USA, pp. 891–900 (August 2012)
Sakai, T.: Evaluating evaluation metrics based on the bootstrap. In: Proceedings of ACM SIGIR Conference, Seattle, USA, pp. 525–532 (August 2006)
Sakai, T.: On the reliability of information retrieval metrics based on graded relevance. Information Processing & Management 43(2), 531–548 (2007)
Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment error. In: Proceedings of ACM SIGIR Conference, Tampere, Finland, pp. 316–323 (August 2002)
Zhou, K., Cummins, R., Lalmas, M., Jose, J.: Evaluating aggregated search pages. In: Proceedings of ACM SIGIR Conference, Portland, USA, pp. 115–124 (August 2012)
Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In: Proceedings of ACM SIGIR Conference, Melbourne, Australia, pp. 307–314 (August 1998)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Shi, H., Tan, Y., Zhu, X., Wu, S. (2013). Measuring Stability and Discrimination Power of Metrics in Information Retrieval Evaluation. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2013. IDEAL 2013. Lecture Notes in Computer Science, vol 8206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41278-3_2
Print ISBN: 978-3-642-41277-6
Online ISBN: 978-3-642-41278-3