Journal of Intelligent Information Systems, Volume 34, Issue 3, pp 227–248

Evaluating information retrieval system performance based on user preference

  • Bing Zhou
  • Yiyu Yao


One of the challenges of modern information retrieval is to rank the most relevant documents at the top of a large system output. This calls for choosing proper methods to evaluate system performance. Traditional performance measures, such as precision and recall, are based on binary relevance judgments and are not appropriate for multi-grade relevance. The main objective of this paper is to propose a framework for system evaluation based on user preference of documents. It is shown that the notion of user preference is general and flexible for formally defining and interpreting multi-grade relevance. We review 12 evaluation methods and compare their similarities and differences. We find that the normalized distance performance measure is a good choice in terms of its sensitivity to document rank order, and that it gives higher credit to systems for their ability to retrieve highly relevant documents.
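The contrast drawn in the abstract between binary measures and a preference-based measure can be illustrated with a minimal sketch. The abstract itself does not state a formula, so the code assumes the commonly cited form of the normalized distance performance measure, (2·C⁻ + Cᵘ) / (2·C), where C⁻ counts document pairs the system orders against the user's preference, Cᵘ counts pairs the user orders but the system ties, and C counts all pairs on which the user has a strict preference. The function names, the 0 to 3 grading scale, and the sample judgments are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def scores_from_ranking(ranking):
    """Turn a ranked list (best first) into scores: earlier rank, higher score."""
    n = len(ranking)
    return {doc: n - i for i, doc in enumerate(ranking)}


def precision_at_k(ranking, grades, k, threshold=1):
    """Binary precision@k: a document is 'relevant' if its grade >= threshold."""
    top = ranking[:k]
    return sum(1 for doc in top if grades[doc] >= threshold) / k


def ndpm(system_scores, grades):
    """Assumed form of the normalized distance performance measure:
    (2 * contradictory + compatible) / (2 * preferred).
    0.0 means the system respects every user preference; 1.0 means it
    reverses every one of them."""
    contradictory = compatible = preferred = 0
    for a, b in combinations(grades, 2):
        user = grades[a] - grades[b]
        if user == 0:
            continue  # user is indifferent, so the pair is not counted
        preferred += 1
        system = system_scores[a] - system_scores[b]
        if system == 0:
            compatible += 1  # system ties a pair the user orders
        elif (user > 0) != (system > 0):
            contradictory += 1  # system reverses the user's order
    if preferred == 0:
        raise ValueError("user judgments express no strict preference")
    return (2 * contradictory + compatible) / (2 * preferred)


# Hypothetical judgments on a 0-3 scale (3 = highly relevant).
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2}
system_a = ["d1", "d4", "d2", "d3"]  # follows the user's preference exactly
system_b = ["d2", "d4", "d1", "d3"]  # puts a marginally relevant document first

for name, ranking in (("A", system_a), ("B", system_b)):
    p = precision_at_k(ranking, grades, k=3)
    d = ndpm(scores_from_ranking(ranking), grades)
    print(f"system {name}: precision@3={p:.2f}, ndpm={d:.2f}")
```

Under these assumptions both systems reach a precision@3 of 1.00, because the binary collapse treats grades 1, 2, and 3 alike, while the distance is 0.00 for system A and 0.50 for system B, since B reverses three of the six user-preferred pairs.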


Keywords: Multi-grade relevance; Evaluation methods; User preference



The authors are grateful for financial support from NSERC Canada, for constructive comments from Professor Zbigniew W. Ras during the ISMIS 2008 conference in Toronto, and for valuable suggestions from the anonymous reviewers.



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Department of Computer Science, University of Regina, Regina, Canada
