Information Retrieval Journal, Volume 22, Issue 6, pp. 581–619

A comparison of filtering evaluation metrics based on formal constraints

  • Enrique Amigó
  • Julio Gonzalo
  • Felisa Verdejo
  • Damiano Spina


Although document filtering is simple to define, a wide range of evaluation measures has been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, in order to understand whether, why and when each metric is appropriate, and to achieve a better understanding of the similarities and differences between metrics. Our formal study leads to a typology of measures for document filtering based on (1) a formal constraint that must be satisfied by any suitable evaluation measure, and (2) a set of three mutually exclusive formal properties that help to understand the fundamental differences between measures and to determine which ones are more appropriate depending on the application scenario. As far as we know, this is the first in-depth study of how filtering metrics can be categorized according to their appropriateness for different scenarios. Two main findings derive from our study. First, not every measure satisfies the basic constraint; however, problematic measures can be adapted using smoothing techniques that make them compliant with the basic constraint while preserving their original properties. Second, all metrics except one can be grouped into three families, each satisfying one of three mutually exclusive formal properties. In cases where the application scenario is clearly defined, this classification should help in choosing an adequate evaluation measure. The exception is the Reliability/Sensitivity metric pair, which does not fit into any of the three families but has two valuable empirical properties: it is strict (i.e. a good result according to Reliability/Sensitivity ensures a good result according to all other metrics) and it is more robust than all other measures considered in our study.
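To illustrate the kind of smoothing adaptation mentioned above, the following is a minimal sketch (not the paper's exact formulation) of add-alpha (Laplace-style) smoothing applied to the contingency counts behind F-measure. The function name `f1` and the `alpha` parameter are illustrative choices: with `alpha=0` the measure is undefined when the system returns no documents, while any positive `alpha` keeps it well-defined without substantially changing its behavior on non-degenerate outputs.

```python
def f1(tp, fp, fn, alpha=0.0):
    """F1 over a binary filtering contingency table (tp, fp, fn),
    with optional add-alpha smoothing of the counts.

    alpha=0 recovers the unsmoothed measure, which is undefined
    (division by zero) when tp + fp == 0 or tp + fn == 0.
    """
    # Smooth each count; alpha > 0 guarantees nonzero denominators.
    tp_s, fp_s, fn_s = tp + alpha, fp + alpha, fn + alpha
    precision = tp_s / (tp_s + fp_s)
    recall = tp_s / (tp_s + fn_s)
    return 2 * precision * recall / (precision + recall)


# A system that filters out everything (tp=0, fp=0) breaks the
# unsmoothed measure, but the smoothed variant stays defined:
# f1(0, 0, 5) raises ZeroDivisionError; f1(0, 0, 5, alpha=1.0) does not.
```

The same recipe (adding a small constant to the cells of the contingency table) can be applied to other ratio-based measures whose denominators can vanish on degenerate system outputs.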


Keywords: Document filtering · Evaluation metrics · Evaluation methodologies



Funding was provided by Secretaría de Estado de Investigación, Desarrollo e Innovación, Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. TIN2015-71785-R, project Vemodalen).



Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. NLP & IR Research Group, UNED, Madrid, Spain
  2. Computer Science and Information Technologies, RMIT, Melbourne, Australia
