Document Performance Prediction for Automatic Text Classification

  • Gustavo Penha
  • Raphael Campos
  • Sérgio Canuto
  • Marcos André Gonçalves
  • Rodrygo L. T. Santos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11438)


Query performance prediction (QPP) is a fundamental task in information retrieval: predicting the effectiveness of a ranking model for a given query in the absence of relevance information. Although QPP is an active research area, it has not yet been explored in the context of automatic text classification. In this paper, we study the task of predicting the effectiveness of a classifier for a given document, which we refer to as document performance prediction (DPP). Our experiments on several text classification datasets, covering both categorization and sentiment analysis, attest to the effectiveness and complementarity of several DPP predictors inspired by related QPP approaches. Finally, we explore the usefulness of these predictors for improving the classification itself, by using them as additional features in a classification ensemble.
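As a rough illustration of the idea, the sketch below derives a few confidence-style signals from one document's class probabilities and appends them to a feature vector for an ensemble. It is not the set of predictors evaluated in the paper: the function names `dpp_features` and `augment`, the input `probs`, and the choice of signals (maximum probability, top-two margin, normalized entropy, score spread) are all assumptions made for this example.

```python
import math
from statistics import pstdev


def dpp_features(probs):
    """Compute simple confidence-style signals for one document.

    `probs` is a hypothetical list of class probabilities (summing to 1)
    produced by any base classifier for a single document.
    """
    ranked = sorted(probs, reverse=True)
    # Highest class probability: how strongly the classifier commits.
    max_prob = ranked[0]
    # Gap between the two most likely classes.
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]
    # Shannon entropy normalized to [0, 1]; higher means more uncertainty.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    norm_entropy = entropy / math.log(len(probs)) if len(probs) > 1 else 0.0
    # Spread of the scores, loosely analogous to score-deviation QPP signals.
    spread = pstdev(probs)
    return [max_prob, margin, norm_entropy, spread]


def augment(base_features, probs):
    """Append the signals to a feature vector fed to an ensemble (assumed setup)."""
    return list(base_features) + dpp_features(probs)


confident = dpp_features([0.90, 0.05, 0.05])
uncertain = dpp_features([0.34, 0.33, 0.33])
```

Under these assumptions, a document the classifier is sure about yields a high maximum probability and low entropy, while a near-uniform distribution yields the opposite, so a downstream ensemble could learn to discount the base classifier on the latter.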


Keywords: Performance prediction · Automatic text classification



Work partially funded by project MASWeb (FAPEMIG APQ-01400-14) and by the authors’ individual grants from CNPq and FAPEMIG.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Gustavo Penha (1)
  • Raphael Campos (2)
  • Sérgio Canuto (2)
  • Marcos André Gonçalves (2)
  • Rodrygo L. T. Santos (2)
  1. Delft University of Technology, Delft, The Netherlands
  2. Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
