Document Performance Prediction for Automatic Text Classification

Penha, Gustavo; Campos, Raphael; Canuto, Sérgio; Gonçalves, Marcos André; Santos, Rodrygo L. T.

doi:10.1007/978-3-030-15719-7_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11438)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Query performance prediction (QPP) is a fundamental task in information retrieval, which concerns predicting the effectiveness of a ranking model for a given query in the absence of relevance information. Despite being an active research area, this task has not yet been explored in the context of automatic text classification. In this paper, we study the task of predicting the effectiveness of a classifier for a given document, which we refer to as document performance prediction (DPP). Our experiments on several text classification datasets for both categorization and sentiment analysis attest to the effectiveness and complementarity of several DPP predictors inspired by related QPP approaches. Finally, we also explore the usefulness of DPP for improving the classification itself, by using these predictors as additional features in a classification ensemble.
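As a rough illustration of the idea in the abstract, the sketch below computes a few confidence-style signals from a classifier's per-class scores for one document and appends them to that document's feature vector. It is only a sketch under stated assumptions: the specific signals (top score, margin between the two best classes, score dispersion) and the names `dpp_features` and `augment` are hypothetical choices loosely modelled on score-dispersion ideas from QPP, not the predictors or the ensemble evaluated in the paper.

```python
from statistics import pstdev
from typing import Dict, List


def dpp_features(class_scores: List[float]) -> Dict[str, float]:
    """Illustrative per-document predictors (hypothetical, not the paper's).

    class_scores holds one score per class, as produced by some base
    classifier for a single document.
    """
    if len(class_scores) < 2:
        raise ValueError("need scores for at least two classes")
    ranked = sorted(class_scores, reverse=True)
    return {
        "max_score": ranked[0],             # score of the top-ranked class
        "margin": ranked[0] - ranked[1],    # gap between the two best classes
        "score_std": pstdev(class_scores),  # dispersion of the class scores
    }


def augment(doc_features: List[float], class_scores: List[float]) -> List[float]:
    """Append the predictors to a document's feature vector, as an
    ensemble's meta-level input might do (assumed design)."""
    feats = dpp_features(class_scores)
    return doc_features + [feats["max_score"], feats["margin"], feats["score_std"]]
```

Under these assumptions, a document whose scores are nearly uniform across classes yields a small margin and low dispersion, which a downstream model could treat as a sign that the base classifier is likely to be wrong on that document.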

Acknowledgements

Work partially funded by project MASWeb (FAPEMIG APQ-01400-14) and by the authors’ individual grants from CNPq and FAPEMIG.

Author information

Authors and Affiliations

Delft University of Technology, Delft, The Netherlands
Gustavo Penha
Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Raphael Campos, Sérgio Canuto, Marcos André Gonçalves & Rodrygo L. T. Santos

Corresponding author

Correspondence to Rodrygo L. T. Santos.

Editor information

Editors and Affiliations

University of Strathclyde, Glasgow, UK
Leif Azzopardi
Bauhaus Universität Weimar, Weimar, Germany
Benno Stein
Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany
Philipp Mayr
Delft University of Technology, Delft, The Netherlands
Claudia Hauff
University of Twente, Enschede, The Netherlands
Djoerd Hiemstra

About this paper

Cite this paper

Penha, G., Campos, R., Canuto, S., Gonçalves, M.A., Santos, R.L.T. (2019). Document Performance Prediction for Automatic Text Classification. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11438. Springer, Cham. https://doi.org/10.1007/978-3-030-15719-7_17

DOI: https://doi.org/10.1007/978-3-030-15719-7_17
Published: 07 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15718-0
Online ISBN: 978-3-030-15719-7
eBook Packages: Computer Science, Computer Science (R0)
