Abstract
Automatic language processing tools typically assign each term a so-called ‘weight’ that reflects the term's contribution to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the ‘POS contexts’ in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF and BM25 as baselines, show that integrating our POS-based term weights into retrieval always leads to gains (up to +33.7% over the baseline). Additional experiments with a different retrieval model as baseline (Language Model with Dirichlet prior smoothing) and our best-performing POS-based term weight show consistent retrieval gains across the whole smoothing range of the baseline.
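To make the idea concrete, the following is a minimal sketch of how a term weight could be derived from POS n-gram statistics and folded into a TF-IDF score. It is not one of the five computations proposed in the paper, which the abstract does not specify. The tag set `OPEN_CLASS`, the "open-class share" used to score a POS n-gram, the fallback weight of 1.0 for unseen terms, and all function names are assumptions made purely for illustration.

```python
import math
from collections import defaultdict

# Assumption: tags treated as content-bearing (open-class) for this sketch.
OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}


def pos_ngrams(tagged, n):
    """Yield (terms, tags) for every window of n consecutive (term, tag) pairs."""
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        yield [w for w, _ in window], [t for _, t in window]


def pos_term_weights(tagged_sentences, n=3):
    """Hypothetical POS-based weight: for each term, average the share of
    open-class tags over all POS n-grams in which the term occurs."""
    total = defaultdict(float)
    count = defaultdict(int)
    for sentence in tagged_sentences:
        for terms, tags in pos_ngrams(sentence, n):
            share = sum(tag in OPEN_CLASS for tag in tags) / n
            for term in set(terms):
                total[term] += share
                count[term] += 1
    return {term: total[term] / count[term] for term in total}


def tf_idf(term, doc, docs):
    """Plain TF-IDF with a smoothed IDF, log(1 + N / df)."""
    tf = doc.count(term)
    df = sum(term in d for d in docs)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(1 + len(docs) / df)


def score(query, doc, docs, weights):
    """Multiply each query term's TF-IDF score by its POS-based weight.
    Terms without a POS-based weight fall back to 1.0 (an assumption)."""
    return sum(weights.get(t, 1.0) * tf_idf(t, doc, docs) for t in query)


if __name__ == "__main__":
    tagged = [[("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
               ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]]
    weights = pos_term_weights(tagged, n=3)
    docs = [["the", "cat", "sat"], ["the", "mat"]]
    for d in docs:
        print(d, score(["cat"], d, docs, weights))
```

In this toy setting, a term that mostly appears in POS contexts dominated by nouns and verbs receives a higher weight than one that mostly appears among determiners and prepositions. The weight is computed once per term, independently of any query or document, and is then applied as a multiplier inside the matching function.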
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lioma, C., Blanco, R. (2009). Part of Speech Based Term Weighting for Information Retrieval. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds) Advances in Information Retrieval. ECIR 2009. Lecture Notes in Computer Science, vol 5478. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00958-7_37
DOI: https://doi.org/10.1007/978-3-642-00958-7_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00957-0
Online ISBN: 978-3-642-00958-7
eBook Packages: Computer Science (R0)