Abstract
In the information retrieval (IR) research community, it is commonly accepted that the independence assumptions in probabilistic IR models are inaccurate. The need to model term dependencies has been stressed in the literature. However, little or nothing has been said about the statistical nature of these dependencies. We investigate statistical measures of dependence for term-to-query pairs and for document term-to-term pairs, using several test collections. We show that document entropy is highly correlated with dependence, but that a high ratio of linearly uncorrelated pairs does not necessarily imply independent pairs. A robust IR model should therefore consider both dependence and independence phenomena.
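The distinction between linearly uncorrelated and independent pairs can be illustrated with a minimal sketch. The data below are hypothetical toy counts for two terms in three equally likely documents; they are not taken from the test collections used in the chapter, and the helper names are assumptions made for illustration only.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy data: (count of term a, count of term b) in three
# equally likely documents. The second count is a nonlinear function
# of the first: b = (a - 1) ** 2.
pairs = [(0, 1), (1, 0), (2, 1)]


def mean(values):
    return Fraction(sum(values), len(values))


def covariance(pairs):
    """Population covariance of the two counts, computed exactly."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in pairs) / len(pairs)


def is_independent(pairs):
    """Check P(x, y) == P(x) * P(y) for every combination of observed values."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return all(
        Fraction(joint[(x, y)], n) == Fraction(px[x], n) * Fraction(py[y], n)
        for x in px
        for y in py
    )


cov = covariance(pairs)
independent = is_independent(pairs)
print("covariance:", cov)
print("independent:", independent)
```

In this toy case the covariance is zero, so the linear correlation is zero as well, yet the joint distribution does not factorize: knowing the count of one term fully determines the count of the other.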
Notes
- 1.
This independence assumption is not a marginal one, since the probability of a term, given the knowledge of relevance and the query, is not obtained by summing over the marginal terms of the joint distribution (see [11] for details). It is unclear, however, whether the assumption refers to a pairwise or a mutual independence hypothesis.
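The difference between pairwise and mutual independence mentioned in note 1 can be shown with a small, standard construction. The sketch below assumes three hypothetical binary term-occurrence indicators, where the third is the exclusive-or of the first two; the variables and function names are illustrative and do not come from the chapter.

```python
from fractions import Fraction
from itertools import combinations, product

# Four equally likely outcomes (a, b, c) with c = a XOR b.
# a and b behave like fair coins; c is fully determined by them.
outcomes = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]


def prob(event):
    """Probability of an event under the uniform distribution on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def pairwise_independent():
    """Every pair of variables factorizes for every pair of values."""
    return all(
        prob(lambda o: o[i] == u and o[j] == v)
        == prob(lambda o: o[i] == u) * prob(lambda o: o[j] == v)
        for i, j in combinations(range(3), 2)
        for u, v in product((0, 1), repeat=2)
    )


def mutually_independent():
    """Pairwise factorization plus factorization of the full joint."""
    return pairwise_independent() and all(
        prob(lambda o: o == t)
        == prob(lambda o: o[0] == t[0])
        * prob(lambda o: o[1] == t[1])
        * prob(lambda o: o[2] == t[2])
        for t in product((0, 1), repeat=3)
    )


print("pairwise independent:", pairwise_independent())
print("mutually independent:", mutually_independent())
```

Here every pair of indicators is independent, but the three together are not, since the outcome (1, 1, 1) has probability zero rather than 1/8. A model that assumes only pairwise independence is therefore making a weaker claim than one that assumes mutual independence.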
References
Bendersky, M., Croft, W.B.: Modeling higher-order term dependencies in information retrieval using query hypergraphs. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 941–950. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348408
Choi, S., Choi, J., Yoo, S., Kim, H., Lee, Y.: Semantic concept-enriched dependence model for medical information retrieval. J. Biomed. Inform. 47, 18–27 (2014)
Galton, F.: Regression towards mediocrity in hereditary stature. J. Anthropol. Inst. G. B. Irel. 15, 246–263 (1886). http://dx.doi.org/10.2307/2841583
Huston, S., Culpepper, J.S., Croft, W.B.: Indexing word sequences for ranked retrieval. ACM Trans. Inf. Syst. (TOIS) 32(1), 3 (2014)
Jones, K.S., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval: development and comparative experiments. Inf. Process. Manage. 36(6), 779–808 (2000). http://dx.doi.org/10.1016/S0306-4573(00)00015-7
Lu, W., Robertson, S., MacFarlane, A.: Field-weighted XML retrieval based on BM25. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds.) INEX 2005. LNCS, vol. 3977, pp. 161–171. Springer, Heidelberg (2006)
Margulis, E.L.: N-Poisson document modelling. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1992, pp. 177–189. ACM, New York (1992). http://doi.acm.org/10.1145/133160.133195
Metzler, D., Croft, W.B.: A Markov random field model for term dependencies. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2005, pp. 472–479. ACM, New York (2005). http://doi.acm.org/10.1145/1076034.1076115
Mittendorf, E., Mateev, B., Schäuble, P.: Using the co-occurrence of words for retrieval weighting. Inf. Retr. 3(3), 243–251 (2000). http://dx.doi.org/10.1023/A:1026520926673
Rijsbergen, C.V.: A theoretical basis for the use of cooccurrence data in information retrieval. J. Documentation 33(2), 106–119 (1977). http://dx.doi.org/10.1108/eb026637
Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009). http://dx.doi.org/10.1561/1500000019
Roelleke, T.: Information Retrieval Models: Foundations & Relationships. Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers (2013). http://dx.doi.org/10.2200/S00494ED1V01Y201304ICR027
Roelleke, T., Wang, J., Robertson, S.: Probabilistic retrieval models and binary independence retrieval (BIR) model. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 2156–2160. Springer, US (2009)
Saini, B., Singh, V., Kumar, S.: Information retrieval models and searching methodologies: Survey. Information Retrieval 1(2) (2014)
Salton, G., Buckley, C., Yu, C.T.: An evaluation of term dependence models in information retrieval. In: Salton, G., Schneider, H.-J. (eds.) SIGIR 1982. LNCS, vol. 146, pp. 151–173. Springer, Heidelberg (1982)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988). http://dx.doi.org/10.1016/0306-4573(88)90021-0
Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman & Hall/CRC, New York (2007)
Song, R., Yu, L., Wen, J.R., Hon, H.W.: A proximity probabilistic model for information retrieval. Technical report, Citeseer (2011)
Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15, 88–103 (1904)
Acknowledgement
This research was partially supported by the Consejo Nacional de Ciencia y Tecnologia (CONACYT) through the scholarship grant No. 296232.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Fernández-Reyes, F.C., Valadez, J.H., Suárez, Y.G. (2015). Term Dependence Statistical Measures for Information Retrieval Tasks. In: Sidorov, G., Galicia-Haro, S. (eds) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science, vol 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_7
DOI: https://doi.org/10.1007/978-3-319-27060-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27059-3
Online ISBN: 978-3-319-27060-9
eBook Packages: Computer Science (R0)