Term Dependence Statistical Measures for Information Retrieval Tasks

Fernández-Reyes, Francis C.; Valadez, Jorge Hermosillo; Suárez, Yasel Garcés

doi:10.1007/978-3-319-27060-9_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9413)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

In the information retrieval (IR) research community, it is commonly accepted that the independence assumptions in probabilistic IR models are inaccurate. The need to model term dependencies has been stressed in the literature. However, little has been said about the statistical nature of these dependencies. We investigate statistical measures of dependence for term-to-query pairs and for document term-to-term pairs, using several test collections. We show that document entropy is highly correlated with dependence, but that a high ratio of linearly uncorrelated pairs does not necessarily imply that those pairs are independent. A robust IR model should therefore consider both dependence and independence phenomena.
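
The point that linearly uncorrelated pairs need not be independent can be illustrated with a minimal sketch. It is not the procedure used in the chapter: the toy term counts `term_a` and `term_b`, the helper names, and the choice of mutual information as the dependence measure are all assumptions made for illustration only.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def mutual_information(xs, ys):
    """Empirical mutual information in bits; it is zero only under independence."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )

# Hypothetical counts of two terms over 12 documents (illustrative data only).
# term_b is a deterministic, non-linear function of term_a.
term_a = [0, 1, 2] * 4
term_b = [(a - 1) ** 2 for a in term_a]

r = pearson(term_a, term_b)
mi = mutual_information(term_a, term_b)
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} bits")
```

In this toy case the linear correlation vanishes although `term_b` is fully determined by `term_a`, so a test based on linear correlation alone would wrongly report the pair as unrelated.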

Notes

1. This independence assumption is not a marginal one, since the probability of a term, given the knowledge of relevance and the query, is not obtained by summation over the marginal terms of the joint distribution (see [11] for details). It is unclear, however, whether the assumption refers to a pairwise or a mutual independence hypothesis.
2. http://ir.dcs.gla.ac.uk/resources/test_collections/.
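
The distinction drawn in note 1 between pairwise and mutual independence can be made concrete with the standard XOR construction. This is a textbook illustration, not taken from the chapter: the three binary variables can be read as hypothetical term-occurrence indicators, where the third is the exclusive-or of the first two.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes (x, y, z) with z = x XOR y.
# x and y are fair, independent bits; z is determined by them.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
p = Fraction(1, len(outcomes))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return sum(p for o in outcomes if event(o))

# Pairwise independence: for binary variables it suffices to check
# P(A=1, B=1) == P(A=1) * P(B=1) for each pair of variables.
pairwise_ok = all(
    prob(lambda o: o[i] == 1 and o[j] == 1)
    == prob(lambda o: o[i] == 1) * prob(lambda o: o[j] == 1)
    for i, j in ((0, 1), (0, 2), (1, 2))
)

# Mutual independence would also require the triple joint to factorise.
joint_all = prob(lambda o: o == (1, 1, 1))
product_all = (
    prob(lambda o: o[0] == 1)
    * prob(lambda o: o[1] == 1)
    * prob(lambda o: o[2] == 1)
)
mutually_ok = joint_all == product_all

print("pairwise independent:", pairwise_ok)
print("mutually independent:", mutually_ok)
```

Every pair of variables passes the independence check, yet the three together do not, which is why the two hypotheses in note 1 are not interchangeable.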

References

1. Bendersky, M., Croft, W.B.: Modeling higher-order term dependencies in information retrieval using query hypergraphs. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 941–950. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348408
2. Choi, S., Choi, J., Yoo, S., Kim, H., Lee, Y.: Semantic concept-enriched dependence model for medical information retrieval. J. Biomed. Inform. 47, 18–27 (2014)
3. Galton, F.: Regression towards mediocrity in hereditary stature. J. Anthropol. Inst. G. B. Irel. 15, 246–263 (1886). http://dx.doi.org/10.2307/2841583
4. Huston, S., Culpepper, J.S., Croft, W.B.: Indexing word sequences for ranked retrieval. ACM Trans. Inf. Syst. (TOIS) 32(1), 3 (2014)
5. Jones, K.S., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval: development and comparative experiments. Inf. Process. Manage. 36(6), 779–808 (2000). http://dx.doi.org/10.1016/S0306-4573(00)00015-7
6. Lu, W., Robertson, S., MacFarlane, A.: Field-weighted XML retrieval based on BM25. In: Fuhr, N., Lalmas, M., Malik, S., Kazai, G. (eds.) INEX 2005. LNCS, vol. 3977, pp. 161–171. Springer, Heidelberg (2006)
7. Margulis, E.L.: N-Poisson document modelling. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1992, pp. 177–189. ACM, New York (1992). http://doi.acm.org/10.1145/133160.133195
8. Metzler, D., Croft, W.B.: A Markov random field model for term dependencies. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2005, pp. 472–479. ACM, New York (2005). http://doi.acm.org/10.1145/1076034.1076115
9. Mittendorf, E., Mateev, B., Schäuble, P.: Using the co-occurrence of words for retrieval weighting. Inf. Retr. 3(3), 243–251 (2000). http://dx.doi.org/10.1023/A:1026520926673
10. Rijsbergen, C.V.: A theoretical basis for the use of cooccurrence data in information retrieval. J. Documentation 33(2), 106–119 (1977). http://dx.doi.org/10.1108/eb026637
11. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3(4), 333–389 (2009). http://dx.doi.org/10.1561/1500000019
12. Roelleke, T.: Information Retrieval Models: Foundations & Relationships. Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers (2013). http://dx.doi.org/10.2200/S00494ED1V01Y201304ICR027
13. Roelleke, T., Wang, J., Robertson, S.: Probabilistic retrieval models and binary independence retrieval BIR model. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 2156–2160. Springer, US (2009)
14. Saini, B., Singh, V., Kumar, S.: Information retrieval models and searching methodologies: Survey. Information Retrieval 1(2) (2014)
15. Salton, G., Buckley, C., Yu, C.T.: An evaluation of term dependence models in information retrieval. In: Salton, G., Schneider, H.-J. (eds.) SIGIR 1982. LNCS, vol. 146, pp. 151–173. Springer, Heidelberg (1982)
16. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988). http://dx.doi.org/10.1016/0306-4573(88)90021-0
17. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures, 4th edn. Chapman & Hall/CRC, New York (2007)
18. Song, R., Yu, L., Wen, J.R., Hon, H.W.: A proximity probabilistic model for information retrieval. Technical report, Citeseer (2011)
19. Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15, 88–103 (1904)

Acknowledgement

This research was partially supported by the Consejo Nacional de Ciencia y Tecnologia (CONACYT) through the scholarship grant No. 296232.

Author information

Authors and Affiliations

Universidad Autónoma del Estado de Morelos, 62209, Cuernavaca, Mexico
Francis C. Fernández-Reyes, Jorge Hermosillo Valadez & Yasel Garcés Suárez

Corresponding author

Correspondence to Jorge Hermosillo Valadez.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico City, Mexico
Grigori Sidorov
Facultad de ciencias, Universidad Autónoma Nacional, México, Distrito Federal, Mexico
Sofía N. Galicia-Haro

About this paper

Cite this paper

Fernández-Reyes, F.C., Valadez, J.H., Suárez, Y.G. (2015). Term Dependence Statistical Measures for Information Retrieval Tasks. In: Sidorov, G., Galicia-Haro, S. (eds) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science, vol 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_7

DOI: https://doi.org/10.1007/978-3-319-27060-9_7
Published: 30 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27059-3
Online ISBN: 978-3-319-27060-9
eBook Packages: Computer Science, Computer Science (R0)
