Classical Retrieval Models

  • Donald Metzler
Chapter
Part of The Information Retrieval Series book series (INRE, volume 27)

Abstract

This chapter provides a detailed treatment of classical information retrieval models. A distinction is made between bag of words models and those models that go beyond the bag of words assumption. The bag of words models covered include the binary independence retrieval model, the 2-Poisson model, the BM25 model, unigram language models, and several other bag of words models. The models covered that go beyond the bag of words assumption include n-gram language models, the Indri inference network model, as well as several other previously proposed models. The chapter concludes with a discussion of the current state-of-the-art retrieval models, including their pros and cons.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. Natural Language Group, Information Sciences Institute, University of Southern California, Marina del Rey, USA