A Log-Logistic Model-Based Interpretation of TF Normalization of BM25

  • Conference paper
Advances in Information Retrieval (ECIR 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7224)

Abstract

The effectiveness of the BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived from the classic probabilistic retrieval model, it has so far been unclear how to interpret k1 probabilistically, which makes it hard to optimize its setting. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1 based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimating k1 based solely on the current collection, without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experimental results show that the proposed approaches can accurately predict the optimal k1 without training data and achieve retrieval performance better than or comparable to a well-tuned BM25 in which k1 is optimized on training data.
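
The abstract does not spell out the TF normalization formula, so the following is only a minimal sketch under stated assumptions. It assumes the commonly used BM25 TF component tf(k1 + 1)/(tf + k1) with document length normalization left out (b = 0), and the textbook log-logistic CDF with scale alpha and shape beta. The function names `bm25_tf` and `log_logistic_cdf` are hypothetical, and the sketch does not reproduce the paper's derivation or its k1 estimators. It merely illustrates why a log-logistic shape is a natural fit: with beta = 1 and alpha = k1, the CDF equals tf/(tf + k1), which is the BM25 TF component up to the constant factor k1 + 1.

```python
def bm25_tf(tf: float, k1: float = 1.2) -> float:
    """Assumed BM25 TF component with length normalization omitted (b = 0)."""
    return tf * (k1 + 1.0) / (tf + k1)


def log_logistic_cdf(x: float, alpha: float, beta: float = 1.0) -> float:
    """CDF of a log-logistic distribution with scale alpha and shape beta."""
    return x**beta / (alpha**beta + x**beta)


if __name__ == "__main__":
    k1 = 1.2  # a conventional default, chosen here only for illustration
    for tf in (1, 2, 5, 10, 100):
        score = bm25_tf(tf, k1)
        # Rescaling the CDF by (k1 + 1) recovers the BM25 TF component.
        rescaled = (k1 + 1.0) * log_logistic_cdf(tf, alpha=k1, beta=1.0)
        print(f"tf={tf:>3}  bm25_tf={score:.4f}  rescaled_cdf={rescaled:.4f}")
```

Under these assumptions the score grows sub-linearly in tf and saturates below k1 + 1, so a smaller k1 makes the component saturate faster while a larger k1 keeps it closer to linear for small tf.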

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lv, Y., Zhai, C. (2012). A Log-Logistic Model-Based Interpretation of TF Normalization of BM25. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_21

  • DOI: https://doi.org/10.1007/978-3-642-28997-2_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28996-5

  • Online ISBN: 978-3-642-28997-2

  • eBook Packages: Computer Science, Computer Science (R0)
