Term Frequency Normalization via Pareto Distributions

  • Conference paper
Advances in Information Retrieval (ECIR 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2291)

Abstract

We exploit the Feller-Pareto characterization of the classical Pareto distribution to derive a law relating the probability of a given term frequency in a document to the document's length. A similar law was derived by Mandelbrot. We exploit the Paretian distribution to obtain a term frequency normalization that substitutes for the actual term frequency in the probabilistic models of Information Retrieval recently introduced in TREC-10. Preliminary results show that the single parameter of the framework can be eliminated in favour of the term frequency normalization derived from the Paretian law.
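
For readers unfamiliar with the classical Pareto distribution named in the abstract, the following is a minimal sketch of its survival function, P(X > x) = (x / sigma)^(-alpha) for x >= sigma. The names `sigma` (scale) and `alpha` (shape) follow a common textbook convention and are assumptions here, not the paper's notation. The sketch illustrates only the heavy-tailed law itself; it is not the term frequency normalization derived in the paper, which the abstract does not state.

```python
import math


def pareto_survival(x: float, sigma: float, alpha: float) -> float:
    """P(X > x) for a classical Pareto variable.

    sigma (scale) and alpha (shape) are illustrative parameter names,
    assumed here; they are not taken from the paper.
    """
    if sigma <= 0 or alpha <= 0:
        raise ValueError("sigma and alpha must be positive")
    if x <= sigma:
        # The distribution has no mass at or below the scale parameter.
        return 1.0
    return (x / sigma) ** (-alpha)


def log_log_slope(x1: float, x2: float, sigma: float, alpha: float) -> float:
    """Slope of log P(X > x) against log x between two points above sigma."""
    y1 = math.log(pareto_survival(x1, sigma, alpha))
    y2 = math.log(pareto_survival(x2, sigma, alpha))
    return (y2 - y1) / (math.log(x2) - math.log(x1))
```

On a log-log plot the survival function is a straight line with slope `-alpha`, which is the usual signature of a Paretian (power-law) tail.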

References

  1. Gianni Amati, Claudio Carpineto, and Giovanni Romano. FUB at TREC 10 web track: a probabilistic framework for topic relevance term weighting. In Proceedings of the 10th Text Retrieval Conference (TREC-10), Gaithersburg, MD, 2001.

  2. Gianni Amati and Cornelis Joost van Rijsbergen. Probabilistic models of information retrieval based on measuring divergence from randomness. Submitted to TOIS, 2001.

  3. Barry C. Arnold. Pareto distributions. International Co-operative Publishing House, Fairland, Md., 1983.

  4. C. Carpineto, R. De Mori, G. Romano, and B. Bigi. An information theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27, 2001.

  5. D.G. Champernowne. The theory of income distribution. Econometrica, 5:379–381, 1937.

  6. Mark E. Crovella, Murad S. Taqqu, and Azer Bestavros. Heavy-tailed probability distributions in the world wide web. In R.J. Adler, R.E. Feldman, and M.S. Taqqu, editors, A practical guide to heavy tails. Birkhauser, Boston, Basel and Berlin, 1998.

  7. J.B. Estoup. Gammes Stenographiques. 4th edition, Paris, 1916.

  8. William Feller. An introduction to probability theory and its applications. Vol. I. John Wiley & Sons Inc., New York, third edition, 1968.

  9. William Feller. An Introduction to Probability Theory and Its Applications, volume II. John Wiley & Sons, New York, second edition, 1971.

  10. D. Hawking. Overview of the TREC-9 web track. In Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, 2001.

  11. G. Herdan. Quantitative Linguistics. Butterworths, 1964.

  12. Benoit Mandelbrot. On the theory of word frequencies and on related markovian models of discourse. In Proceedings of Symposia in Applied Mathematics. Vol. XII: Structure of language and its mathematical aspects, pages 190–219. American Mathematical Society, Providence, R.I., 1961. Roman Jakobson, editor.

  13. H. S. Sichel. Parameter estimation for a word frequency distribution based on occupancy theory. Comm. Statist. A—Theory Methods, 15(3):935–949, 1986.

  14. H. S. Sichel. Word frequency distributions and type-token characteristics. Math. Sci., 11(1):45–72, 1986.

  15. Herbert A. Simon. On a class of skew distribution functions. Biometrika, 42:425–440, 1955.

  16. J.C. Willis. Age and area. Cambridge University Press, London and New York, 1922.

  17. G.K. Zipf. Human behavior and the principle of least effort. Addison-Wesley Press, Reading, Massachusetts, 1949.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Amati, G., van Rijsbergen, C.J. (2002). Term Frequency Normalization via Pareto Distributions. In: Crestani, F., Girolami, M., van Rijsbergen, C.J. (eds) Advances in Information Retrieval. ECIR 2002. Lecture Notes in Computer Science, vol 2291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45886-7_13

  • DOI: https://doi.org/10.1007/3-540-45886-7_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43343-9

  • Online ISBN: 978-3-540-45886-9

  • eBook Packages: Springer Book Archive
