g-binary: A New Non-parameterized Code for Improved Inverted File Compression

  • Conference paper
Database and Expert Systems Applications (DEXA 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2736))

Abstract

The inverted file is a popular and efficient method for indexing text databases and is widely used in information retrieval applications. As a result, the research literature is rich in models, both global and local, that describe and compress inverted file indexes. Global models compress the entire inverted file index using the same method and can be divided into parameterized and non-parameterized ones. The latter use fixed codes and are applicable to dynamic collections of documents. Local models are always parameterized, in the sense that the method they use makes assumptions about the distribution of each word in the document collection of the text database. In the present study, we examine some of the most significant integer compression codes and propose g-binary, a new non-parameterized coding scheme that combines the Golomb codes and the binary representation of integers. The proposed coding scheme introduces no extra computational overhead compared to the existing non-parameterized codes. With regard to storage efficiency, experimental runs conducted on a number of TREC text database collections reveal an improvement of about 6% over the existing non-parameterized codes, an improvement that can make a difference for very large text database collections.
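
The abstract does not spell out how g-binary is constructed, so the following is only a minimal sketch of the standard background it builds on, not the authors' scheme. It assumes the usual setup in which an inverted list of document numbers is stored as gaps between successive entries, and it shows one well-known non-parameterized code (Elias gamma) next to the parameterized Golomb code. The function names `gaps`, `elias_gamma` and `golomb`, and the sample posting list, are illustrative assumptions.

```python
def to_bits(value, width):
    """Binary representation of value, left-padded to width bits ('' if width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""


def gaps(postings):
    """Convert a sorted list of document numbers into gaps (the first gap is the first number)."""
    return [b - a for a, b in zip([0] + postings[:-1], postings)]


def elias_gamma(n):
    """Elias gamma code of n >= 1: floor(log2 n) zeros, then n in binary.

    No parameter is needed, which is what makes the code non-parameterized.
    """
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def golomb(n, b):
    """Golomb code of n >= 1 with parameter b >= 1.

    The quotient (n - 1) // b is written in unary (ones closed by a zero),
    the remainder in truncated binary. The parameter b has to be chosen
    from the data, which is what makes the code parameterized.
    """
    if n < 1 or b < 1:
        raise ValueError("n and b must be >= 1")
    q, r = divmod(n - 1, b)
    k = (b - 1).bit_length()      # ceil(log2 b)
    short = (1 << k) - b          # number of remainders that get k - 1 bits
    if r < short:
        tail = to_bits(r, k - 1)
    else:
        tail = to_bits(r + short, k)
    return "1" * q + "0" + tail


if __name__ == "__main__":
    postings = [3, 8, 9, 17]      # hypothetical inverted list for one term
    for g in gaps(postings):
        print(g, elias_gamma(g), golomb(g, 3))
```

Small gaps receive short codewords under both codes; the difference is that the Golomb codeword length depends on a value of `b` fitted to the collection, whereas the gamma code is fixed in advance and therefore stays valid as documents are added.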



Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nitsos, I., Evangelidis, G., Dervos, D. (2003). g-binary: A New Non-parameterized Code for Improved Inverted File Compression. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_46

  • DOI: https://doi.org/10.1007/978-3-540-45227-0_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40806-2

  • Online ISBN: 978-3-540-45227-0

  • eBook Packages: Springer Book Archive
