g-binary: A New Non-parameterized Code for Improved Inverted File Compression

Nitsos, Ilias; Evangelidis, Georgios; Dervos, Dimitrios

doi:10.1007/978-3-540-45227-0_46

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2736)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

The inverted file is a popular and efficient method for indexing text databases and is widely used in information retrieval applications. As a result, the research literature is rich in models, both global and local, that describe and compress inverted file indexes. Global models compress the entire inverted file index using the same method and can be divided into parameterized and non-parameterized ones. The latter use fixed codes and are applicable to dynamic collections of documents. Local models are always parameterized, in the sense that the method they use makes assumptions about the distribution of each word in the document collection of the text database. In the present study, we examine some of the most significant integer compression codes and propose g-binary, a new non-parameterized coding scheme that combines the Golomb codes and the binary representation of integers. The proposed coding scheme introduces no extra computational overhead compared to the existing non-parameterized codes. With regard to storage efficiency, experimental runs on a number of TREC text database collections show an improvement of about 6% over the existing non-parameterized codes, an improvement that can make a difference for very large text database collections.
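
For readers unfamiliar with fixed (non-parameterized) integer codes, the following is a minimal sketch of one well-known example, the Elias gamma code, applied to the gaps between document ids in an inverted list. It is not the g-binary scheme, whose construction is not given in the abstract. The function names, the sample document ids, and the zero-prefix convention for the unary length are assumptions made for illustration only.

```python
def to_gaps(doc_ids):
    """Turn a strictly increasing list of document ids into gaps (d-gaps)."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def gamma_encode(n):
    """Elias gamma codeword of a positive integer, as a string of '0'/'1'.

    Convention assumed here: (length - 1) zeros, then the binary form of n.
    """
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary form; the leading bit is always 1
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the zero prefix
            zeros += 1
            i += 1
        # the next zeros + 1 bits are the binary form of the value
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


# Hypothetical posting list for one term.
doc_ids = [3, 7, 8, 20]
gaps = to_gaps(doc_ids)
encoded = "".join(gamma_encode(g) for g in gaps)
decoded = gamma_decode(encoded)
```

Because the code is fixed, no per-term or per-collection parameter has to be estimated or stored, which is why such codes suit dynamic collections; small gaps simply receive short codewords.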

Author information

Authors and Affiliations

Department of Applied Informatics, University of Macedonia, 156 Egnatia Str., 54006, Thessaloniki, Greece
Ilias Nitsos & Georgios Evangelidis
Department of Information Technology, TEI, P.O. Box 14561, 54101, Thessaloniki, Greece
Dimitrios Dervos

Editor information

Editors and Affiliations

Gerstner Laboratory, Czech Technical University in Prague, Technická 2, 166 27, Prague 6, Czech Republic
Vladimír Mařík
Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
Werner Retschitzegger
Faculty of Electrical Engineering, The Gerstner Laboratory, Czech Technical University in Prague, Technická 2, 166 27, Prague 6, Czech Republic
Olga Štěpánková

About this paper

Cite this paper

Nitsos, I., Evangelidis, G., Dervos, D. (2003). g-binary: A New Non-parameterized Code for Improved Inverted File Compression. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_46

DOI: https://doi.org/10.1007/978-3-540-45227-0_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40806-2
Online ISBN: 978-3-540-45227-0
eBook Packages: Springer Book Archive
