Access Time Tradeoffs in Archive Compression

Petri, Matthias; Moffat, Alistair; Nagesh, P. C.; Wirth, Anthony

doi:10.1007/978-3-319-28940-3_2

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)

Included in the following conference series: AIRS

Abstract

Web archives, query logs, proxy logs, and similar collections can be very large and highly repetitive, and are accessed only sporadically and partially, rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together with a greedy factorization of the whole text encoded using static integer codes. Here we demonstrate more precisely than before the scenarios in which RLZ excels. We contrast RLZ with alternatives based on block-based adaptive methods, including approaches that “prime” the encoding for each block, and measure a range of implementation options using both hard-disk (HDD) and solid-state (SSD) drives. For HDD, the dominant factor affecting access speed is the compression rate achieved, even when this involves larger dictionaries and larger blocks. When the data is on SSD the same effects are present, but not as markedly, and more complex trade-offs apply.
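The RLZ idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the evenly spaced sampling policy, the `(position, length)` factor layout, and the convention of marking a literal byte with length 0 are all assumptions made for illustration. The longest match is found with a naive `bytes.find` scan where a practical system would use an index over the dictionary such as a suffix array, and the final step of encoding the factors with static integer codes is omitted. Each document is factorized separately here so that any one of them can be decoded without touching the others.

```python
from typing import List, Tuple

Factor = Tuple[int, int]  # (dictionary position, length); length 0 marks a literal byte


def build_dictionary(text: bytes, sample_len: int, num_samples: int) -> bytes:
    """Semi-static model: concatenate evenly spaced samples of the text (assumed policy)."""
    if num_samples <= 0 or sample_len <= 0:
        return b""
    stride = max(1, len(text) // num_samples)
    return b"".join(text[k * stride:k * stride + sample_len] for k in range(num_samples))


def factorize(text: bytes, dictionary: bytes) -> List[Factor]:
    """Greedy factorization: at each position take the longest match in the dictionary."""
    factors: List[Factor] = []
    i = 0
    while i < len(text):
        pos, length = -1, 0
        # Extend the candidate one byte at a time; every prefix of a match also
        # occurs in the dictionary, so stopping at the first miss gives the longest match.
        while i + length < len(text):
            p = dictionary.find(text[i:i + length + 1])
            if p < 0:
                break
            pos, length = p, length + 1
        if length == 0:
            factors.append((text[i], 0))  # byte absent from the dictionary: emit a literal
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(factors: List[Factor], dictionary: bytes) -> bytes:
    """Rebuild one document from its factors; only the dictionary is needed."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)  # literal byte value stored in the first field
        else:
            out += dictionary[pos:pos + length]
    return bytes(out)


# Hypothetical two-document collection, for illustration only.
docs = [
    b"<html><body>hello world</body></html>",
    b"<html><body>goodbye world</body></html>",
]
dictionary = build_dictionary(b"".join(docs), sample_len=16, num_samples=4)
encoded = [factorize(d, dictionary) for d in docs]

# Random access: the second document is decoded without reading the first.
assert decode(encoded[1], dictionary) == docs[1]
```

In this sketch the dictionary must stay available to serve any request, while each document's factor list is read only when that document is asked for, which is one way to see why dictionary size and access cost trade off against each other.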

Notes

1. https://github.com/Cyan4973/lz4, accessed 27 July 2015.
2. http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm, accessed 27 July 2015.
3. Our concerns in this regard have been communicated to the authors of [5].

References

1. Bergman, A., Zohar, E.: Compressing Yahoo mail. In: Proceedings of the DCC, pp. 223–232 (2015)
2. Ferrada, H., Gagie, T., Gog, S., Puglisi, S.J.: Relative Lempel-Ziv with constant-time random access. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 13–17. Springer, Heidelberg (2014)
3. Fiala, E.R., Greene, D.H.: Data compression with finite windows. Commun. ACM 32(4), 490–505 (1989)
4. Gog, S., Beller, T., Moffat, A., Petri, M.: From theory to practice: plug and play with succinct data structures. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 326–337. Springer, Heidelberg (2014)
5. Hoobin, C., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. PVLDB 5(3), 265–273 (2011)
6. Moffat, A., Turpin, A.: Compression and Coding Algorithms. Kluwer, Boston (2002)
7. Tong, J., Wirth, A., Zobel, J.: Principled dictionary pruning for low-memory corpus compression. In: Proceedings of the SIGIR, pp. 283–292 (2014)
8. Webber, W., Moffat, A.: In search of reliable retrieval experiments. In: Proceedings of the 10th Australasian Document Computing Symposium, pp. 26–33 (2005)
9. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)

Acknowledgments

This work was supported under the Australian Research Council’s Discovery Projects scheme (project DP140103256). We had access to the code of Hoobin et al. while working on this project, and we thank them for making it available.

Author information

Authors and Affiliations

Department of Computing and Information Systems, The University of Melbourne, Victoria, 3010, Australia
Matthias Petri, Alistair Moffat, P. C. Nagesh & Anthony Wirth

Corresponding author

Correspondence to Alistair Moffat.

Editor information

Editors and Affiliations

Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia
Guido Zuccon
Brisbane, Queensland, Australia
Shlomo Geva
University of Tsukuba, Ibaraki, Japan
Hideo Joho
RMIT University, Melbourne, Australia
Falk Scholer
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Aixin Sun
Tianjin University, China
Peng Zhang

About this paper

Cite this paper

Petri, M., Moffat, A., Nagesh, P.C., Wirth, A. (2015). Access Time Tradeoffs in Archive Compression. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_2

DOI: https://doi.org/10.1007/978-3-319-28940-3_2
Published: 22 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28939-7
Online ISBN: 978-3-319-28940-3
eBook Packages: Computer Science, Computer Science (R0)