Indexing Compressed Text

Ferragina, Paolo; Venturini, Rossano

doi:10.1007/978-1-4614-8265-9_1144

Reference work entry
First Online: 01 January 2018

Synonyms

Compressed and searchable data format; Compressed full-text indexing; Compressed suffix array; Compressed suffix tree

Definition

Given a text T[1, n], the Compressed Text Indexing problem requires building an indexing data structure over T that takes space close to the empirical entropy of the input text and answers queries on the occurrences of an arbitrary pattern P[1, p] in T without any significant slowdown with respect to uncompressed indexes. There are three main queries: count(P), which returns the number of pattern occurrences in T; locate(P), which returns the starting positions of all pattern occurrences in T; and extract(i, j), which retrieves the substring T[i, j].
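
The semantics of the three queries can be illustrated with a minimal Python sketch. It is only an uncompressed baseline built on a plain suffix array, not a compressed index and not the construction described in this entry; the class name NaiveTextIndex and its helper _range are hypothetical names chosen for illustration. Positions are 1-based, following the T[1, n] notation above.

```python
from bisect import bisect_left, bisect_right


class NaiveTextIndex:
    """Uncompressed suffix-array baseline (illustrative only).

    It uses far more space than the entropy-bounded indexes the problem
    asks for; it only shows what count, locate, and extract return.
    """

    def __init__(self, text: str):
        self.text = text
        # Suffix array: 0-based starting offsets of all suffixes of the text,
        # in lexicographic order. Plain sorting is fine for a toy input.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])

    def _range(self, pattern: str):
        # Suffixes that start with `pattern` occupy a contiguous range of the
        # suffix array. Truncating sorted suffixes to len(pattern) characters
        # keeps them sorted, so binary search applies. Rebuilding this list on
        # every query is deliberately naive.
        prefixes = [self.text[i:i + len(pattern)] for i in self.sa]
        return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)

    def count(self, pattern: str) -> int:
        """Number of occurrences of the pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern: str) -> list:
        """1-based starting positions of all occurrences, in increasing order."""
        lo, hi = self._range(pattern)
        return sorted(self.sa[k] + 1 for k in range(lo, hi))

    def extract(self, i: int, j: int) -> str:
        """Substring T[i, j], with 1-based inclusive endpoints."""
        return self.text[i - 1:j]


index = NaiveTextIndex("mississippi")
print(index.count("ssi"))    # 2
print(index.locate("ssi"))   # [3, 6]
print(index.extract(5, 8))   # issi
```

A compressed index is expected to answer the same three queries with the same results while occupying space close to the empirical entropy of T, rather than the full text plus one integer per suffix as in this sketch.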

Historical Background

String processing and searching tasks are at the core of modern web search, information retrieval (IR), database, and data mining applications. Most of the text manipulations required by these applications involve, sooner or later, searching those (long) texts for (short) patterns...

Author information

Authors and Affiliations

Department of Computer Science, University of Pisa, Pisa, Italy
Paolo Ferragina & Rossano Venturini

Corresponding author

Correspondence to Paolo Ferragina.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Dept. of Computing Science, Univ. of Alberta, Edmonton, Alberta, Canada
Mario A. Nascimento

About this entry

Cite this entry

Ferragina, P., Venturini, R. (2018). Indexing Compressed Text. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_1144

DOI: https://doi.org/10.1007/978-1-4614-8265-9_1144
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
