Encyclopedia of Database Systems

2018 Edition | Editors: Ling Liu, M. Tamer Özsu

Indexing Compressed Text

  • Paolo Ferragina
  • Rossano Venturini
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_1144

Synonyms

Compressed and searchable data format; Compressed full-text indexing; Compressed suffix array; Compressed suffix tree

Definition

Given a text T[1, n], the Compressed Text Indexing problem requires building an indexing data structure over T that takes space close to the empirical entropy of the input text and answers queries on the occurrences of an arbitrary pattern P[1, p] in T without any significant slowdown with respect to uncompressed indexes. There are three main queries: count(P), which returns the number of pattern occurrences in T; locate(P), which returns the starting positions of all pattern occurrences in T; and extract(i, j), which retrieves the substring T[i, j].
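
To make the semantics of the three queries concrete, the following minimal sketch implements them over a plain, uncompressed suffix array. It is only a baseline for illustration: it does not achieve entropy-bounded space, its construction is deliberately naive, and the class name NaiveTextIndex and its method layout are assumptions introduced here, not part of the entry. Positions are 1-based to match the notation T[1, n].

```python
class NaiveTextIndex:
    """Uncompressed suffix-array baseline (hypothetical name, for illustration only)."""

    def __init__(self, text: str):
        self.text = text
        # Suffix array: 0-based starting offsets of all suffixes in lexicographic order.
        # Naive construction by sorting the suffixes themselves; fine for small inputs.
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])

    def _prefix(self, row: int, length: int) -> str:
        """First `length` characters of the suffix stored at suffix-array row `row`."""
        start = self.sa[row]
        return self.text[start:start + length]

    def _range(self, pattern: str):
        """Half-open range [lo, hi) of suffix-array rows whose suffix starts with pattern."""
        p = len(pattern)
        # Leftmost row whose length-p prefix is >= pattern.
        lo, hi = 0, len(self.sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._prefix(mid, p) < pattern:
                lo = mid + 1
            else:
                hi = mid
        first = lo
        # Leftmost row whose length-p prefix is > pattern.
        hi = len(self.sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if self._prefix(mid, p) <= pattern:
                lo = mid + 1
            else:
                hi = mid
        return first, lo

    def count(self, pattern: str) -> int:
        """Number of occurrences of pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern: str) -> list:
        """1-based starting positions of all occurrences, in increasing order."""
        lo, hi = self._range(pattern)
        return sorted(self.sa[row] + 1 for row in range(lo, hi))

    def extract(self, i: int, j: int) -> str:
        """Substring T[i, j], with 1-based inclusive bounds."""
        return self.text[i - 1:j]


index = NaiveTextIndex("mississippi")
print(index.count("ssi"))    # 2
print(index.locate("ssi"))   # [3, 6]
print(index.extract(5, 8))   # issi
```

A compressed index is expected to expose the same three operations while replacing the explicit text and suffix array with a representation whose size is close to the empirical entropy of T.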

Historical Background

String processing and searching tasks are at the core of modern web search, information retrieval (IR), database, and data mining applications. Most of the text manipulations required by these applications involve, sooner or later, searching those (long) texts for (short) patterns...

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, University of Pisa, Pisa, Italy

Section editors and affiliations

  • Mario A. Nascimento (1)
  1. Dept. of Computing Science, Univ. of Alberta, Edmonton, Canada