Text Indexing, Suffix Sorting, and Data Compression: Common Problems and Techniques

Grossi, Roberto

doi:10.1007/978-3-642-02441-2_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5577)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

The talk is a guided tour of text indexing data structures, suffix sorting, and data compression. We discuss how they share common problems on text suffixes, showing the interplay among some of the algorithmic techniques that have been devised so far. In the following, given a text T = T[1,n] of n symbols, we denote by s_i = T[i,n] its i-th suffix, for 1 ≤ i ≤ n.

A text indexing data structure stores the suffixes s_1, s_2, ..., s_n of T at preprocessing time, in a suitable format that can support pattern matching queries over T. For example, given a pattern string P of m symbols, one type of query computes how many times P appears in T. Its O(m + log n) time complexity in the comparison model compares favorably with the O(m + n) cost required by full text scanning [8]. Notable examples of text indexing data structures are suffix trees [10,14] and suffix arrays [9] for usage in main memory, and String B-trees [4] and cache-oblivious string B-trees [1] for usage in external and hierarchical memory, to name a few.
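
As a rough illustration of the counting query, the following Python sketch searches a suffix array with two binary searches. It is a minimal sketch under stated assumptions, not the method of [9]: the function names `build_suffix_array` and `count_occurrences` are hypothetical, positions are 0-based rather than 1-based, the array is built naively, and each query costs O(m log n) character comparisons rather than the O(m + log n) bound quoted above, which needs additional longest-common-prefix information.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the 0-based start positions of the suffixes of `text`
    in lexicographic order. Naive construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count the occurrences of `pattern` in `text` using its suffix array `sa`.

    All suffixes that start with `pattern` form one contiguous range of `sa`,
    so two binary searches on the length-m prefixes delimit that range.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return lo - first


if __name__ == "__main__":
    T = "banana"
    SA = build_suffix_array(T)
    print(SA)                               # suffix start positions, sorted
    print(count_occurrences(T, SA, "ana"))  # overlapping occurrences are counted
```

The index is built once at preprocessing time and reused for every pattern, which is where the advantage over rescanning the text for each query comes from.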

Suffix sorting requires arranging the suffixes s_1, s_2, ..., s_n in lexicographic order. This is the major computational bottleneck in suffix-based algorithms, and can be solved in O(n log n) time in the comparison model (e.g. [7]). Having sorted the suffixes, it is not difficult to build a text indexing data structure in (nearly) linear time. Suffix sorting is crucial also in data compression, as witnessed by the importance of the Burrows-Wheeler transform [3]. The techniques adopted in the aforementioned topics converged in several ways into the rich fields of compressed text indexing [5,6,11,13] and succinct data structures [2,12], with some old and new open problems.
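
To make the link between suffix sorting and the Burrows-Wheeler transform concrete, here is a small Python sketch. It is an illustration under assumptions, not the algorithm of [7] or the implementation of [3]: it uses prefix doubling with a library sort, which takes O(n log² n) time rather than O(n log n), and the names `suffix_sort` and `bwt`, the 0-based positions, and the `$` end marker are choices made for this example.

```python
def suffix_sort(text: str) -> list[int]:
    """Return the 0-based start positions of the suffixes of `text`
    in lexicographic order, computed by prefix doubling."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # order of suffixes by their first symbol
    sa = list(range(n))
    k = 1
    while True:
        # Compare suffixes by their first 2k symbols, using the ranks of
        # the two halves; -1 stands for "past the end of the text".
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank

        if rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            return sa
        k *= 2


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform of `text` followed by a unique end marker.

    Each output symbol is the one that precedes a suffix in the text,
    listed in the sorted order of the suffixes.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # For i == 0, t[i - 1] wraps around to the sentinel, as intended.
    return "".join(t[i - 1] for i in suffix_sort(t))


if __name__ == "__main__":
    print(suffix_sort("banana"))
    print(bwt("banana"))
```

Once the suffixes are sorted, the transform is a single linear pass, which is why suffix sorting dominates the cost.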

References

1. Bender, M.A., Farach-Colton, M., Kuszmaul, B.C.: Cache-oblivious string B-trees. In: Proc. of the 25th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, Chicago, IL, USA, June 26–28, 2006, pp. 233–242. ACM Press, New York (2006)
2. Brodnik, A., Munro, J.I.: Membership in constant time and almost-minimum space. SIAM Journal on Computing 28(5), 1627–1640 (1999)
3. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Research Report 124, Digital SRC, Palo Alto, CA, USA (May 1994)
4. Ferragina, P., Grossi, R.: The String B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM 46(2), 236–280 (1999)
5. Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4), 552–581 (2005)
6. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(2) (2005)
7. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. Journal of the ACM 53(6), 918–936 (2006)
8. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6, 323–350 (1977)
9. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)
10. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)
11. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2:1–2:61 (2007)
12. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3(4), 1–43 (2007)
13. Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48(2), 294–313 (2003)
14. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

Author information

Authors and Affiliations

Università di Pisa, Italy
Roberto Grossi

Editor information

Editors and Affiliations

LIFL - Bâtiment M3, 59655 Villeneuve d’Ascq Cédex, France
Gregory Kucherov
Department of Computer Science, University of Helsinki, Gustaf Hällströmin katu 2b, P.O. Box 68, FI-00014, Finland
Esko Ukkonen

Cite this paper

Grossi, R. (2009). Text Indexing, Suffix Sorting, and Data Compression: Common Problems and Techniques. In: Kucherov, G., Ukkonen, E. (eds) Combinatorial Pattern Matching. CPM 2009. Lecture Notes in Computer Science, vol 5577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02441-2_4

DOI: https://doi.org/10.1007/978-3-642-02441-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02440-5
Online ISBN: 978-3-642-02441-2
eBook Packages: Computer Science, Computer Science (R0)