Text Indexing, Suffix Sorting, and Data Compression: Common Problems and Techniques

  • Conference paper
Combinatorial Pattern Matching (CPM 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5577)

Abstract

The talk is a guided tour of text indexing data structures, suffix sorting, and data compression. We discuss how they share common problems on text suffixes, showing the interplay among some of the algorithmic techniques that have been devised so far. In the following, given a text T = T[1,n] of n symbols, we denote by s_i its suffix s_i = T[i,n] for 1 ≤ i ≤ n.

A text indexing data structure stores the suffixes s_1, s_2, ..., s_n of T at preprocessing time, in a suitable format that can support pattern matching queries over T. For example, given a pattern string P of m symbols, one type of query computes how many times P appears in T; its O(m + log n) time complexity in the comparison model compares favorably with the O(m + n) cost required by full text scanning [8]. Notable examples of text indexing data structures are suffix trees [10,14] and suffix arrays [9] for usage in main memory, and string B-trees [4] and cache-oblivious string B-trees [1] for usage in external and hierarchical memory, to name a few.
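As a minimal sketch of the counting query described above, the following Python code searches a suffix array with two binary searches. It is an illustration under stated assumptions, not the method of the talk: the function names `build_suffix_array` and `count_occurrences` are hypothetical, indices are 0-based (the abstract uses 1-based positions), the construction is a naive sort, and each query costs O(m log n) rather than the O(m + log n) bound, which needs additional longest-common-prefix information.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix they denote.

    Illustrative only; this takes far more than O(n log n) time in the worst case.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern in text using two binary searches over sa.

    Every occurrence of pattern is a prefix of some suffix, and those suffixes
    form one contiguous range of the suffix array.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return lo - first


text = "banana"
sa = build_suffix_array(text)
print(sa)                                  # suffix start positions in sorted order
print(count_occurrences(text, sa, "ana"))  # number of occurrences of "ana"
```

Each comparison inspects at most m symbols, which is where the O(m log n) query cost of this simplified version comes from.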

Suffix sorting requires arranging the suffixes s_1, s_2, ..., s_n in lexicographic order. This is the major computational bottleneck in suffix-based algorithms, and can be solved in O(n log n) time in the comparison model (e.g. [7]). Having sorted the suffixes, it is not difficult to build a text indexing data structure in (nearly) linear time. Suffix sorting is crucial also in data compression, as witnessed by the importance of the Burrows-Wheeler transform [3]. The techniques adopted in the aforementioned topics converged in several ways into the rich fields of compressed text indexing [5,6,11,13] and succinct data structures [2,12], with some old and new open problems.
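To make the link between suffix sorting and the Burrows-Wheeler transform concrete, here is a small Python sketch. It assumes a prefix-doubling scheme with a comparison sort in each round, so it runs in O(n log^2 n) time rather than the O(n log n) or linear bounds of the cited algorithms. The names `suffix_array_doubling` and `bwt_from_suffix_array` are hypothetical, indices are 0-based, and the text is assumed to end with a unique sentinel `$` that is smaller than every other symbol.

```python
def suffix_array_doubling(text):
    """Sort suffixes by prefix doubling: O(log n) rounds, each with one sort."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # rank by the first symbol only
    sa = list(range(n))
    k = 1
    while True:
        # Compare 2k symbols using the ranks of two adjacent k-symbol blocks;
        # -1 stands for "past the end of the text" and sorts first.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for j in range(1, n):
            differs = key(sa[j]) != key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (1 if differs else 0)
        rank = new_rank

        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            return sa
        k *= 2


def bwt_from_suffix_array(text, sa):
    """Burrows-Wheeler transform: the symbol preceding each sorted suffix.

    For the suffix starting at 0, text[-1] wraps around to the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


text = "banana$"  # "$" is the assumed end-of-text sentinel
sa = suffix_array_doubling(text)
print(sa)
print(bwt_from_suffix_array(text, sa))
```

Once the suffixes are sorted, the transform is a single linear pass over the suffix array, which is why suffix sorting dominates the cost.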

References

  1. Bender, M.A., Farach-Colton, M., Kuszmaul, B.C.: Cache-oblivious string B-trees. In: ACM (ed.) Proc. of the 25th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, Chicago, IL, USA, June 26–28, 2006, pp. 233–242. ACM Press, New York (2006)

  2. Brodnik, A., Munro, J.I.: Membership in constant time and almost-minimum space. SIAM Journal on Computing 28(5), 1627–1640 (1999)

  3. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Research Report 124, Digital SRC, Palo Alto, CA, USA (May 1994)

  4. Ferragina, P., Grossi, R.: The String B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM 46(2), 236–280 (1999)

  5. Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4), 552–581 (2005)

  6. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(2) (2005)

  7. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. Journal of the ACM 53(6), 918–936 (2006)

  8. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6, 323–350 (1977)

  9. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)

  10. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)

  11. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2:1–2:61 (2007)

  12. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3(4), 1–43 (2007)

  13. Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms 48(2), 294–313 (2003)

  14. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Grossi, R. (2009). Text Indexing, Suffix Sorting, and Data Compression: Common Problems and Techniques. In: Kucherov, G., Ukkonen, E. (eds) Combinatorial Pattern Matching. CPM 2009. Lecture Notes in Computer Science, vol 5577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02441-2_4

  • DOI: https://doi.org/10.1007/978-3-642-02441-2_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02440-5

  • Online ISBN: 978-3-642-02441-2

  • eBook Packages: Computer Science, Computer Science (R0)
