Succinct Suffix Arrays Based on Run-Length Encoding

Mäkinen, Veli; Navarro, Gonzalo

doi:10.1007/11496656_5

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3537)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

A succinct full-text self-index is a data structure built on a text \(T = t_1 t_2 \ldots t_n\) that takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern \(P = p_1 p_2 \ldots p_m\) in \(T\), and can reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. They usually take \(O(nH_0)\) or \(O(nH_k)\) bits, where \(H_k\) is the \(k\)th-order empirical entropy of \(T\). The time to count how many times \(P\) occurs in \(T\) ranges from \(O(m)\) to \(O(m \log n)\).
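To make the space bounds concrete, the sketch below computes the empirical entropies \(H_0\) and \(H_k\) of a short string. It assumes the standard definition, in which \(nH_k\) is the sum, over every length-\(k\) context, of the zero-order entropy of the symbols that follow that context. The function names `h0` and `hk` are illustrative and do not come from the paper.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """Zero-order empirical entropy of the sequence s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(s).values())

def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Assumed definition: group each symbol by the k symbols preceding it,
    then average the zero-order entropies of the groups, weighted by size.
    """
    n = len(text)
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n

if __name__ == "__main__":
    t = "abababababab"
    print(h0(t))     # two equally frequent symbols
    print(hk(t, 1))  # each symbol is fully determined by its predecessor
```

On a highly regular text, `hk` drops well below `h0`, which is why an \(O(nH_k)\)-bit index can be far smaller than an \(O(nH_0)\)-bit one.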

We present a new self-index, called the run-length FM-index (RLFM index), that counts the occurrences of \(P\) in \(T\) in \(O(m)\) time when the alphabet size is \(\sigma=O(\textrm{polylog}(n))\). The index requires \(nH_k \log_2 \sigma + O(n)\) bits of space for small \(k\). We then show how to implement the RLFM index in practice, and obtain in passing another implementation with different space-time tradeoffs. We empirically compare our implementations against the best existing implementations of other indexes and show that ours are the fastest among indexes taking less space than the text.
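The abstract does not describe the construction itself, so the following is only a generic sketch of the underlying idea: counting by backward search over a run-length-encoded Burrows-Wheeler transform. It is not the authors' data structure. In particular, `rank` here scans the runs linearly, so the sketch does not achieve \(O(m)\) counting time; a real index would use succinct rank/select structures instead. All function names, and the choice of `"\0"` as the end-of-text sentinel, are assumptions made for illustration.

```python
from itertools import groupby

def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transform via sorted rotations (slow; demo only)."""
    s = text + sentinel  # assumed: sentinel is absent from text and sorts first
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def run_length_encode(bwt_str):
    """Collapse the BWT into (symbol, run length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(bwt_str)]

def rank(runs, c, i):
    """Occurrences of c among the first i symbols of the run-encoded BWT."""
    count = 0
    for sym, length in runs:
        if i <= 0:
            break
        take = min(length, i)
        if sym == c:
            count += take
        i -= take
    return count

def count_occurrences(runs, pattern):
    """Backward search: one rank step per pattern symbol, right to left."""
    n = sum(length for _, length in runs)
    totals = {}
    for sym, length in runs:
        totals[sym] = totals.get(sym, 0) + length
    # C[c] = number of symbols in the text that are strictly smaller than c
    C, acc = {}, 0
    for sym in sorted(totals):
        C[sym] = acc
        acc += totals[sym]
    lo, hi = 0, n  # half-open range of sorted-suffix rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(runs, c, lo)
        hi = C[c] + rank(runs, c, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    runs = run_length_encode(bwt("abracadabra"))
    print(len(runs), count_occurrences(runs, "abra"))
```

The BWT of a compressible text tends to contain long runs of equal symbols, so storing one entry per run rather than one per symbol is what ties the index size to the number of runs.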

Author information

Authors and Affiliations

AG Genominformatik, Technische Fakultät Universität Bielefeld, Germany
Veli Mäkinen
Center for Web Research Dept. of Computer Science, University of Chile,
Gonzalo Navarro

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova,
Alberto Apostolico
Université Paris-Est, France
Maxime Crochemore
School of Computer Science and Engineering, Seoul National University, 151-742, Seoul, Korea
Kunsoo Park

About this paper

Cite this paper

Mäkinen, V., Navarro, G. (2005). Succinct Suffix Arrays Based on Run-Length Encoding. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_5

DOI: https://doi.org/10.1007/11496656_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer Science (R0)