Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

  • Conference paper
String Processing and Information Retrieval (SPIRE 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5280)

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper studies how to store massive, highly repetitive sequence collections in a space-efficient manner while still supporting time-efficient retrieval of the content and queries on it. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer new structures that use run-length encoding and give empirical evidence that these structures are superior to the current ones.
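To make the intuition behind run-length encoding concrete, here is a minimal Python sketch. It is not the authors' data structure. It builds a toy repetitive collection, computes its Burrows-Wheeler transform with a naive rotation sort, and counts the runs of equal symbols. The function names, the `$` separator, the `\0` terminator, and the use of substitutions as the only edit operation are all assumptions made for illustration.

```python
import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform by naive rotation sorting (small inputs only)."""
    assert "\0" not in text
    s = text + "\0"  # assumed terminator, smaller than every other symbol
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The BWT symbol of a rotation is the character preceding its start.
    return "".join(s[i - 1] for i in order)


def run_length_encode(s: str) -> list:
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def make_collection(base: str, copies: int, edits_per_copy: int, rng) -> str:
    """Hypothetical generator: copies of `base`, each with a few substitutions."""
    seqs = []
    for _ in range(copies):
        seq = list(base)
        for _ in range(edits_per_copy):
            seq[rng.randrange(len(seq))] = rng.choice("ACGT")
        seqs.append("".join(seq))
    return "$".join(seqs)  # "$" is an assumed separator between sequences


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choice("ACGT") for _ in range(100))
    collection = make_collection(base, copies=20, edits_per_copy=1, rng=rng)
    unrelated = "".join(rng.choice("ACGT") for _ in range(len(collection)))
    print("total length N:", len(collection))
    print("BWT runs, repetitive collection:", len(run_length_encode(bwt(collection))))
    print("BWT runs, random text:", len(run_length_encode(bwt(unrelated))))
```

In this toy setting the run count of the repetitive collection should stay close to that of the base sequence plus a small cost per edit, whereas the random text of the same length has a run count proportional to its length. Storing one entry per run rather than per symbol is the general idea that run-length compressed indexes exploit.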

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G. (2008). Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections. In: Amir, A., Turpin, A., Moffat, A. (eds) String Processing and Information Retrieval. SPIRE 2008. Lecture Notes in Computer Science, vol 5280. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89097-3_17

  • DOI: https://doi.org/10.1007/978-3-540-89097-3_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-89096-6

  • Online ISBN: 978-3-540-89097-3

  • eBook Packages: Computer Science, Computer Science (R0)
