Compressed Suffix Arrays for Massive Data

Sirén, Jouni

doi:10.1007/978-3-642-03784-9_7

Jouni Sirén¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 5721))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

1268 Accesses
22 Citations

Abstract

We present a fast space-efficient algorithm for constructing compressed suffix arrays (CSA). The algorithm requires O(n logn) time in the worst case, and only O(n) bits of extra space in addition to the CSA. As the basic step, we describe an algorithm for merging two CSAs. We show that the construction algorithm can be parallelized in a symmetric multiprocessor system, and discuss the possibility of a distributed implementation. We also describe a parallel implementation of the algorithm, capable of indexing several gigabytes per hour.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal on Discrete Algorithms 2(1), 53–86 (2004)
Article MathSciNet MATH Google Scholar
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Google Scholar
Chan, H.-L., Hon, W.-K., Lam, T.-W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2), 21 (2007)
Article MathSciNet MATH Google Scholar
Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32(1), 1–35 (2002)
Article MathSciNet MATH Google Scholar
Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. Journal of Experimental Algorithms 12, article no. 3.4 (2008)
Google Scholar
Elias, P.: Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21(2), 194–203 (1975)
Article MathSciNet MATH Google Scholar
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. Journal of Experimental Algorithms 13, article no. 1.12 (2009)
Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. Journal of the ACM 52(4), 552–581 (2005)
Article MathSciNet MATH Google Scholar
Gerlach, W.: Dynamic FM-index for a collection of texts with application to space-efficient construction of the compressed suffix array. Master’s thesis, Bielefeld University (2007)
Google Scholar
Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT trees and PAT arrays. In: Information retrieval: data structures and algorithms, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)
Google Scholar
González, R., Navarro, G.: Improved dynamic rank-select entropy-bound structures. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 374–386. Springer, Heidelberg (2008)
Chapter Google Scholar
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing 35(2), 378–407 (2005)
Article MathSciNet MATH Google Scholar
Hon, W.-K., Lam, T.-W., Sadakane, K., Sung, W.-K., Yiu, S.-M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48(1), 23–36 (2007)
Article MathSciNet MATH Google Scholar
Hon, W.-K., Lam, T.-W., Sung, W.-K., Tse, W.-L., Wong, C.-K., Yiu, S.-M.: Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In: ALENEX 2004, pp. 31–38. SIAM, Philadelphia (2004)
Google Scholar
Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indices. SIAM Journal on Computing 38(6), 2162–2178 (2009)
Article MathSciNet MATH Google Scholar
Kärkkäinen, J.: Fast BWT in small space by blockwise suffix sorting. Theoretical Computer Science 387(3), 249–257 (2007)
Article MathSciNet MATH Google Scholar
Kulla, F., Sanders, P.: Scalable parallel suffix array construction. Parallel Computing 33(9), 605–612 (2007)
Article Google Scholar
Larsson, N.J., Sadakane, K.: Faster suffix sorting. Theoretical Computer Science 387(3), 258–272 (2007)
Article MathSciNet MATH Google Scholar
Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 95–106. Springer, Heidelberg (2007)
Chapter Google Scholar
Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms 4(3), 32 (2008)
Article MathSciNet MATH Google Scholar
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of individual genomes. In: RECOMB 2009. LNCS, vol. 5541, pp. 121–137. Springer, Heidelberg (2009)
Google Scholar
Na, J.C., Park, K.: Alphabet-independent linear-time construction of compressed suffix arrays using o(nlogn)-bit working space. Theoretical Computer Science 385(1-3), 127–136 (2007)
Article MathSciNet MATH Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)
Article MATH Google Scholar
Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39(2), 4 (2007)
Article Google Scholar
Salson, M., Lecroq, T., Léonard, M., Mouchard, L.: Dynamic extended suffix arrays. Accepted to Journal of Discrete Algorithms
Google Scholar
Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Finland
Jouni Sirén

Authors

Jouni Sirén
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Swedish Institute of Computer Science, Kista, Sweden
Jussi Karlgren
Department of Computer Science and Engineering, Helsinki University of Technology, P.O. Box 5400, 02015 HUT, Espoo, Finland
Jorma Tarhio
Department of Computer Sciences, University of Tampere, Tampere, Finland
Heikki Hyyrö

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Sirén, J. (2009). Compressed Suffix Arrays for Massive Data. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds) String Processing and Information Retrieval. SPIRE 2009. Lecture Notes in Computer Science, vol 5721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03784-9_7

Download citation

DOI: https://doi.org/10.1007/978-3-642-03784-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03783-2
Online ISBN: 978-3-642-03784-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics