Suffix Arrays on Words

  • Conference paper
Combinatorial Pattern Matching (CPM 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4580)

Abstract

Surprisingly enough, it is not yet known how to directly build a suffix array that indexes just the k positions at word boundaries of a text T[1,n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves these optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1,2] and (compact) DAWGs [3,4]. Compared with these other word indexes, our solution inherits the simplicity and efficiency of suffix arrays, and it therefore has potential applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays can be constructed twice as fast as their full-text counterparts, using as little as 20% of the working space. The smaller size of the final word-based suffix array also improves its query time (i.e., fewer random-access binary-search steps), which is faster by a factor of up to 3.
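
To make the object described in the abstract concrete, here is a minimal Python sketch of a word-based suffix array and its binary-search lookup. It is not the O(n)-time, O(k)-space construction proposed in the paper: it uses a naive comparison sort, and the slicing copies substrings. The function names `word_suffix_array` and `find`, and the choice of a single space as word separator, are assumptions made for illustration only.

```python
def word_suffix_array(text, sep=" "):
    """Naive word suffix array (illustrative, not the paper's algorithm).

    Collects the k positions where a word starts and sorts them by the
    suffix of `text` beginning at each position.
    """
    starts = [
        i for i in range(len(text))
        if text[i] != sep and (i == 0 or text[i - 1] == sep)
    ]
    return sorted(starts, key=lambda i: text[i:])


def find(text, wsa, pattern):
    """Return the word-aligned positions whose suffix starts with `pattern`.

    Two binary searches over the k entries of `wsa` locate the contiguous
    range of matching suffixes.
    """
    m = len(pattern)

    # Lower bound: first entry whose length-m prefix is >= pattern.
    lo, hi = 0, len(wsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[wsa[mid]:wsa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Upper bound: first entry whose length-m prefix is > pattern.
    hi = len(wsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[wsa[mid]:wsa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(wsa[left:lo])


text = "to be or not to be"
wsa = word_suffix_array(text)
print(wsa)                      # word-start positions in suffix order
print(find(text, wsa, "to be")) # word-aligned occurrences of "to be"
```

Because only word starts are indexed, the search runs over k entries rather than n, and a pattern such as "o" matches only where a word begins with it (the "o" inside "to" and "not" is not reported).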

The first author has been partially supported by the Italian MIUR grant Italy-Israel FIRB “Pattern Discovery Algorithms in Discrete Structures, with Applications to Bioinformatics”, and by the Yahoo! Research grant on “Data compression and indexing in hierarchical memories”. The second author has been partially funded by the German Research Foundation (DFG, Bioinformatics Initiative).

References

  1. Andersson, A., Larsson, N.J., Swanson, K.: Suffix Trees on Words. Algorithmica 23(3), 246–260 (1999)

  2. Inenaga, S., Takeda, M.: On-Line Linear-Time Construction of Word Suffix Trees. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 60–71. Springer, Heidelberg (2006)

  3. Inenaga, S., Takeda, M.: Sparse Directed Acyclic Word Graphs. In: Crestani, F., Ferragina, P., Sanderson, M. (eds.) SPIRE 2006. LNCS, vol. 4209, pp. 61–73. Springer, Heidelberg (2006)

  4. Inenaga, S., Takeda, M.: Sparse compact directed acyclic word graphs. In: Stringology, pp. 197–211 (2006)

  5. Isal, R.Y.K., Moffat, A.: Word-based block-sorting text compression. In: Australasian Conference on Computer Science, pp. 92–99. IEEE Press, New York (2001)

  6. Yamamoto, M., Church, K.W.: Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics 27(1), 1–30 (2001)

  7. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge (1997)

  8. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (to appear), Preliminary version available at http://www.dcc.uchile.cl/~gnavarro/ps/acmcs06.ps.gz

  9. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, San Francisco (1999)

  10. Zobel, J., Moffat, A., Ramamohanarao, K.: Guidelines for Presentation and Comparison of Indexing Techniques. SIGMOD Record 25(3), 10–15 (1996)

  11. Hon, W.K., Sadakane, K., Sung, W.K.: Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. In: Proc. FOCS, pp. 251–260. IEEE Computer Society, Los Alamitos (2003)

  12. Ukkonen, E.: On-line Construction of Suffix Trees. Algorithmica 14(3), 249–260 (1995)

  13. Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.I.: The Smallest Automaton Recognizing the Subwords of a Text. Theor. Comput. Sci. 40, 31–55 (1985)

  14. Inenaga, S., Hoshino, H., Shinohara, A., Takeda, M., Arikawa, S., Mauri, G., Pavesi, G.: On-line construction of compact directed acyclic word graphs. Discrete Applied Mathematics 146(2), 156–179 (2005)

  15. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear Work Suffix Array Construction. J. ACM 53(6), 1–19 (2006)

  16. Inenaga, S.: personal communication (December 2006)

  17. Kärkkäinen, J., Ukkonen, E.: Sparse Suffix Trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996)

  18. Ferragina, P., Venturini, R.: A Simple Storage Scheme for Strings Achieving Entropy Bounds. Theoretical Computer Science 372(1), 115–121 (2007)

  19. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing Suffix Trees with Enhanced Suffix Arrays. J. Discrete Algorithms 2(1), 53–86 (2004)

  20. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)

  21. Aluru, S.: Handbook of Computational Molecular Biology. Chapman & Hall/CRC, Sydney, Australia (2006)

  22. Manber, U., Myers, E.W.: Suffix Arrays: A New Method for On-Line String Searches. SIAM J. Comput. 22(5), 935–948 (1993)

  23. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Proc. ESCAPE. LNCS (to appear, 2007)

  24. Larsson, N.J., Sadakane, K.: Faster suffix sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/(NFCS-3140)/1–20/(1999), Department of Computer Science, Lund University, Sweden (May 1999)

  25. Alstrup, S., Gavoille, C., Kaplan, H., Rauhe, T.: Nearest Common Ancestors: A Survey and a New Distributed Algorithm. In: Proc. SPAA, pp. 258–264. ACM Press, New York (2002)

  26. Ferragina, P., Navarro, G.: The Pizza & Chili Corpus. Available at http://pizzachili.di.unipi.it, http://pizzachili.dcc.uchile.cl

  27. Università degli Studi di Milano, Laboratory for Web Algorithmics: URLs from the .eu domain. Available at http://law.dsi.unimi.it/index.php

  28. Maniscalco, M.A., Puglisi, S.J.: An efficient, versatile approach to suffix sorting. ACM Journal of Experimental Algorithmics (to appear), Available at http://www.michael-maniscalco.com/msufsort.htm

  29. Manzini, G., Ferragina, P.: Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1), 33–50 (2004), Available at http://www.mfn.unipmn.it/~manzini/lightweight

Editor information

Bin Ma Kaizhong Zhang

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferragina, P., Fischer, J. (2007). Suffix Arrays on Words. In: Ma, B., Zhang, K. (eds) Combinatorial Pattern Matching. CPM 2007. Lecture Notes in Computer Science, vol 4580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73437-6_33

  • DOI: https://doi.org/10.1007/978-3-540-73437-6_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73436-9

  • Online ISBN: 978-3-540-73437-6

  • eBook Packages: Computer Science, Computer Science (R0)
