Fast Compressed Self-indexes with Deterministic Linear-Time Construction

Algorithmica

Abstract

We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size \(\sigma \), can be built in O(n) deterministic time, within \(O(n\log \sigma )\) bits of working space, and counts the number of occurrences of any pattern P in T in time \(O(|P| + \log \log _w \sigma )\) on a RAM machine of \(w=\Omega (\log n)\)-bit words. This time is almost optimal for large alphabets (\(\log \sigma =\Theta (\log n)\)), and it outperforms all the other compressed indexes that can be built in linear deterministic time, as well as some others. The only faster indexes can be built in linear time only in expectation, or require \(\Theta (n\log n)\) bits. For smaller alphabets, where \(\log \sigma = o(\log n)\), we show how, by using space proportional to a compressed representation of the text, we can build in linear time an index that counts in time \(O(|P|/\log _\sigma n + \log _\sigma ^\epsilon n)\) for any constant \(\epsilon >0\). This is almost RAM-optimal in the typical case where \(w=\Theta (\log n)\).
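To make the counting query concrete, here is a textbook backward search over the Burrows–Wheeler transform [8, 12]. This is a minimal sketch and not the representation introduced in this paper: it is uncompressed, it builds the suffix array by naive sorting rather than in linear time, and it stores full occurrence tables. The names `build_index` and `count`, and the use of `"\0"` as a sentinel, are assumptions made for illustration only.

```python
def build_index(text):
    """Build a naive, uncompressed backward-search index (illustration only)."""
    text += "\0"  # assumed sentinel: absent from text and smaller than every symbol
    n = len(text)
    # Naive suffix array: sort suffix start positions by the suffixes themselves.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the symbol preceding each suffix (text[-1] is the sentinel when i == 0).
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # C[c] = number of symbols in text that are smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, b in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if b == c else 0)
    return C, occ, n


def count(index, pattern):
    """Return the number of occurrences of pattern, one step per pattern symbol."""
    C, occ, n = index
    lo, hi = 0, n  # half-open suffix array range matching the empty suffix of pattern
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("abracadabra")
print(count(index, "abra"))
```

The loop in `count` performs one step per pattern symbol; how fast each step runs depends entirely on how the occurrence counts are represented, which this sketch does not attempt to optimize.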

Notes

  1. In fact it is \(nH_k({{\overline{T}}})\), but this is \(nH_k(T)+O(\log n)\) [12, Thm. A.3].

  2. In the rest of the paper we wrote \({{\overline{B}}}[l_u..r_u]\) instead of \({{\overline{B}}}[l_{{\overline{u}}}..r_{{\overline{u}}}]\) for simplicity, but this may cause confusion in this section.

  3. A table of size \(O(\sqrt{n})\) tells us the first symbol where any two chunks of \((\log _\sigma n)/2\) symbols differ. This is used to find the length of the shared prefix between the first chunks where P and T differ.
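One way to realize a table like the one in Note 3 is sketched below. It rests on assumptions that the note does not state: each chunk is packed into an integer, and the table is indexed by the XOR of the two chunks. With chunks of \((\log _\sigma n)/2\) symbols, that XOR has \((\log n)/2\) bits, so the table has \(\sqrt{n}\) entries. All function names are hypothetical.

```python
def build_first_diff_table(chunk_len, bits_per_symbol):
    """table[x] = index of the leftmost symbol slot of x holding a nonzero value,
    or chunk_len if x == 0 (that is, the two XORed chunks were equal)."""
    mask = (1 << bits_per_symbol) - 1
    table = []
    for x in range(1 << (chunk_len * bits_per_symbol)):
        pos = chunk_len
        for i in range(chunk_len):
            shift = (chunk_len - 1 - i) * bits_per_symbol
            if (x >> shift) & mask:
                pos = i
                break
        table.append(pos)
    return table


def pack(symbols, bits_per_symbol):
    """Pack a list of small integer symbols into one integer, first symbol highest."""
    value = 0
    for s in symbols:
        value = (value << bits_per_symbol) | s
    return value


def first_difference(a, b, table):
    """Index of the first symbol where packed chunks a and b differ."""
    return table[a ^ b]


# Example with sigma = 4 (2 bits per symbol) and chunks of 4 symbols:
# this corresponds to n = 2**16, and the table has 2**8 = sqrt(n) entries.
table = build_first_diff_table(chunk_len=4, bits_per_symbol=2)
a = pack([0, 1, 2, 3], 2)
b = pack([0, 1, 3, 3], 2)
print(first_difference(a, b, table))
```

The returned index is the length of the prefix shared by the two chunks, which is the quantity the note uses.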

References

  1. Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)

  2. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: Proceedings of 18th Annual European Symposium on Algorithms (ESA), LNCS 6346, pp. 427–438 (2010)

  3. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows–Wheeler transform. In: Proceedings of 21st Annual European Symposium on Algorithms (ESA), pp. 133–144 (2013)

  4. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Linear-time string indexing and analysis in small space. CoRR, arXiv:1609.06378 (2016)

  5. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), Article 23 (2014)

  6. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), Article 31 (2015)

  7. Bille, P., Gørtz, I.L., Skjoldjensen, F.R.: Deterministic indexing for packed strings. In: Proceedings of 28th Annual Symposium on Combinatorial Pattern Matching (CPM), LIPIcs 78, Article 6 (2017)

  8. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  9. Clark, D.R.: Compact PAT Trees. PhD thesis, University of Waterloo, Canada (1996)

  10. Cole, R., Kopelowitz, T., Lewenstein, M.: Suffix trays and suffix trists: structures for faster text indexing. Algorithmica 72(2), 450–466 (2015)

  11. Farach, M.: Optimal suffix tree construction with large alphabets. In: Proceedings of 38th Annual Symposium on Foundations of Computer Science (FOCS), pp. 137–143 (1997)

  12. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  13. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Article 20 (2007)

  14. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 371(1), 115–121 (2007)

  15. Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of 26th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 9133, pp. 160–171 (2015)

  16. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)

  17. Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006)

  18. Golynski, A., Munro, J.I., Rao, S.S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proceedings of 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 368–373 (2006)

  19. Grossi, R., Orlandi, A., Raman, R., Rao, S.S.: More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In: Proceedings of 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)

  20. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

  21. Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. Algorithms 41(1), 69–85 (2001)

  22. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006)

  23. Kim, D.K., Sim, J.S., Park, H., Park, K.: Constructing suffix arrays in linear time. J. Discrete Algorithms 3(2–4), 126–142 (2005)

  24. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discrete Algorithms 3(2–4), 143–156 (2005)

  25. Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 95–106 (2007)

  26. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  27. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

  28. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

  29. Munro, J.I., Navarro, G., Nekrich, Y.: Fast compressed self-indexes with deterministic linear-time construction. In: Proceedings of 28th Annual International Symposium on Algorithms and Computation (ISAAC), LIPIcs 92, Article 57 (2017)

  30. Munro, J.I.: Tables. In: Proceedings of 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), LNCS 1180, pp. 37–42 (1996)

  31. Munro, J.I., Navarro, G., Nekrich, Y.: Space-efficient construction of compressed indexes in deterministic linear time. In: Proceedings of 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 408–424 (2017)

  32. Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74–88 (2012)

  33. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), Article 2 (2007)

  34. Navarro, G., Nekrich, Y.: Time-optimal top-\(k\) document retrieval. SIAM J. Comput. 46(1), 89–113 (2017)

  35. Navarro, G., Sadakane, K.: Fully-functional static and dynamic succinct trees. ACM Trans. Algorithms 10(3), Article 16 (2014)

  36. Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)

  37. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)

  38. Sleator, D.D., Tarjan, R.E.: A data structure for dynamic trees. J. Comput. Syst. Sci. 26(3), 362–391 (1983)

  39. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

  40. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of 14th Annual Symposium on Switching and Automata Theory (FOCS), pp. 1–11 (1973)

Author information

Corresponding author

Correspondence to Gonzalo Navarro.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Funded with Basal Funds FB0001 and Fondecyt Grant 1-170048, Conicyt, Chile. A conference version of this paper appeared in Proc. ISAAC 2017 [29].

Cite this article

Munro, J.I., Navarro, G. & Nekrich, Y. Fast Compressed Self-indexes with Deterministic Linear-Time Construction. Algorithmica 82, 316–337 (2020). https://doi.org/10.1007/s00453-019-00637-x

  • DOI: https://doi.org/10.1007/s00453-019-00637-x
