Fast Compressed Self-indexes with Deterministic Linear-Time Construction

Abstract

We introduce a compressed suffix array representation that, on a text T of length n over an alphabet of size \(\sigma \), can be built in O(n) deterministic time, within \(O(n\log \sigma )\) bits of working space, and counts the number of occurrences of any pattern P in T in time \(O(|P| + \log \log _w \sigma )\) on a RAM machine of \(w=\Omega (\log n)\)-bit words. This time is almost optimal for large alphabets (\(\log \sigma =\Theta (\log n)\)), and it outperforms all the other compressed indexes that can be built in linear deterministic time, as well as some others. The only faster indexes can be built in linear time only in expectation, or require \(\Theta (n\log n)\) bits. For smaller alphabets, where \(\log \sigma = o(\log n)\), we show how, by using space proportional to a compressed representation of the text, we can build in linear time an index that counts in time \(O(|P|/\log _\sigma n + \log _\sigma ^\epsilon n)\) for any constant \(\epsilon >0\). This is almost RAM-optimal in the typical case where \(w=\Theta (\log n)\).
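To make the counting query concrete, the following is a minimal sketch of what "counting the occurrences of P in T" means on a plain, uncompressed suffix array. It is not the compressed representation introduced here: the construction below sorts suffixes naively and is far from linear time, and the query costs O(|P| log n) character comparisons rather than the bounds stated above. The function names `build_suffix_array` and `count_occurrences` are illustrative assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by comparison sorting; illustrative only, not linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of pattern in text with two binary searches over sa.

    All suffixes that start with pattern form one contiguous range of sa,
    so the answer is the width of that range.
    """
    m = len(pattern)

    # First position whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


text = "abracadabra"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "abra"))  # number of suffixes starting with "abra"
```

A compressed suffix array answers the same query while replacing both `text` and `sa` with a structure whose size is close to that of the compressed text.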

Notes

  1. In fact it is \(nH_k({{\overline{T}}})\), but this is \(nH_k(T)+O(\log n)\) [12, Thm. A.3].

  2. In the rest of the paper we wrote \({{\overline{B}}}[l_u..r_u]\) instead of \({{\overline{B}}}[l_{{\overline{u}}}..r_{{\overline{u}}}]\) for simplicity, but this may cause confusion in this section.

  3. A table of size \(O(\sqrt{n})\) tells us the first symbol where any two chunks of \((\log _\sigma n)/2\) symbols differ. This is used to find the length of the shared prefix between the first chunks where P and T differ.
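One way the lookup table of note 3 could be realised is sketched below. It rests on assumptions that the note itself does not state: each symbol is packed into a fixed number of bits, the two chunks are combined with XOR, and the table is indexed by that XOR value, so that it has one entry per possible chunk value (about \(\sqrt{n}\) entries when a chunk holds \((\log _\sigma n)/2\) symbols). The names `pack`, `build_first_diff_table`, and `shared_prefix_len` are hypothetical.

```python
def pack(chunk, bits):
    """Pack a sequence of integer symbols into one integer, first symbol most significant."""
    value = 0
    for symbol in chunk:
        value = (value << bits) | symbol
    return value


def build_first_diff_table(c, bits):
    """Return a table where table[x] is the index of the first nonzero symbol field of x.

    x is read as c fields of `bits` bits each; table[0] is c, meaning "no difference".
    """
    table = [c] * (1 << (c * bits))
    mask = (1 << bits) - 1
    for x in range(1, len(table)):
        for i in range(c):
            shift = (c - 1 - i) * bits
            if (x >> shift) & mask:
                table[x] = i
                break
    return table


def shared_prefix_len(a, b, table):
    """Length of the common prefix, in symbols, of two packed chunks (assumed XOR indexing)."""
    return table[a ^ b]


# Chunks of c = 3 symbols over an alphabet of size 4 (2 bits per symbol).
table = build_first_diff_table(3, 2)
x = pack([1, 2, 3], 2)
y = pack([1, 0, 3], 2)
print(shared_prefix_len(x, y, table))  # the chunks agree on their first symbol only
```

With such a table, comparing two chunks takes one XOR and one lookup instead of a symbol-by-symbol scan.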

References

  1. Barbay, J., Claude, F., Gagie, T., Navarro, G., Nekrich, Y.: Efficient fully-compressed sequence representations. Algorithmica 69(1), 232–268 (2014)

  2. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Fast prefix search in little space, with applications. In: Proceedings of 18th Annual European Symposium on Algorithms (ESA), LNCS 6346, pp. 427–438 (2010)

  3. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows–Wheeler transform. In: Proceedings of 21st Annual European Symposium on Algorithms (ESA), pp. 133–144 (2013)

  4. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Linear-time string indexing and analysis in small space. CoRR, arXiv:1609.06378 (2016)

  5. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), Article 23 (2014)

  6. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), Article 31 (2015)

  7. Bille, P., Gørtz, I.L., Skjoldjensen, F.R.: Deterministic indexing for packed strings. In: Proceedings of 28th Annual Symposium on Combinatorial Pattern Matching (CPM), LIPIcs 78, Article 6 (2017)

  8. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)

  9. Clark, D.R.: Compact PAT Trees. PhD thesis, University of Waterloo, Canada (1996)

  10. Cole, R., Kopelowitz, T., Lewenstein, M.: Suffix trays and suffix trists: structures for faster text indexing. Algorithmica 72(2), 450–466 (2015)

  11. Farach, M.: Optimal suffix tree construction with large alphabets. In: Proceedings of 38th Annual Symposium on Foundations of Computer Science (FOCS), pp. 137–143 (1997)

  12. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  13. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2), Article 20 (2007)

  14. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 371(1), 115–121 (2007)

  15. Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of 26th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 9133, pp. 160–171 (2015)

  16. Fischer, J., Heun, V.: Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)

  17. Gagie, T.: Large alphabets and incompressibility. Inf. Process. Lett. 99(6), 246–251 (2006)

  18. Golynski, A., Munro, J.I., Rao, S.S.: Rank/select operations on large alphabets: a tool for text indexing. In: Proceedings of 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 368–373 (2006)

  19. Grossi, R., Orlandi, A., Raman, R., Rao, S.S.: More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In: Proceedings of 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pp. 517–528 (2009)

  20. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

  21. Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. Algorithms 41(1), 69–85 (2001)

  22. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006)

  23. Kim, D.K., Sim, J.S., Park, H., Park, K.: Constructing suffix arrays in linear time. J. Discrete Algorithms 3(2–4), 126–142 (2005)

  24. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discrete Algorithms 3(2–4), 143–156 (2005)

  25. Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 95–106 (2007)

  26. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  27. Manzini, G.: An analysis of the Burrows–Wheeler transform. J. ACM 48(3), 407–430 (2001)

  28. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

  29. Munro, I., Navarro, G., Nekrich, Y.: Fast compressed self-indexes with deterministic linear-time construction. In: Proceedings of 28th Annual International Symposium on Algorithms and Computation (ISAAC), LIPIcs 92, Article 57 (2017)

  30. Munro, J.I.: Tables. In: Proceedings of 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), LNCS 1180, pp. 37–42 (1996)

  31. Munro, J.I., Navarro, G., Nekrich, Y.: Space-efficient construction of compressed indexes in deterministic linear time. In: Proceedings of 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 408–424 (2017)

  32. Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Succinct representations of permutations and functions. Theor. Comput. Sci. 438, 74–88 (2012)

  33. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), Article 2 (2007)

  34. Navarro, G., Nekrich, Y.: Time-optimal top-\(k\) document retrieval. SIAM J. Comput. 46(1), 89–113 (2017)

  35. Navarro, G., Sadakane, K.: Fully-functional static and dynamic succinct trees. ACM Trans. Algorithms 10(3), Article 16 (2014)

  36. Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2), 294–313 (2003)

  37. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)

  38. Sleator, D.D., Tarjan, R.E.: A data structure for dynamic trees. J. Comput. Syst. Sci. 26(3), 362–391 (1983)

  39. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

  40. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of 14th Annual Symposium on Switching and Automata Theory (FOCS), pp. 1–11 (1973)

Author information

Correspondence to Gonzalo Navarro.

Additional information

Funded with Basal Funds FB0001 and Fondecyt Grant 1-170048, Conicyt, Chile. A conference version of this paper appeared in Proc. ISAAC 2017 [29].

About this article

Cite this article

Munro, J.I., Navarro, G. & Nekrich, Y. Fast Compressed Self-indexes with Deterministic Linear-Time Construction. Algorithmica 82, 316–337 (2020). https://doi.org/10.1007/s00453-019-00637-x

Keywords

  • Succinct data structures
  • Self-indexes
  • Suffix arrays
  • Deterministic construction

Mathematics Subject Classification

  • E.1
  • E.4