
Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression

  • Takuya Takagi
  • Keisuke Goto
  • Yuta Fujishige
  • Shunsuke Inenaga
  • Hiroki Arimura
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10508)

Abstract

In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with \(O(\tilde{e}_T \log n)\) bits of space allowing for \(O(\log n)\)-time random and \(O(1)\)-time sequential accesses to edge labels, and \(O(m \log \sigma + occ)\)-time pattern matching. Here, \(\tilde{e}_T\) is the number of all extensions of maximal repeats in \(T\), \(n\) and \(m\) are respectively the lengths of the text \(T\) and a given pattern, \(\sigma\) is the alphabet size, and \(occ\) is the number of occurrences of the pattern in \(T\). The repetitiveness measure \(\tilde{e}_T\) is known to be much smaller than the text length \(n\) for highly repetitive text. For constant alphabets, our L-CDAWGs achieve \(O(m + occ)\) pattern matching time with \(O(e_T^r \log n)\) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of \(\log \log n\), with the same space complexity. Here, \(e_T^r\) is the number of right extensions of maximal repeats in \(T\). As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size \(O(\tilde{e}_T)\) for a given text \(T\) in \(O(n + \tilde{e}_T \log \sigma)\) time.
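To make the SLP mentioned above concrete, here is a minimal sketch of a straight-line program and of random access into the string it derives. It is an illustration only, not the construction from the paper: the rule encoding, the names `expansion_lengths`, `access`, `expand`, and `EXAMPLE_RULES` are assumptions made for this example, and the access routine walks the derivation tree in time proportional to the grammar's height rather than the \(O(\log n)\) bound quoted in the abstract.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: rule k is either a single terminal character
# or a pair (i, j) of earlier rule indices with i, j < k.
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string derived by each rule, computed bottom-up."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules: List[Rule], lengths: List[int], var: int, i: int) -> str:
    """Return character i (0-based) of the expansion of rule `var`
    without decompressing; cost is proportional to the grammar height."""
    if not 0 <= i < lengths[var]:
        raise IndexError("position out of range")
    while not isinstance(rules[var], str):
        left, right = rules[var]
        if i < lengths[left]:
            var = left            # target lies in the left child
        else:
            i -= lengths[left]    # skip the left child's expansion
            var = right
    return rules[var]


def expand(rules: List[Rule], var: int) -> str:
    """Fully decompress rule `var` (used here only to check `access`)."""
    out: List[str] = []
    stack = [var]
    while stack:
        rule = rules[stack.pop()]
        if isinstance(rule, str):
            out.append(rule)
        else:
            left, right = rule
            stack.append(right)   # push right first so left is expanded first
            stack.append(left)
    return "".join(out)


# Five rules deriving the repetitive text "abababab" of length 8.
EXAMPLE_RULES: List[Rule] = ["a", "b", (0, 1), (2, 2), (3, 3)]
```

In this toy grammar each doubling rule reuses the previous one, so the number of rules grows logarithmically in the text length; this is the kind of saving on repetitive text that grammar-based compression aims for.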

References

  1. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi: 10.1007/978-3-319-19929-0_3
  2. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)
  3. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM (JACM) 34(3), 578–595 (1987)
  4. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundamenta Informaticae 111(3), 313–337 (2011)
  5. Crochemore, M., Epifanio, C., Grossi, R., Mignosi, F.: Linear-size suffix tries. Theor. Comput. Sci. 638, 171–178 (2016)
  6. Crochemore, M., Rytter, W.: Jewels of Stringology: Text Algorithms. World Scientific, Singapore (2003)
  7. Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi: 10.1007/3-540-63220-4_55
  8. Gasieniec, L., Kolpakov, R.M., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Data Compression Conference, p. 458 (2005)
  9. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 841–850 (2003)
  10. Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)
  11. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
  12. Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. Algorithmica (2016)
  13. Navarro, G.: A self-index on block trees. arXiv preprint arXiv:1606.06617 (2016)
  14. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. (CSUR) 39(1), 2 (2007)
  15. Raffinot, M.: On maximal repeats in strings. Inf. Process. Lett. 80(3), 165–169 (2001)
  16. Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). doi: 10.1007/978-3-319-07959-2_29
  17. Weiner, P.: Linear pattern-matching algorithms. In: IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Takuya Takagi (1)
  • Keisuke Goto (2)
  • Yuta Fujishige (3)
  • Shunsuke Inenaga (3)
  • Hiroki Arimura (1)

  1. Graduate School of IST, Hokkaido University, Sapporo, Japan
  2. Fujitsu Laboratories Ltd., Kawasaki, Japan
  3. Department of Informatics, Kyushu University, Fukuoka, Japan
