Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 10508)

Abstract

In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with \(O(\tilde{e}_T \log n)\) bits of space, allowing for \(O(\log n)\)-time random and \(O(1)\)-time sequential accesses to edge labels, and \(O(m \log \sigma + occ)\)-time pattern matching. Here, \(\tilde{e}_T\) is the number of all extensions of maximal repeats in \(T\), \(n\) and \(m\) are respectively the lengths of the text \(T\) and a given pattern, \(\sigma\) is the alphabet size, and \(occ\) is the number of occurrences of the pattern in \(T\). The repetitiveness measure \(\tilde{e}_T\) is known to be much smaller than the text length \(n\) for highly repetitive text. For constant alphabets, our L-CDAWGs achieve \(O(m + occ)\) pattern matching time with \(O(e_T^r \log n)\) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of \(\log \log n\), with the same space complexity. Here, \(e_T^r\) is the number of right extensions of maximal repeats in \(T\). As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size \(O(\tilde{e}_T)\) for a given text \(T\) in \(O(n + \tilde{e}_T \log \sigma)\) time.
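For readers unfamiliar with straight-line programs, the following minimal sketch shows what an SLP is and how stored expansion lengths allow random access to the text it generates. It is not the construction from the paper: the rule layout, the function names (`expansion_lengths`, `access`, `expand`), and the example grammar are all assumptions made for illustration, and this simple descent takes time proportional to the grammar height rather than the bounds stated in the abstract.

```python
from typing import List, Tuple, Union

# Assumed layout: a rule is either one terminal character or a pair of
# indices pointing at earlier rules (Chomsky-normal-form style).
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string each rule expands to, computed bottom-up."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules: List[Rule], lengths: List[int], root: int, i: int) -> str:
    """Return character i of the expansion of `root` without decompressing."""
    if not 0 <= i < lengths[root]:
        raise IndexError("position out of range")
    node = root
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left          # position falls in the left child
        else:
            i -= lengths[left]   # skip the left child, continue on the right
            node = right
    return rules[node]


def expand(rules: List[Rule], root: int) -> str:
    """Fully decompress `root` (used here only to check `access`)."""
    out: List[str] = []
    stack = [root]
    while stack:
        rule = rules[stack.pop()]
        if isinstance(rule, str):
            out.append(rule)
        else:
            left, right = rule
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
    return "".join(out)


# Hypothetical example grammar: each rule concatenates the two before it.
RULES: List[Rule] = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
LENGTHS = expansion_lengths(RULES)
ROOT = len(RULES) - 1
```

Here six rules generate a text of length eight; on highly repetitive texts the gap between grammar size and text length can be far larger, which is the motivation for bounding SLP size by a repetitiveness measure.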

References

  1. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.1007/978-3-319-19929-0_3

  2. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)

  3. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM (JACM) 34(3), 578–595 (1987)

  4. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundamenta Informaticae 111(3), 313–337 (2011)

  5. Crochemore, M., Epifanio, C., Grossi, R., Mignosi, F.: Linear-size suffix tries. Theor. Comput. Sci. 638, 171–178 (2016)

  6. Crochemore, M., Rytter, W.: Jewels of Stringology: Text Algorithms. World Scientific, Singapore (2003)

  7. Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.1007/3-540-63220-4_55

  8. Gasieniec, L., Kolpakov, R.M., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Data Compression Conference, p. 458 (2005)

  9. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 841–850 (2003)

  10. Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)

  11. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)

  12. Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. Algorithmica (2016)

  13. Navarro, G.: A self-index on block trees. arXiv preprint arXiv:1606.06617 (2016)

  14. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. (CSUR) 39(1), 2 (2007)

  15. Raffinot, M.: On maximal repeats in strings. Inf. Process. Lett. 80(3), 165–169 (2001)

  16. Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_29

  17. Weiner, P.: Linear pattern-matching algorithms. In: IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)

Author information

Corresponding author

Correspondence to Takuya Takagi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Takagi, T., Goto, K., Fujishige, Y., Inenaga, S., Arimura, H. (2017). Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression. In: Fici, G., Sciortino, M., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508. Springer, Cham. https://doi.org/10.1007/978-3-319-67428-5_26

  • DOI: https://doi.org/10.1007/978-3-319-67428-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67427-8

  • Online ISBN: 978-3-319-67428-5

  • eBook Packages: Computer Science, Computer Science (R0)