Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression

  • Conference paper
String Processing and Information Retrieval (SPIRE 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10508)

Abstract

In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called the Linear-size CDAWG (L-CDAWG), which can be represented with \(O(\tilde{e}_T \log n)\) bits of space, allowing for \(O(\log n)\)-time random access and \(O(1)\)-time sequential access to edge labels, and \(O(m \log \sigma + occ)\)-time pattern matching. Here, \(\tilde{e}_T\) is the number of all extensions of maximal repeats in \(T\), \(n\) and \(m\) are the lengths of the text \(T\) and of a given pattern, respectively, \(\sigma\) is the alphabet size, and \(occ\) is the number of occurrences of the pattern in \(T\). The repetitiveness measure \(\tilde{e}_T\) is known to be much smaller than the text length \(n\) for highly repetitive text. For constant alphabets, our L-CDAWGs achieve \(O(m + occ)\) pattern matching time with \(O(e_T^r \log n)\) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of \(\log \log n\), with the same space complexity. Here, \(e_T^r\) is the number of right extensions of maximal repeats in \(T\). As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size \(O(\tilde{e}_T)\) for a given text \(T\) in \(O(n + \tilde{e}_T \log \sigma)\) time.
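To make the grammar-compression side of the abstract concrete, the following is a minimal sketch of a straight-line program and of random access to the text it derives. It is not the construction from the paper: the rule encoding (a list in which each entry is either a single character or a pair of earlier rule indices) and the names `rule_lengths`, `access`, and `expand` are assumptions made for illustration. The descent shown here takes time proportional to the height of the grammar, which is a simpler and generally weaker guarantee than the \(O(\log n)\) random access stated above.

```python
def rule_lengths(rules):
    """Length of the string derived by each rule, computed bottom-up."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):        # terminal rule: X -> c
            lengths.append(1)
        else:                            # binary rule: X -> Y Z (earlier indices)
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, i):
    """Return character i of the derived text without expanding the grammar."""
    v = len(rules) - 1                   # assumption: the last rule is the start symbol
    if not 0 <= i < lengths[v]:
        raise IndexError(i)
    while not isinstance(rules[v], str):
        left, right = rules[v]
        if i < lengths[left]:
            v = left                     # position falls in the left child
        else:
            i -= lengths[left]           # skip the left child and go right
            v = right
    return rules[v]


def expand(rules):
    """Fully decompress the grammar (only used here to check access)."""
    derived = []
    for rule in rules:
        if isinstance(rule, str):
            derived.append(rule)
        else:
            derived.append(derived[rule[0]] + derived[rule[1]])
    return derived[-1]


# Six rules derive a text of length 9: X2 = ab, X3 = abab, X4 = abababab, X5 = X4 a.
rules = ["a", "b", (0, 1), (2, 2), (3, 3), (4, 0)]
lengths = rule_lengths(rules)
print(expand(rules))                     # ababababa
print(access(rules, lengths, 8))         # a
```

Storing the derived length with every rule is what allows `access` to choose a child at each step; a repetitive text can then be queried through a grammar much smaller than the text itself.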

References

  1. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.1007/978-3-319-19929-0_3

  2. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)

  3. Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM (JACM) 34(3), 578–595 (1987)

  4. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundamenta Informaticae 111(3), 313–337 (2011)

  5. Crochemore, M., Epifanio, C., Grossi, R., Mignosi, F.: Linear-size suffix tries. Theor. Comput. Sci. 638, 171–178 (2016)

  6. Crochemore, M., Rytter, W.: Jewels of Stringology: Text Algorithms. World Scientific, Singapore (2003)

  7. Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.1007/3-540-63220-4_55

  8. Gasieniec, L., Kolpakov, R.M., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Data Compression Conference, p. 458 (2005)

  9. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 841–850 (2003)

  10. Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)

  11. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)

  12. Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. Algorithmica (2016)

  13. Navarro, G.: A self-index on block trees. arXiv preprint arXiv:1606.06617 (2016)

  14. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. (CSUR) 39(1), 2 (2007)

  15. Raffinot, M.: On maximal repeats in strings. Inf. Process. Lett. 80(3), 165–169 (2001)

  16. Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_29

  17. Weiner, P.: Linear pattern-matching algorithms. In: IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)

Author information

Corresponding author

Correspondence to Takuya Takagi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Takagi, T., Goto, K., Fujishige, Y., Inenaga, S., Arimura, H. (2017). Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression. In: Fici, G., Sciortino, M., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508. Springer, Cham. https://doi.org/10.1007/978-3-319-67428-5_26

  • DOI: https://doi.org/10.1007/978-3-319-67428-5_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67427-8

  • Online ISBN: 978-3-319-67428-5

  • eBook Packages: Computer Science, Computer Science (R0)
