Abstract
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs (L-CDAWGs), which can be represented with \(O(\tilde{e}_T \log n)\) bits of space allowing for \(O(\log n)\)-time random and \(O(1)\)-time sequential accesses to edge labels, and \(O(m \log \sigma + occ)\)-time pattern matching. Here, \(\tilde{e}_T\) is the number of all extensions of maximal repeats in \(T\), \(n\) and \(m\) are respectively the lengths of the text \(T\) and a given pattern, \(\sigma\) is the alphabet size, and \(occ\) is the number of occurrences of the pattern in \(T\). The repetitiveness measure \(\tilde{e}_T\) is known to be much smaller than the text length \(n\) for highly repetitive text. For constant alphabets, our L-CDAWGs achieve \(O(m + occ)\) pattern matching time with \(O(e_T^r \log n)\) bits of space, which improves the pattern matching time of Belazzougui et al.’s run-length BWT-CDAWGs by a factor of \(\log \log n\), with the same space complexity. Here, \(e_T^r\) is the number of right extensions of maximal repeats in \(T\). As a byproduct, our result gives a way of constructing a straight-line program (SLP) of size \(O(\tilde{e}_T)\) for a given text \(T\) in \(O(n + \tilde{e}_T \log \sigma)\) time.
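For readers unfamiliar with straight-line programs, the following minimal Python sketch illustrates the general idea of an SLP and of random access into the text it derives. It is not the construction of this paper: the grammar `RULES` and the helper names `expansion_lengths`, `access`, and `expand` are hypothetical, chosen only for illustration, and the access routine runs in time proportional to the grammar height rather than the \(O(\log n)\) bound stated above.

```python
# Toy straight-line program (SLP): every rule derives exactly one string.
# A rule is either a single character or a pair of other rule names.
# This grammar is a made-up example, not one produced by the paper's algorithm.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # derives "ab"
    "D": ("C", "C"),  # derives "abab"
    "E": ("D", "C"),  # derives "ababab"
    "F": ("E", "D"),  # derives "ababababab"
}


def expansion_lengths(rules):
    """Return a dict mapping each rule name to the length of its expansion."""
    lengths = {}

    def length(name):
        if name not in lengths:
            rhs = rules[name]
            if isinstance(rhs, str):
                lengths[name] = 1
            else:
                lengths[name] = length(rhs[0]) + length(rhs[1])
        return lengths[name]

    for name in rules:
        length(name)
    return lengths


def access(rules, lengths, start, i):
    """Return character i (0-based) of the expansion of `start`
    by walking down the derivation tree without expanding the text."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    name = start
    while not isinstance(rules[name], str):
        left, right = rules[name]
        if i < lengths[left]:
            name = left
        else:
            i -= lengths[left]
            name = right
    return rules[name]


def expand(rules, start):
    """Fully expand a rule; used here only to check `access`."""
    rhs = rules[start]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


if __name__ == "__main__":
    lengths = expansion_lengths(RULES)
    text = expand(RULES, "F")
    print(text, lengths["F"])
    print("".join(access(RULES, lengths, "F", i) for i in range(lengths["F"])))
```

In this toy grammar six rules derive a text of length ten; on highly repetitive inputs the gap between grammar size and text length can be far larger, which is the regime the abstract targets.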
References
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). doi:10.1007/978-3-319-19929-0_3
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)
Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM (JACM) 34(3), 578–595 (1987)
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundamenta Informaticae 111(3), 313–337 (2011)
Crochemore, M., Epifanio, C., Grossi, R., Mignosi, F.: Linear-size suffix tries. Theor. Comput. Sci. 638, 171–178 (2016)
Crochemore, M., Rytter, W.: Jewels of Stringology: Text Algorithms. World Scientific, Singapore (2003)
Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) CPM 1997. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997). doi:10.1007/3-540-63220-4_55
Gasieniec, L., Kolpakov, R.M., Potapov, I., Sant, P.: Real-time traversal in grammar-based compressed files. In: Data Compression Conference, p. 458 (2005)
Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 841–850 (2003)
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
Narisawa, K., Inenaga, S., Bannai, H., Takeda, M.: Efficient computation of substring equivalence classes with suffix arrays. Algorithmica (2016)
Navarro, G.: A self-index on block trees. arXiv preprint arXiv:1606.06617 (2016)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. (CSUR) 39(1), 2 (2007)
Raffinot, M.: On maximal repeats in strings. Inf. Process. Lett. 80(3), 165–169 (2001)
Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). doi:10.1007/978-3-319-07959-2_29
Weiner, P.: Linear pattern-matching algorithms. In: IEEE Annual Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Takagi, T., Goto, K., Fujishige, Y., Inenaga, S., Arimura, H. (2017). Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression. In: Fici, G., Sciortino, M., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508. Springer, Cham. https://doi.org/10.1007/978-3-319-67428-5_26
DOI: https://doi.org/10.1007/978-3-319-67428-5_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67427-8
Online ISBN: 978-3-319-67428-5
eBook Packages: Computer Science, Computer Science (R0)