Improved Grammar-Based Compressed Indexes

Claude, Francisco; Navarro, Gonzalo

doi:10.1007/978-3-642-34109-0_19

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7608)

Abstract

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of T takes \(N\lg n\) bits of space. Our representation requires \(2N\lg n + N\lg u + \epsilon\, n\lg n + o(N\lg n)\) bits of space, for any 0 < ε ≤ 1. It can find the positions of the occ occurrences of a pattern of length m in T in \(O\left((m^2/\epsilon)\lg \left(\frac{\lg u}{\lg n}\right) + (m+occ)\lg n\right)\) time, and extract any substring of length ℓ of T in time \(O(\ell+h\lg(N/h))\), where h is the height of the grammar tree.
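
To make the parameters in the abstract concrete, the following Python sketch builds a small hypothetical grammar and computes n, N, u, and h for it. It is an illustration under stated assumptions only: the rule set, the function names, and the naive top-down `extract` are invented here and are not the authors' index, which relies on succinct data structures not shown. The helper `space_bits` evaluates just the explicit terms of the space bound and ignores the \(o(N\lg n)\) term.

```python
from functools import lru_cache
from math import log2

# Hypothetical toy grammar (an assumption for illustration):
# each nonterminal maps to its right-hand side; any symbol
# without a rule is treated as a terminal.
RULES = {
    "S": ["B", "B", "A"],
    "B": ["A", "A"],
    "A": ["a", "b"],
}
START = "S"


def is_terminal(sym):
    return sym not in RULES


@lru_cache(maxsize=None)
def expansion_length(sym):
    """Length of the string that sym expands to."""
    if is_terminal(sym):
        return 1
    return sum(expansion_length(s) for s in RULES[sym])


@lru_cache(maxsize=None)
def height(sym):
    """Height of the parse tree rooted at sym (terminals have height 0)."""
    if is_terminal(sym):
        return 0
    return 1 + max(height(s) for s in RULES[sym])


def extract(sym, start, length):
    """Return `length` characters of sym's expansion from 0-based `start`,
    descending only into children that overlap the requested range."""
    if length <= 0:
        return ""
    if is_terminal(sym):
        return sym if start == 0 else ""
    out, offset = [], 0
    for child in RULES[sym]:
        clen = expansion_length(child)
        lo = max(start, offset)
        hi = min(start + length, offset + clen)
        if lo < hi:
            out.append(extract(child, lo - offset, hi - lo))
        offset += clen
    return "".join(out)


def space_bits(N, n, u, eps):
    """Explicit terms of the space bound: 2N lg n + N lg u + eps * n lg n.
    The lower-order o(N lg n) term is deliberately left out."""
    return 2 * N * log2(n) + N * log2(u) + eps * n * log2(n)


# Grammar size N: sum of right-hand-side lengths.
N = sum(len(rhs) for rhs in RULES.values())
# n: number of distinct symbols, terminals and nonterminals together.
n = len(set(RULES) | {s for rhs in RULES.values() for s in rhs})
# u: length of the represented text; h: height of the grammar tree.
u = expansion_length(START)
h = height(START)

if __name__ == "__main__":
    print("n =", n, "N =", N, "u =", u, "h =", h)
    print("T =", extract(START, 0, u))
    print("explicit space terms (bits):", space_bits(N, n, u, 1.0))
```

On real inputs the grammar would come from a grammar-based compressor, and u would typically be far larger than N; this toy grammar is too small to show any compression benefit and serves only to pin down the notation.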

Author information

Authors and Affiliations

David R. Cheriton School of Computer Science, University of Waterloo, Canada
Francisco Claude
Department of Computer Science, University of Chile, Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Information Technologies Research Group, Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
Liliana Calderón-Benavides
Information Technologies and Research Group, Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
Cristina González-Caro
School of Physics and Mathematics, Universidad Michoacana, Edificio "B", Ciudad Universitaria, 58000, Morelia, Mexico
Edgar Chávez
Department of Computer Science, Universidade Federal de Minas Gerais, Av. Antonio Carlos 6627, Pampulha, 31270-010, Belo Horizonte, Brazil
Nivio Ziviani

About this paper

Cite this paper

Claude, F., Navarro, G. (2012). Improved Grammar-Based Compressed Indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds) String Processing and Information Retrieval. SPIRE 2012. Lecture Notes in Computer Science, vol 7608. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34109-0_19

DOI: https://doi.org/10.1007/978-3-642-34109-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34108-3
Online ISBN: 978-3-642-34109-0
eBook Packages: Computer Science, Computer Science (R0)
