Indexing Highly Repetitive Collections

  • Gonzalo Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7643)


The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.


Compressed Index Suffix Array Phrase Boundary Software Repository Grammar Tree 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Abeliuk, A., Navarro, G.: Compressed Suffix Trees for Repetitive Texts. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 30–41. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  2. 2.
    Bille, P., Landau, G., Raman, R., Sadakane, K., Rao Satti, S., Weimann, O.: Random access to grammar-compressed strings. In: Proc. 22nd SODA, pp. 373–389 (2011)Google Scholar
  3. 3.
    Chan, T., Larsen, K., Patrascu, M.: Orthogonal range searching on the RAM, revisited. In: Proc. 27th SoCG, pp. 1–10 (2011)Google Scholar
  4. 4.
    Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Rasala, A., Sahai, A., Shelat, A.: Approximating the smallest grammar: Kolmogorov complexity in natural models. In: Proc. 34th STOC, pp. 792–801 (2002)Google Scholar
  5. 5.
    Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inf. Theo. 51(7), 2554–2576 (2005)MathSciNetCrossRefGoogle Scholar
  6. 6.
    Claude, F., Fariña, A., Martínez-Prieto, M., Navarro, G.: Compressed q-gram indexing for highly repetitive biological sequences. In: Proc. 10th BIBE, pp. 86–91 (2010)Google Scholar
  7. 7.
    Claude, F., Fariña, A., Martínez-Prieto, M., Navarro, G.: Indexes for highly repetitive document collections. In: Proc. 20th CIKM, pp. 463–468 (2011)Google Scholar
  8. 8.
    Claude, F., Navarro, G.: Improved Grammar-Based Compressed Indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  9. 9.
    Do, H.-H., Jansson, J., Sadakane, K., Sung, W.-K.: Fast relative Lempel-Ziv self-index for similar sequences. In: Proc. FAW-AAIM, pp. 291–302 (2012)Google Scholar
  10. 10.
    Fischer, J., Mäkinen, V., Navarro, G.: Faster entropy-bounded compressed suffix trees. Theor. Comp. Sci. 410(51), 5354–5364 (2009)zbMATHCrossRefGoogle Scholar
  11. 11.
    Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A Faster Grammar-Based Self-index. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 240–251. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  12. 12.
    Grossi, R., Gupta, A., Vitter, J.: High-order entropy-compressed text indexes. In: Proc. 14th SODA, pp. 841–850 (2003)Google Scholar
  13. 13.
    Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comp. 35(2), 378–407 (2006)MathSciNetCrossRefGoogle Scholar
  14. 14.
    Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M.: Indexing Similar DNA Sequences. In: Chen, B. (ed.) AAIM 2010. LNCS, vol. 6124, pp. 180–190. Springer, Heidelberg (2010)CrossRefGoogle Scholar
  15. 15.
    Kärkkäinen, J.: Repetition-Based Text Indexing. PhD thesis, Dept of Comp. Sci., Univ. of Helsinki, Finland (1999)Google Scholar
  16. 16.
    Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comp. Sci. (to appear, 2012); Earlier versions in Proc. DCC 2010 and Proc. CPM 2011Google Scholar
  17. 17.
    Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comp. Biol. 17(3), 281–308 (2010)CrossRefGoogle Scholar
  18. 18.
    Manber, U., Myers, E.: Suffix arrays: a new method for on-line string searches. SIAM J. Comp., 935–948 (1993)Google Scholar
  19. 19.
    Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)MathSciNetCrossRefGoogle Scholar
  20. 20.
    Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-Index: A Compressed Index Based on Edit-Sensitive Parsing. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds.) SPIRE 2011. LNCS, vol. 7024, pp. 398–409. Springer, Heidelberg (2011)CrossRefGoogle Scholar
  21. 21.
    Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comp. Surv. 39(1), article 2 (2007)Google Scholar
  22. 22.
    Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theo. Comp. Sci. 302(1-3), 211–222 (2003)MathSciNetzbMATHCrossRefGoogle Scholar
  23. 23.
    Sadakane, K.: New text indexing functionalities of the compressed suffix arrays. J. Alg. 48(2), 294–313 (2003)MathSciNetzbMATHCrossRefGoogle Scholar
  24. 24.
    Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theo. 23(3), 337–343 (1977)MathSciNetzbMATHCrossRefGoogle Scholar
  25. 25.
    Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theo. 24(5), 530–536 (1978)MathSciNetzbMATHCrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Gonzalo Navarro
    • 1
  1. 1.Dept. of Computer ScienceUniversity of ChileChile

Personalised recommendations