On Context-Diverse Repeats and Their Incremental Computation

  • Matthias Gallé
  • Matías Tealdi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8370)

Abstract

The context in which a substring appears is an important notion for identifying, for example, its semantic meaning. However, existing classes of repeats fail to take this into account directly. We present here xkcd-repeats, a new family of repeats characterized by the number of different symbols to the left and right of their occurrences. Maximal and super-maximal repeats are special, extreme cases of this family.
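To make the notion of left and right context diversity concrete, here is a minimal brute-force sketch. It is not the paper's algorithm: the function names, the thresholds `x` and `y`, and the convention that a text boundary counts as its own context symbol are all assumptions made for illustration, since the abstract does not give the formal definition. It enumerates every substring, so it is only suitable for tiny inputs.

```python
from collections import defaultdict


def context_diversity(text):
    """Map each repeated substring to (#distinct left symbols, #distinct right symbols).

    Assumption: a text boundary counts as one extra context symbol (None).
    """
    n = len(text)
    left = defaultdict(set)
    right = defaultdict(set)
    count = defaultdict(int)
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = text[i:j]
            count[w] += 1
            left[w].add(text[i - 1] if i > 0 else None)
            right[w].add(text[j] if j < n else None)
    # Keep only substrings that occur at least twice, i.e. repeats.
    return {w: (len(left[w]), len(right[w])) for w, c in count.items() if c >= 2}


def diverse_repeats(text, x, y):
    """Repeats with at least x distinct left and y distinct right context symbols.

    The thresholds x and y are hypothetical parameters for this sketch.
    """
    return sorted(
        w for w, (l, r) in context_diversity(text).items() if l >= x and r >= y
    )


if __name__ == "__main__":
    # With x = y = 2 this recovers the usual notion of a maximal repeat.
    print(diverse_repeats("banana", 2, 2))
```

With `x = y = 2`, a repeat qualifies exactly when it cannot be extended to the left or to the right without losing an occurrence, which is the usual characterization of a maximal repeat; larger thresholds demand more varied contexts.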

We give a necessary and sufficient condition for bounding their number linearly in the size of the sequence, and present an optimal algorithm that, given a suffix array, computes them in linear time independent of the alphabet size, as well as two other algorithms that are faster in practice.

Additionally, we provide an independent and general framework for computing these (and other) repeats incrementally, which extends the application space of repeats to streaming settings.

Keywords

Maximal Repeat; Linear Algorithm; Alphabet Size; Suffix Array; Incremental Computation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Matthias Gallé (1)
  • Matías Tealdi (1)
  1. Xerox Research Centre Europe, Grenoble, France
