On Context-Diverse Repeats and Their Incremental Computation

Gallé, Matthias; Tealdi, Matías

doi:10.1007/978-3-319-04921-2_31

Matthias Gallé¹⁹ &
Matías Tealdi¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8370))

Included in the following conference series:

International Conference on Language and Automata Theory and Applications

1070 Accesses
1 Citations

Abstract

The context in which a substring appears is an important notion to identify – for example – its semantic meaning. However, existing classes of repeats fail to take this into account directly. We present here xkcd-repeats, a new family of repeats characterized by the number of different symbols at the left and right of their occurrences. These repeats include as special extreme cases maximal and super-maximal repeats.

We give sufficient and necessary condition to bound their number linearly in the size of the sequence, and show an optimal algorithm that computes them in linear time – given a suffix array –, independent on the size of the alphabet, as well as two other algorithms that are faster in practice.

Additionally, we provide an independent and general framework that allows to compute these (and other) repeats incrementally; extending the application space of repeats in a streaming framework.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 53–86 (2004)
Article MATH MathSciNet Google Scholar
Apostolico, A.: Of maps bigger than the empire (invited paper). In: SPIRE, pp. 2–9 (2001)
Google Scholar
Bose, P., He, M., Maheshwari, A., Morin, P.: Succinct Orthogonal Range Search Structures on a Grid with Applications to Text Indexing. In: Dehne, F., Gavrilova, M., Sack, J.-R., Tóth, C.D. (eds.) WADS 2009. LNCS, vol. 5664, pp. 98–109. Springer, Heidelberg (2009)
Chapter Google Scholar
Carrascosa, R., Coste, F., Gallé, M., Infante-Lopez, G.: The smallest grammar problem as constituents choice and minimal grammar parsing. MDPI Algorithms 4(4), 262–284 (2011)
Article Google Scholar
Clark, A.: Learning deterministic context free grammars: The Omphalos competition. Machine Learning, 93–110 (January 2007)
Google Scholar
Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In: ICGI, pp. 29–42 (July 2008)
Google Scholar
Fenwick, P.M.: A New Data Structure for Cumulative Frequency Tables. Softw. Pract. Exper. 24, 327–336 (1994)
Article Google Scholar
Gallé, M.: Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem. Université de Rennes 1 (February 2011)
Google Scholar
Gallé, M.: The bag-of-repeats representation of documents. In: SIGIR, pp. 1053–1056 (2013)
Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (January 1997)
Google Scholar
Manning, C., Raghavan, P., Schütze, H.: Introduction to Inf Retrieval. Cambridge UP (2009)
Google Scholar
Navarro, G.: Spaces, Trees and Colors: The Algorithmic Landscape of Document Retrieval on Sequences. arXiv (2013)
Google Scholar
Ohlebusch, E., Beller, T., Abouelhoda, M.I.: Computing the Burrows Wheeler transform of a string and its reverse in parallel. Journal of Discrete Algorithms 1, 1–13 (2013), http://linkinghub.elsevier.com/retrieve/pii/S1570866713000397
Google Scholar
Puglisi, S., Smyth, W.F., Turpin, A.: A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39(2) (July 2007), http://portal.acm.org/citation.cfm?id=1242471.1242472
Puglisi, S.J., Smyth, W.F., Yusufu, M.: Fast optimal algorithms for computing all the repeats in a string. In: Prague Stringology Conference, pp. 161–169 (2008)
Google Scholar
Schütze, H.: Automatic Word Sense Discrimination. Comput. Ling. 24(1) (1998)
Google Scholar
Solan, Z., Horn, D., Ruppin, E., Edelman, S.: Unsupervised learning of natural languages. PNAS, 11629–11634 (January 2005)
Google Scholar
Ukkonen, E.: Online construction of suffix trees. Algorithmica 14, 249–260 (1995)
Article MATH MathSciNet Google Scholar
van Zaanen, M.: ABL: Alignment-based learning. In: International Conference on Computational Linguistics, pp. 961–967 (2000)
Google Scholar
Zhang, S., Nong, G., Chan, W.H.: Fast and space efficient linear suffix array construction. In: DCC. IEEE Computer Society, Washington, DC (2008)
Google Scholar

Download references

Author information

Authors and Affiliations

Xerox Research Centre Europe, Grenoble, France
Matthias Gallé & Matías Tealdi

Authors

Matthias Gallé
View author publications
You can also search for this author in PubMed Google Scholar
Matías Tealdi
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Rovira i Virgili University, Avinguda Catalunya, 35, 43002, Tarragona, Spain
Adrian-Horia Dediu & Carlos Martín-Vide &
School of Computer Science, Department of Software Engineering and Artificial Intelligence, Complutense University of Madrid, Professor José Garcia Santesmases, 9, 28040, Madrid, Spain
José-Luis Sierra-Rodríguez
Fakultät für Informatik, Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Bianca Truthe

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Gallé, M., Tealdi, M. (2014). On Context-Diverse Repeats and Their Incremental Computation. In: Dediu, AH., Martín-Vide, C., Sierra-Rodríguez, JL., Truthe, B. (eds) Language and Automata Theory and Applications. LATA 2014. Lecture Notes in Computer Science, vol 8370. Springer, Cham. https://doi.org/10.1007/978-3-319-04921-2_31

Download citation

DOI: https://doi.org/10.1007/978-3-319-04921-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04920-5
Online ISBN: 978-3-319-04921-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics