Computational Complexity of Word Counting

Régnier, Mireille

doi:10.1007/3-540-45727-5_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2066)

Included in the following conference series:

International Conference on Biology, Informatics, and Mathematics

Abstract

Evaluating the frequency of occurrences of a given set of patterns in a DNA sequence has numerous applications and has been studied extensively in recent years. We discuss the computational complexity of the explicit formulae derived by several authors. We introduce a correlation automaton that minimizes this complexity, which is crucial for practical applications. Notably, it makes it possible to handle the Markovian probability model. We also address patterns with some unspecified characters, such as those arising in approximate searching and regular expressions.

This research was supported by ESPRIT LTR Project No. 20244 (ALCOM IT) and REMAG Action of INRIA.

Author information

Authors and Affiliations

INRIA, 78153, Le Chesnay
Mireille Régnier

Editor information

Editors and Affiliations

Laboratoire d’Informatique, de Robotique et de Microelectronique de Montpellier, 161 rue Ada, 34392, Montpellier Cedex 5, France
Olivier Gascuel
Laboratoire d’Algorithmique Combinatoire, Institut Pasteur, 28, rue du Dr. Roux, 75724, Paris Cedex 15, France
Marie-France Sagot

About this paper

Cite this paper

Régnier, M. (2001). Computational Complexity of Word Counting. In: Gascuel, O., Sagot, MF. (eds) Computational Biology. JOBIM 2000. Lecture Notes in Computer Science, vol 2066. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45727-5_9

DOI: https://doi.org/10.1007/3-540-45727-5_9
Published: 28 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42242-6
Online ISBN: 978-3-540-45727-5
eBook Packages: Springer Book Archive
