Computational Complexity of Word Counting

  • Mireille Régnier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2066)


Evaluating the frequency of occurrences of a given set of patterns in a DNA sequence has numerous applications and has been studied extensively in recent years. We discuss the computational complexity of the explicit formulae derived by several authors. We introduce a correlation automaton that minimizes this complexity, which is crucial for practical applications. Notably, it makes it possible to handle the Markovian probability model. We also address patterns with some unspecified characters, as arise in approximate searching, regular expressions, and similar settings.
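To make the quantities in the abstract concrete, here is a minimal sketch in Python. It is not the correlation automaton of the paper, whose construction is not given on this page. It only illustrates three standard ingredients of word counting: the self-overlap shifts of a word (its autocorrelation), a naive count of overlapping occurrences, and the expected count under an independent-letter (Bernoulli) model. The function names and the uniform letter probabilities are assumptions chosen for illustration.

```python
from typing import Dict, List


def autocorrelation(word: str) -> List[int]:
    """Return the shifts k (0 <= k < len(word)) at which word overlaps itself.

    A shift k is kept when the suffix of word starting at position k equals
    the prefix of word of the same length. The shift 0 is always present.
    """
    m = len(word)
    return [k for k in range(m) if word[k:] == word[: m - k]]


def count_overlapping(text: str, word: str) -> int:
    """Count occurrences of word in text, overlapping occurrences included."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i : i + m] == word)


def expected_count_bernoulli(word: str, n: int, probs: Dict[str, float]) -> float:
    """Expected number of occurrences of word in a random text of length n.

    Assumes letters are drawn independently with the probabilities in probs
    (a Bernoulli model, simpler than the Markovian model of the abstract).
    """
    p_word = 1.0
    for letter in word:
        p_word *= probs[letter]
    # There are n - len(word) + 1 possible starting positions.
    return max(n - len(word) + 1, 0) * p_word


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed model
    print(autocorrelation("ATA"))
    print(count_overlapping("ATATATA", "ATA"))
    print(expected_count_bernoulli("ATA", 7, uniform))
```

In the Bernoulli model the expected count depends only on the letter probabilities, whereas the variance also depends on the overlap shifts, which is why self-overlapping words such as ATA need separate treatment.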


Keywords (added by machine, not by the authors): Computational Complexity, Markovian Model, Regular Expression, Word Counting, Pattern Occurrence





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Mireille Régnier (1)
  1. INRIA, Le Chesnay
