Computational Complexity of Word Counting

  • Conference paper
Computational Biology (JOBIM 2000)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2066)

Abstract

Evaluating the frequency of occurrences of a given set of patterns in a DNA sequence has numerous applications and has been studied extensively in recent years. We discuss the computational complexity of the explicit formulae derived by several authors. We introduce a correlation automaton that minimizes this complexity, which is crucial for practical applications. Notably, it makes it possible to handle the Markovian probability model. The case of patterns with some unspecified characters (approximate searching, regular expressions, and so on) is also addressed.
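To make the quantities in the abstract concrete, here is a minimal Python sketch of word counting on a DNA string. It is not the correlation automaton introduced in the paper, and it assumes a simple independent-letter (Bernoulli) model rather than the Markovian model the abstract mentions. The function names count_occurrences, autocorrelation, and expected_count are hypothetical, chosen for illustration only. The sketch counts overlapping occurrences of each pattern, lists the shifts at which a word overlaps itself (its autocorrelation, the kind of overlap structure that word-count formulae typically depend on), and computes the expected number of occurrences under the assumed model.

```python
def count_occurrences(text, patterns):
    """Count overlapping occurrences of each pattern in text."""
    counts = {}
    for p in patterns:
        n_hits = 0
        start = text.find(p)
        while start != -1:
            n_hits += 1
            # Advance by one position only, so overlapping hits are counted.
            start = text.find(p, start + 1)
        counts[p] = n_hits
    return counts


def autocorrelation(word):
    """Return the shifts k (0 <= k < len(word)) at which the word overlaps
    itself, i.e. the suffix word[k:] equals the prefix of the same length."""
    m = len(word)
    return [k for k in range(m) if word[k:] == word[:m - k]]


def expected_count(word, n, probs):
    """Expected number of occurrences of word in a random text of length n,
    ASSUMING letters are drawn independently with probabilities probs."""
    p_word = 1.0
    for letter in word:
        p_word *= probs[letter]
    positions = max(n - len(word) + 1, 0)  # possible starting positions
    return positions * p_word


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(count_occurrences("ATATAGATA", ["ATA", "TAG"]))
    print(autocorrelation("ATA"))
    print(expected_count("ATA", 9, uniform))
```

The naive counting loop costs time proportional to the text length for each pattern, which is the kind of cost that automaton-based approaches aim to reduce when the pattern set is large.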

This research was supported by ESPRIT LTR Project No. 20244 (ALCOM IT) and REMAG Action of INRIA.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Régnier, M. (2001). Computational Complexity of Word Counting. In: Gascuel, O., Sagot, MF. (eds) Computational Biology. JOBIM 2000. Lecture Notes in Computer Science, vol 2066. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45727-5_9

  • DOI: https://doi.org/10.1007/3-540-45727-5_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42242-6

  • Online ISBN: 978-3-540-45727-5

  • eBook Packages: Springer Book Archive
