Probabilistic Arithmetic Automata and Their Application to Pattern Matching Statistics

  • Tobias Marschall
  • Sven Rahmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5029)


We present probabilistic arithmetic automata (PAAs), which model chains of operations whose operands depend on chance. We provide two algorithms that exactly calculate the distribution of the results of such probabilistic calculations. Although we introduce PAAs and the corresponding algorithms generically, our main concern is their application to pattern matching statistics: we study the distribution of the number of occurrences of a pattern under a given text model. Such calculations play an important role in computational biology because they make it possible to assess the significance of pattern occurrences. To assess the practicability of our method, we apply it to the Prosite database of amino acid motifs and to the Jaspar database of transcription factor binding sites. For the latter, we additionally show that our framework can take into account binding affinities predicted from a physical model.
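To make the idea of an exact occurrence-count distribution concrete, the following is a minimal sketch and not the authors' algorithm: the formal PAA definition and the two algorithms are not given in this abstract. It assumes a single plain-string pattern, an i.i.d. text model, and overlapping occurrences. The function names `build_dfa` and `occurrence_distribution` are hypothetical. The chained operation is addition, and the chance-dependent operand is 1 when the automaton enters its matching state and 0 otherwise.

```python
from collections import defaultdict
from fractions import Fraction


def build_dfa(pattern, alphabet):
    """Transition table delta[state][char] of the automaton whose state is
    the length of the longest prefix of `pattern` that is a suffix of the
    text read so far (state len(pattern) means a match just ended)."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            s = pattern[:state] + c
            k = min(m, len(s))
            # Shrink k until the last k characters of s form a prefix of pattern.
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def occurrence_distribution(pattern, char_probs, n):
    """Exact distribution of the number of (overlapping) occurrences of
    `pattern` in a random text of length n with i.i.d. characters.
    Assumption for this sketch: char_probs maps each character to an
    exact probability (e.g. a Fraction), and the values sum to 1."""
    alphabet = sorted(char_probs)
    delta = build_dfa(pattern, alphabet)
    m = len(pattern)
    # Joint distribution over (automaton state, occurrences counted so far).
    joint = {(0, 0): Fraction(1)}
    for _ in range(n):
        nxt = defaultdict(Fraction)
        for (state, count), p in joint.items():
            for c in alphabet:
                t = delta[state][c]
                emitted = 1 if t == m else 0  # operand of the addition
                nxt[(t, count + emitted)] += p * char_probs[c]
        joint = nxt
    # Marginalize out the automaton state.
    dist = defaultdict(Fraction)
    for (_, count), p in joint.items():
        dist[count] += p
    return dict(dist)


if __name__ == "__main__":
    probs = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
    for count, p in sorted(occurrence_distribution("AA", probs, 3).items()):
        print(count, p)
```

For the pattern "AA" over a uniform two-letter alphabet and text length 3, the eight possible texts give probabilities 5/8, 1/4, and 1/8 for zero, one, and two occurrences. Exact fractions are used here only to keep the illustration free of rounding; floating-point probabilities would work the same way.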


Keywords: Transcription Factor Binding Site; Pattern Match; Amino Acid Motif; Text Model; Random Text
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Tobias Marschall (1)
  • Sven Rahmann (1)
  1. Bioinformatics for High-Throughput Technologies at the Chair of Algorithm Engineering, Computer Science Department, TU Dortmund, Dortmund, Germany
