Assessing the Significance of Sets of Words

  • Valentina Boeva
  • Julien Clément
  • Mireille Régnier
  • Mathias Vandenbogaert
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3537)


Various criteria have been defined to evaluate the significance of sets of words, but they are often difficult to compute. We provide explicit expressions for the waiting time in this setting and, in order to assess the significance of a cluster of potential binding sites, extend them to the co-occurrence problem. We point out that the values of these criteria depend on a few fundamental parameters, and we provide efficient algorithms to compute them that rely on a combinatorial interpretation of the formulae. We show that our results are very tight in the so-called twilight zone and improve on previous rough approximations. The text is assumed to be generated by a stationary Markov process. The results are developed for an extended model of consensus.
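
As a point of orientation for the notion of waiting time, the classical single-word case can be sketched in a few lines. This is not the paper's method: it assumes a Bernoulli (i.i.d.) text and one word rather than a Markov model and a set of words, and it uses the well-known autocorrelation formula, in which the expected position of the end of the first occurrence of a word w is the sum of 1/P(w[1..k]) over every k for which the length-k prefix of w equals its length-k suffix. The function name `expected_waiting_time` and the uniform DNA letter probabilities are illustrative assumptions.

```python
from fractions import Fraction


def expected_waiting_time(word, probs):
    """Expected end position of the first occurrence of `word` in an
    i.i.d. random text with letter probabilities `probs`.

    Classical autocorrelation formula for a single word in a Bernoulli
    model (an illustrative simplification, not the Markov/set-of-words
    setting of the paper).
    """
    total = Fraction(0)
    for k in range(1, len(word) + 1):
        # k contributes only if the word overlaps itself on k letters.
        if word[:k] == word[-k:]:
            p = Fraction(1)
            for ch in word[:k]:
                p *= probs[ch]
            total += 1 / p
    return total


# Assumed uniform letter distribution over the DNA alphabet.
uniform_dna = {c: Fraction(1, 4) for c in "ACGT"}

print(expected_waiting_time("ACGT", uniform_dna))  # 256 (no self-overlap)
print(expected_waiting_time("AAAA", uniform_dna))  # 340 (4 + 16 + 64 + 256)
```

The two words have the same occurrence probability at any given position, yet the self-overlapping word takes longer to appear for the first time, which illustrates why overlap structure is among the parameters that significance criteria depend on.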


Markov Model, Occurrence Probability, String Match, Bernoulli Model, Markov Stationary Process
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Valentina Boeva (1)
  • Julien Clément (2)
  • Mireille Régnier (3)
  • Mathias Vandenbogaert (4)

  1. Moscow State University, Vorob’evy Gory, Russia
  2. IGM, Université de Marne-la-Vallée, France
  3. Inria, Le Chesnay, France
  4. Biozentrum, Basel Universität, Switzerland
