Assessing the Significance of Sets of Words
Various criteria have been defined to evaluate the significance of sets of words, but computing them is often difficult. We provide explicit expressions for the waiting time in this context and, in order to assess the significance of a cluster of potential binding sites, extend them to the co-occurrence problem. We point out that the values of these criteria depend on a few fundamental parameters, and we provide efficient algorithms to compute them that rely on a combinatorial interpretation of the formulae. We show that our results are very tight in the so-called twilight zone and improve on previous rough approximations. The text is assumed to be generated by a stationary Markov process. These results are developed for an extended model of consensus.
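To make the notion of waiting time concrete, the following is a minimal sketch of the classical single-word case under a Bernoulli (i.i.d.) model, not the Markov formulae or the set-of-words algorithms of the paper. It assumes the well-known overlap identity: the expected position at which a word first ends equals the sum of 1/P(prefix of length k) over every k for which the prefix of length k is also a suffix of the word. The function name `expected_waiting_time` and the uniform DNA letter probabilities are illustrative assumptions.

```python
from fractions import Fraction


def expected_waiting_time(word, probs):
    """Expected end position of the first occurrence of `word` in an i.i.d. text.

    Assumption: Bernoulli model, letters drawn independently with the
    probabilities in `probs`. Uses the classical overlap (autocorrelation)
    identity for a single word.
    """
    total = Fraction(0)
    for k in range(1, len(word) + 1):
        # k contributes only if the prefix of length k is also a suffix.
        if word[:k] == word[-k:]:
            p = Fraction(1)
            for ch in word[:k]:
                p *= Fraction(probs[ch])
            total += 1 / p
    return total


# Hypothetical uniform model on the DNA alphabet.
uniform = {ch: Fraction(1, 4) for ch in "ACGT"}

print(expected_waiting_time("ACGT", uniform))  # 256
print(expected_waiting_time("AAAA", uniform))  # 340
```

Although both words have the same occurrence probability at any given position, the self-overlapping word AAAA takes longer on average to appear for the first time, which illustrates why significance criteria for words depend on their overlap structure and not only on their probability.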
Keywords: Markov Model, Occurrence Probability, String Match, Bernoulli Model, Markov Stationary Process