Abstract
Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|^2) time, improving a previous result that required O(2^|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w| log |w|) time online computation after O(|s|^3) preprocessing of s. Our algorithms lend themselves to efficient implementations.
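To make the quantities in the abstract concrete, the following is a minimal sketch of the naive quadratic-time computation of the expectation and variance of the number of occurrences of a rigid gapped pattern in an iid text. It is a baseline for illustration only, not the algorithm of the paper: it assumes at most one character per position, an iid source rather than a Markov chain, and a hypothetical gap symbol `.`. The function name `occurrence_moments` and its parameters are assumptions introduced here.

```python
def occurrence_moments(w, n, prob, gap="."):
    """Mean and variance of the number of occurrences of pattern w
    in an iid text of length n.

    w    : pattern string; `gap` marks a don't-care position (assumed symbol)
    prob : dict mapping each solid character to its probability
    """
    m = len(w)
    N = n - m + 1                      # number of candidate start positions
    if N <= 0:
        return 0.0, 0.0

    # Probability that w occurs at one fixed position.
    p = 1.0
    for c in w:
        if c != gap:
            p *= prob[c]

    mean = N * p
    var = N * p * (1.0 - p)            # sum of the indicator variances

    # Covariances of occurrences that overlap at distance d < m.
    # Occurrences at distance d >= m are independent in an iid text.
    for d in range(1, min(m, N)):
        joint = 1.0                    # P(w occurs at 0 and at d)
        for j in range(m + d):
            a = w[j] if j < m else gap         # constraint from first copy
            b = w[j - d] if j >= d else gap    # constraint from shifted copy
            if a != gap and b != gap and a != b:
                joint = 0.0            # the two copies conflict
                break
            c = a if a != gap else b
            if c != gap:
                joint *= prob[c]
        var += 2.0 * (N - d) * (joint - p * p)

    return mean, var
```

For example, `occurrence_moments("a.a", 4, {"a": 0.5, "b": 0.5})` evaluates the two possible start positions of `a.a` in a text of length 4. The inner loop over each shift d is what makes this baseline quadratic in |w|.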
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cunial, F. (2012). Faster Variance Computation for Patterns with Gaps. In: Even, G., Rawitz, D. (eds) Design and Analysis of Algorithms. MedAlg 2012. Lecture Notes in Computer Science, vol 7659. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34862-4_10
DOI: https://doi.org/10.1007/978-3-642-34862-4_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34861-7
Online ISBN: 978-3-642-34862-4
eBook Packages: Computer Science, Computer Science (R0)