Faster Variance Computation for Patterns with Gaps

Cunial, Fabio

doi:10.1007/978-3-642-34862-4_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7659)

Included in the following conference series:

Mediterranean Conference on Algorithms

Abstract

Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|²) time, improving a previous result that required O(2^|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w|log|w|) time online computation after O(|s|³) preprocessing of s. Our algorithms lend themselves to efficient implementations.
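The quantities named in the abstract can be illustrated in the simplest setting. The sketch below is not the chapter's algorithm: it assumes an iid text rather than a Markov chain, and it enumerates overlaps naively in time quadratic in |w|. A pattern is represented as a list of character sets, one per position, with a gap being the full alphabet. The names `class_prob`, `moments`, and `brute_force_moments` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod


def class_prob(chars, p):
    """Probability that one iid character falls in the set `chars`."""
    return sum(p[c] for c in chars)


def moments(w, p, n):
    """Mean and variance of the number of occurrences of w in an iid text
    of length n. Assumption: w is a list of frozensets, one per position."""
    m = len(w)
    N = n - m + 1                          # candidate start positions
    if N <= 0:
        return 0.0, 0.0
    q = prod(class_prob(c, p) for c in w)  # P(occurrence at a fixed position)
    second = N * q                         # diagonal terms of E[X^2]
    # Overlapping pairs: two copies of w whose starts differ by d < m.
    for d in range(1, min(m, N)):
        joint = 1.0
        for k in range(m + d):
            if k < d:
                allowed = w[k]             # covered by the first copy only
            elif k >= m:
                allowed = w[k - d]         # covered by the second copy only
            else:
                allowed = w[k] & w[k - d]  # covered by both copies
            joint *= class_prob(allowed, p)
        second += 2 * (N - d) * joint
    # Pairs at distance >= m share no position, so they are independent.
    if N > m:
        second += (N - m) * (N - m + 1) * q * q
    mean = N * q
    return mean, second - mean * mean


def brute_force_moments(w, p, n):
    """Same moments by enumerating every text of length n (tiny n only)."""
    m = len(w)
    mean = second = 0.0
    for text in product(sorted(p), repeat=n):
        weight = prod(p[c] for c in text)
        count = sum(all(text[i + k] in w[k] for k in range(m))
                    for i in range(n - m + 1))
        mean += weight * count
        second += weight * count * count
    return mean, second - mean * mean


if __name__ == "__main__":
    p = {"a": 0.3, "b": 0.7}               # illustrative character probabilities
    gap = frozenset(p)                     # a gap matches any character
    w = [frozenset("a"), gap, frozenset("a")]
    print(moments(w, p, 6))
    print(brute_force_moments(w, p, 6))
```

The exhaustive enumeration is included only as a cross-check on small inputs; the two functions should agree up to floating-point error.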

Author information

Authors and Affiliations

College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
Fabio Cunial

Editor information

Editors and Affiliations

School of Electrical Engineering, Tel-Aviv University, 67789, Tel-Aviv, Israel
Guy Even
School of Electrical Engineering, Tel-Aviv University, 67789, Tel-Aviv, Israel
Dror Rawitz

About this paper

Cite this paper

Cunial, F. (2012). Faster Variance Computation for Patterns with Gaps. In: Even, G., Rawitz, D. (eds) Design and Analysis of Algorithms. MedAlg 2012. Lecture Notes in Computer Science, vol 7659. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34862-4_10

DOI: https://doi.org/10.1007/978-3-642-34862-4_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34861-7
Online ISBN: 978-3-642-34862-4
eBook Packages: Computer Science, Computer Science (R0)
