Monotone Scoring of Patterns with Mismatches

  • Alberto Apostolico
  • Cinzia Pizzi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)

Abstract

We study the problem of extracting, from given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent queries: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k^2) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for probabilities and expected occurrences of a substring under unit increases in its length, allowed number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns.
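To make the quantities in the abstract concrete, the following is a minimal sketch of how the probability of a word w occurring with exactly, or up to, k mismatches could be computed under an i.i.d. source. It is a naive per-word dynamic program taking O(|w| k) time, not the O(nk) preprocessing or the O(k^2) query scheme of the paper. The function names, the `char_prob` dictionary, and the estimate of expected occurrences as (n - |w| + 1) times the per-position probability are assumptions made for illustration.

```python
from typing import Dict, List


def mismatch_probabilities(w: str, char_prob: Dict[str, float], k: int) -> List[float]:
    """Return [P(exactly 0 mismatches), ..., P(exactly k mismatches)] for a
    random window of len(w) characters drawn i.i.d. from char_prob.

    Hypothetical helper for illustration; not the paper's algorithm.
    """
    # dp[e] = probability that the prefix read so far matches with exactly e mismatches
    dp = [1.0] + [0.0] * k
    for c in w:
        p = char_prob[c]  # probability that a random character equals c
        # Update from high e to low e so dp[e - 1] still holds the previous prefix value.
        for e in range(k, 0, -1):
            dp[e] = dp[e] * p + dp[e - 1] * (1.0 - p)
        dp[0] *= p
    return dp


def expected_occurrences(w: str, char_prob: Dict[str, float], k: int, n: int,
                         exact: bool = False) -> float:
    """Expected number of positions in a random text of length n where w occurs
    with exactly k mismatches (exact=True) or with up to k mismatches."""
    probs = mismatch_probabilities(w, char_prob, k)
    p = probs[k] if exact else sum(probs)
    positions = max(0, n - len(w) + 1)  # number of candidate start positions
    return positions * p


if __name__ == "__main__":
    uniform = {c: 0.25 for c in "ACGT"}  # assumed uniform DNA source
    print(mismatch_probabilities("ACGT", uniform, 1))
    print(expected_occurrences("ACGT", uniform, 1, n=1000))
```

With a table like this in hand, one can also observe the kind of monotonicity the abstract refers to: for a fixed word, the probability of occurring with up to k mismatches cannot decrease as k grows, since each increment of k only adds the non-negative term for exactly k mismatches.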

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Alberto Apostolico (1)
  • Cinzia Pizzi (2)
  1. University of Padova & Purdue University
  2. University of Padova
