Pattern Discovery Allowing Wild-Cards, Substitution Matrices, and Multiple Score Functions

  • Alban Mancheron
  • Irena Rusu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2812)


Pattern discovery has many applications in finding functionally or structurally important regions in biological sequences (binding sites, regulatory sites, protein signatures, etc.). In this paper, we present a new pattern discovery algorithm with the following features:

  • It finds, in exactly the same manner and without any prior specification, both contiguous patterns and patterns with fixed-length gaps (i.e., runs of one or more consecutive wild-cards).

  • It accepts any pairwise score function, thus offering multiple ways to define or constrain the type of pattern searched for. In particular, one can use substitution matrices (PAM, BLOSUM) to compare amino acids, exact matching to compare nucleotides, or equivalency sets in either case.

We describe the algorithm, compare it to other algorithms, and report test results on discovering binding sites for the DNA-binding proteins ArgR, LexA, PurR, and TyrR in E. coli, and promoter sites in a set of dicot plants.
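To make the two features listed above concrete, here is a minimal Python sketch. It is not the algorithm of the paper, which the abstract does not describe. It only illustrates how a pattern containing wild-cards can be scored against a sequence window with an interchangeable pairwise score function. The names `WILDCARD`, `exact_score`, `matrix_score`, `equivalence_score`, `pattern_score`, and `find_occurrences`, the `.` wild-card symbol, and the toy matrix values are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Score = Callable[[str, str], int]
WILDCARD = "."  # hypothetical wild-card symbol; the paper may use another

def exact_score(a: str, b: str) -> int:
    """Exact matching, e.g. for nucleotides: 1 on identity, 0 otherwise."""
    return 1 if a == b else 0

def matrix_score(matrix: Dict[Tuple[str, str], int]) -> Score:
    """Build a score function from a symmetric substitution table."""
    def score(a: str, b: str) -> int:
        # Look the pair up in either order; unknown pairs score 0.
        return matrix.get((a, b), matrix.get((b, a), 0))
    return score

def equivalence_score(classes: Iterable[FrozenSet[str]]) -> Score:
    """Build a score function from equivalency sets of symbols."""
    class_list = list(classes)
    def score(a: str, b: str) -> int:
        same_class = any(a in c and b in c for c in class_list)
        return 1 if a == b or same_class else 0
    return score

def pattern_score(pattern: str, window: str, score: Score) -> int:
    """Sum pairwise scores over the non-wild-card positions of the pattern."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have the same length")
    return sum(score(p, w) for p, w in zip(pattern, window) if p != WILDCARD)

def find_occurrences(pattern: str, sequence: str,
                     score: Score, threshold: int) -> List[int]:
    """Return start positions whose window scores at least `threshold`."""
    k = len(pattern)
    return [i for i in range(len(sequence) - k + 1)
            if pattern_score(pattern, sequence[i:i + k], score) >= threshold]

# Usage: one gapped pattern, three different score functions.
purine_pyrimidine = equivalence_score([frozenset("AG"), frozenset("CT")])
toy_matrix = matrix_score({("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2})  # toy values

hits_exact = find_occurrences("TG..CA", "ACTGAACATGTTCA", exact_score, 4)
hits_equiv = find_occurrences("TG..CA", "CATTTG", purine_pyrimidine, 4)
protein_score = pattern_score("L.I", "IQI", toy_matrix)
```

Because the wild-card positions are simply skipped when scoring, a contiguous pattern is handled by the same code path as a gapped one, and swapping the score function changes which windows count as occurrences without touching the scan itself.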


Keywords: Score Function; Biological Sequence; Pattern Discovery; Sequence Logo; Dicot Plant





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Alban Mancheron (1)
  • Irena Rusu (1)
  1. I.R.I.N., Université de Nantes, Nantes, France
