Pattern Discovery Allowing Wild-Cards, Substitution Matrices, and Multiple Score Functions
Pattern discovery has many applications in finding functionally or structurally important regions in biological sequences (binding sites, regulatory sites, protein signatures etc.). In this paper we present a new pattern discovery algorithm, which has the following features:
it finds, in exactly the same manner and without any prior specification, both contiguous patterns and patterns with fixed-length gaps (i.e., runs of one or more consecutive wild-cards);
it accepts any pairwise score function, which offers multiple ways to define or constrain the type of pattern searched for; in particular, one can use substitution matrices (PAM, BLOSUM) to compare amino acids, exact matching to compare nucleotides, or equivalence sets in either case.
We describe the algorithm, compare it to other algorithms, and report test results on discovering binding sites for four DNA-binding proteins (ArgR, LexA, PurR, and TyrR) in E. coli, and promoter sites in a set of dicot plants.
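To make the two features above concrete, here is a minimal Python sketch of how a pattern with wild-cards might be scored against a sequence under interchangeable pairwise score functions. It is not the discovery algorithm of the paper, which the abstract does not detail; it only illustrates the scoring side. The wild-card symbol `.`, all function names, the threshold convention, and the tiny score table are assumptions made for illustration (the table is a toy, not actual PAM or BLOSUM values).

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Score = Callable[[str, str], float]
WILD = "."  # hypothetical wild-card symbol: the position is not scored


def exact_score(a: str, b: str) -> float:
    """Exact matching, e.g. for nucleotides."""
    return 1.0 if a == b else 0.0


def equivalence_score(classes: List[FrozenSet[str]]) -> Score:
    """Two symbols match if they are equal or share an equivalence set."""
    def score(a: str, b: str) -> float:
        if a == b:
            return 1.0
        return 1.0 if any(a in c and b in c for c in classes) else 0.0
    return score


def matrix_score(matrix: Dict[Tuple[str, str], float],
                 default: float = -1.0) -> Score:
    """Symmetric lookup in a substitution-matrix-like table (toy values)."""
    def score(a: str, b: str) -> float:
        return matrix.get((a, b), matrix.get((b, a), default))
    return score


def window_score(pattern: str, window: str, score: Score) -> float:
    """Sum pairwise scores over the non-wild-card positions of the pattern."""
    return sum(score(p, w) for p, w in zip(pattern, window) if p != WILD)


def occurrences(pattern: str, sequence: str, score: Score,
                threshold: float) -> List[int]:
    """Start positions whose window scores at least `threshold`."""
    k = len(pattern)
    return [i for i in range(len(sequence) - k + 1)
            if window_score(pattern, sequence[i:i + k], score) >= threshold]


# A gapped pattern and a contiguous one go through the same code path.
print(occurrences("A..T", "ACGTAGGTAAAA", exact_score, 2))  # [0, 4]
purine_pyrimidine = equivalence_score([frozenset("AG"), frozenset("CT")])
print(occurrences("AC", "GTAC", purine_pyrimidine, 2))
```

Because the score function is passed in as a parameter, switching from exact matching to equivalence sets or to a matrix lookup changes nothing in `window_score` or `occurrences`, which mirrors the flexibility claimed in the feature list.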
Keywords: Score Function, Biological Sequence, Pattern Discovery, Sequence Logo, Dicot Plant