Pattern Discovery Allowing Wild-Cards, Substitution Matrices, and Multiple Score Functions

Mancheron, Alban; Rusu, Irena

doi:10.1007/978-3-540-39763-2_10

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 2812)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

Pattern discovery has many applications in finding functionally or structurally important regions in biological sequences (binding sites, regulatory sites, protein signatures, etc.). In this paper we present a new pattern discovery algorithm, which has the following features:

it finds, in exactly the same manner and without any prior specification, both contiguous patterns and patterns with fixed-length gaps (i.e. runs of one or several consecutive wild-cards);

it allows the use of any pairwise score function, thus offering multiple ways to define or constrain the type of pattern searched for; in particular, one can use substitution matrices (PAM, BLOSUM) to compare amino acids, exact matching to compare nucleotides, or equivalence sets in both cases.

We describe the algorithm, compare it to other algorithms, and report test results on discovering binding sites for the DNA-binding proteins ArgR, LexA, PurR, and TyrR in E. coli, and promoter sites in a set of dicot plants.
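
To make the two features above concrete, the following minimal Python sketch shows how a fixed-length pattern containing wild-cards can be scored against sequence windows with an interchangeable pairwise score function. It is an illustration only, not the authors' algorithm, which is not reproduced on this page. The `.` wild-card notation, every function name, the threshold semantics, and the example sequence are assumptions introduced here.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Score = Callable[[str, str], float]

WILDCARD = "."  # assumed notation for a wild-card position


def exact_score(a: str, b: str) -> float:
    """Exact matching: 1 for identical symbols, 0 otherwise."""
    return 1.0 if a == b else 0.0


def equivalence_score(classes: List[FrozenSet[str]]) -> Score:
    """Score 1 if the two symbols are equal or share an equivalence set."""
    def score(a: str, b: str) -> float:
        if a == b:
            return 1.0
        return 1.0 if any(a in c and b in c for c in classes) else 0.0
    return score


def matrix_score(table: Dict[Tuple[str, str], float],
                 default: float = 0.0) -> Score:
    """Score from a symmetric lookup table (a PAM/BLOSUM-style matrix)."""
    def score(a: str, b: str) -> float:
        return table.get((a, b), table.get((b, a), default))
    return score


def window_score(pattern: str, window: str, score: Score) -> float:
    """Sum pairwise scores over the non-wild-card positions of the pattern."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have equal length")
    return sum(score(p, w) for p, w in zip(pattern, window) if p != WILDCARD)


def occurrences(pattern: str, sequence: str, score: Score,
                threshold: float) -> List[int]:
    """Start positions whose window score reaches the (assumed) threshold."""
    k = len(pattern)
    return [i for i in range(len(sequence) - k + 1)
            if window_score(pattern, sequence[i:i + k], score) >= threshold]


# Hypothetical example data: a gapped pattern and a made-up DNA sequence.
pattern = "TG..CA"
sequence = "ACTGAACATGTTCG"

# Equivalence sets for nucleotides: purines and pyrimidines.
purine_pyrimidine = equivalence_score(
    [frozenset("AG"), frozenset("CT")]
)

print(occurrences(pattern, sequence, exact_score, 4))        # [2]
print(occurrences(pattern, sequence, purine_pyrimidine, 4))  # [2, 8]
```

Swapping the score function changes which windows count as occurrences while the search code stays the same. A real substitution matrix would be passed through `matrix_score` as a lookup table keyed by symbol pairs; no actual PAM or BLOSUM values are reproduced here.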

Author information

Authors and Affiliations

I.R.I.N., Université de Nantes, 2, Rue de la Houssinière, B.P. 92208, 44322 Cedex 3, Nantes, France
Alban Mancheron & Irena Rusu

Editor information

Editors and Affiliations

Department of Biomathematical Sciences, The Mount Sinai School of Medicine, 10029-6574, New York, NY
Gary Benson
Institute of Biomedical and Life Sciences, Division of Environmental and Evolutionary Biology, University of Glasgow, G12 8QQ, Glasgow, Scotland
Roderic D. M. Page

Cite this paper

Mancheron, A., Rusu, I. (2003). Pattern Discovery Allowing Wild-Cards, Substitution Matrices, and Multiple Score Functions. In: Benson, G., Page, R.D.M. (eds) Algorithms in Bioinformatics. WABI 2003. Lecture Notes in Computer Science, vol 2812. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39763-2_10

DOI: https://doi.org/10.1007/978-3-540-39763-2_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20076-5
Online ISBN: 978-3-540-39763-2
eBook Packages: Springer Book Archive
