Abstract
We show how prior domain knowledge can be used in a system for mining databases of biological data. Our system performs automated discovery of diagnostic patterns from a database of protein sequences. Such patterns are used to classify new sequences and to identify biologically interesting positions in the proteins. The patterns have a simple syntax and can be translated into regular expressions, which can be used for rapid scanning of databases. Current pattern libraries are built semi-manually, since the correctness of a pattern depends on the incorporation of domain knowledge. Because the databases are growing dramatically, it is desirable to automate this process. Our results show that the patterns derived by our fully automated system compete well with the semi-manually constructed patterns.
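To make the translation of patterns into regular expressions concrete, here is a minimal sketch in Python. The abstract does not specify the pattern syntax, so the sketch assumes a PROSITE-style notation (elements joined by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, `<` and `>` as sequence-end anchors). The function names `pattern_to_regex` and `scan` are hypothetical and are not taken from the paper.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (assumed syntax) into a regex string."""
    pattern = pattern.rstrip(".")  # PROSITE-style patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the start of the sequence
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the end of the sequence
            suffix, element = "$", element[:-1]
        # Split off an optional repetition count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, count = match.group(1), match.group(2)
        if core == "x":  # wildcard: any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # Bracketed sets like [LIVM] and single letters are already valid regex.
        if count:
            core += "{" + count + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def scan(pattern: str, sequences: dict) -> dict:
    """Return {sequence name: start of first match} for sequences that match."""
    regex = re.compile(pattern_to_regex(pattern))
    hits = {}
    for name, sequence in sequences.items():
        found = regex.search(sequence)
        if found:
            hits[name] = found.start()
    return hits
```

Compiling the pattern once and reusing the compiled expression for every sequence is what makes this kind of scan fast; the pattern and sequences passed in would come from the pattern library and the protein database, respectively.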
© 1998 Springer-Verlag Berlin Heidelberg
Cite this paper
Olsson, B., Laurio, K. (1998). Discovery of diagnostic patterns from protein sequence databases. In: Żytkow, J.M., Quafafou, M. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1998. Lecture Notes in Computer Science, vol 1510. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0094817
DOI: https://doi.org/10.1007/BFb0094817
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65068-3
Online ISBN: 978-3-540-49687-8
eBook Packages: Springer Book Archive