Abstract
The identification of evolutionarily related (homologous) proteins is a key problem in molecular biology. Here we present an inductive logic programming based method, Homology Induction (HI), which acts as a filter for existing sequence similarity searches to improve their performance in the detection of remote protein homologies. HI performs a PSI-BLAST search to generate positive, negative, and uncertain examples, and collects descriptions of these examples. It then learns rules that discriminate the positive from the negative examples. These rules are used to filter the uncertain examples in the “twilight zone”. HI uses a multitable database of 51,430,710 pre-fabricated facts from a variety of biological sources, and the inductive logic programming system Aleph to induce rules. HI was tested on an independent set of protein sequences with at most 40 per cent sequence similarity (PDB40D). ROC analysis shows that HI can significantly improve existing similarity searches. The method is automated and can be used via a web/mail interface.
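The filtering idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the E-value cut-offs, the `Hit` record, the function names, and the hand-written `example_rule` are all assumptions made for illustration. In HI the rules are induced by Aleph from the fact database, and the abstract does not state which thresholds separate positive, negative, and uncertain examples.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical E-value cut-offs; the abstract does not give the real ones.
POSITIVE_MAX_EVALUE = 1e-3   # at or below: treated as a positive example
NEGATIVE_MIN_EVALUE = 10.0   # at or above: treated as a negative example


@dataclass(frozen=True)
class Hit:
    """One hit from a similarity search (hypothetical record layout)."""
    protein_id: str
    evalue: float
    keywords: frozenset = frozenset()  # stand-in for the collected descriptions


def partition_hits(hits: Iterable[Hit]) -> Tuple[List[Hit], List[Hit], List[Hit]]:
    """Split hits into positive, negative, and uncertain examples by E-value."""
    positives, negatives, uncertain = [], [], []
    for hit in hits:
        if hit.evalue <= POSITIVE_MAX_EVALUE:
            positives.append(hit)
        elif hit.evalue >= NEGATIVE_MIN_EVALUE:
            negatives.append(hit)
        else:
            uncertain.append(hit)  # the "twilight zone"
    return positives, negatives, uncertain


def filter_uncertain(uncertain: Iterable[Hit], rule: Callable[[Hit], bool]) -> List[Hit]:
    """Keep only the uncertain hits that the rule accepts."""
    return [hit for hit in uncertain if rule(hit)]


def example_rule(hit: Hit) -> bool:
    """Hand-written placeholder for a learned rule (not produced by Aleph)."""
    return "kinase" in hit.keywords


hits = [
    Hit("P1", 1e-20, frozenset({"kinase"})),
    Hit("P2", 0.5, frozenset({"kinase"})),
    Hit("P3", 2.0, frozenset({"transporter"})),
    Hit("P4", 50.0),
]

positives, negatives, uncertain = partition_hits(hits)
accepted = filter_uncertain(uncertain, example_rule)
print([hit.protein_id for hit in accepted])  # prints ['P2']
```

In this toy run, `P1` and `P4` serve as the positive and negative training examples, while `P2` and `P3` fall in the uncertain band and only `P2` passes the placeholder rule.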
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Karwath, A., King, R.D. (2001). An Automated ILP Server in the Field of Bioinformatics. In: Rouveirol, C., Sebag, M. (eds) Inductive Logic Programming. ILP 2001. Lecture Notes in Computer Science, vol 2157. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44797-0_8
DOI: https://doi.org/10.1007/3-540-44797-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42538-0
Online ISBN: 978-3-540-44797-9
eBook Packages: Springer Book Archive