Supervised Learning-Aided Optimization of Expert-Driven Functional Protein Sequence Annotation

Soinov, Lev; Kanapin, Alexander; Kapushesky, Misha

doi:10.1007/978-3-540-30219-3_14

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3240)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

The aim of this work is to use a supervised learning approach to identify sets of motif-based sequence characteristics whose combinations give the most accurate annotation of new proteins. We assess several InterPro Consortium member databases for their informativeness in annotating full-length protein sequences. Our study thus addresses the problem of integrating biological information from various resources. Decision-rule algorithms are used to cross-map different biological classification systems in order to optimise the functional annotation of protein sequences. With the developed approach, various features (e.g., keywords, GO terms, structural complex names) may be assigned to a sequence via its characteristics (e.g., motifs built by various protein sequence analysis methods). We chose SwissProt keywords as the set of features on which to perform our analysis. From the presented results one can quickly identify the best combinations of methods for describing a given class of proteins.
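
The idea of ranking combinations of methods by how well their motifs predict a keyword can be illustrated with a minimal sketch. The abstract does not specify which decision-rule algorithms were used, so the greedy rule learner below is a stand-in and not the authors' method. The toy proteins, the signature identifiers (PS_A, PF_X, and so on) and the function names are all hypothetical, and scores are computed on the training data only, without a held-out set.

```python
from itertools import combinations

# Hypothetical toy data: each protein is a set of (method, signature_id) hits
# plus a flag saying whether it carries the SwissProt keyword of interest.
PROTEINS = [
    ({("PROSITE", "PS_A"), ("Pfam", "PF_X")}, True),
    ({("PROSITE", "PS_A"), ("PRINTS", "PR_M")}, True),
    ({("Pfam", "PF_X"), ("PRINTS", "PR_N")}, True),
    ({("Pfam", "PF_Y"), ("PRINTS", "PR_M")}, False),
    ({("PROSITE", "PS_B"), ("Pfam", "PF_Y")}, False),
    ({("PRINTS", "PR_N")}, False),
]


def f1_score(rule, proteins):
    """F1 of the rule 'any signature in `rule` is present => keyword'."""
    tp = fp = fn = 0
    for hits, has_keyword in proteins:
        predicted = bool(rule & hits)
        if predicted and has_keyword:
            tp += 1
        elif predicted:
            fp += 1
        elif has_keyword:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def learn_rule(proteins, methods):
    """Greedily grow a disjunctive rule using only signatures from `methods`."""
    candidates = sorted(
        {hit for hits, _ in proteins for hit in hits if hit[0] in methods}
    )
    rule, best = set(), 0.0
    while True:
        scored = [
            (f1_score(rule | {c}, proteins), c) for c in candidates if c not in rule
        ]
        if not scored:
            break
        score, pick = max(scored, key=lambda pair: pair[0])
        if score <= best:  # stop when no signature improves the rule
            break
        rule.add(pick)
        best = score
    return frozenset(rule), best


def rank_method_combinations(proteins, max_size=2):
    """Score every combination of up to `max_size` methods, best first."""
    methods = sorted({method for hits, _ in proteins for method, _ in hits})
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(methods, size):
            rule, score = learn_rule(proteins, set(combo))
            results.append((score, combo, rule))
    return sorted(results, key=lambda result: -result[0])


if __name__ == "__main__":
    for score, combo, rule in rank_method_combinations(PROTEINS):
        print(f"{score:.3f}  {' + '.join(combo)}  {sorted(rule)}")
```

On this toy data no single method separates the keyword-bearing proteins perfectly, while the PROSITE and Pfam pair does. That is the kind of result the abstract describes: a ranking from which the most suitable combination of methods for a protein class can be read off.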

Author information

Authors and Affiliations

Algorithms and methods, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Lev Soinov
InterPro and SwissProt data retrieval and encoding, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Alexander Kanapin
Calculations and programming, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Misha Kapushesky

Editor information

Editors and Affiliations

Department of Informatics and Computational Biology Unit, HIB, University of Bergen, 5020, Bergen, Norway
Inge Jonassen
Department of Biology, Penn Center for Bioinformatics, Penn Genomics Institute, 415 S. University Ave., Philadelphia, PA 19104, USA
Junhyong Kim

Cite this paper

Soinov, L., Kanapin, A., Kapushesky, M. (2004). Supervised Learning-Aided Optimization of Expert-Driven Functional Protein Sequence Annotation. In: Jonassen, I., Kim, J. (eds) Algorithms in Bioinformatics. WABI 2004. Lecture Notes in Computer Science, vol 3240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30219-3_14

DOI: https://doi.org/10.1007/978-3-540-30219-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23018-2
Online ISBN: 978-3-540-30219-3
eBook Packages: Springer Book Archive