Supervised Learning-Aided Optimization of Expert-Driven Functional Protein Sequence Annotation

  • Lev Soinov
  • Alexander Kanapin
  • Misha Kapushesky
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)


The aim of this work is to use a supervised learning approach to identify sets of motif-based sequence characteristics whose combinations give the most accurate annotation of new proteins. We assess several InterPro Consortium member databases for their informativeness in annotating full-length protein sequences; the study thus addresses the problem of integrating biological information from various resources. Decision-rule algorithms are used to cross-map different biological classification systems in order to optimise the functional annotation of protein sequences. With the developed approach, various features (e.g., keywords, GO terms, structural complex names) can be assigned to a sequence via its characteristics (e.g., motifs built by various protein sequence analysis methods). We chose SwissProt keywords as the set of features on which to perform our analysis. The presented results make it possible to quickly identify the best combinations of methods for describing a given class of proteins.
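
The idea of ranking motif sources by how well a simple decision rule predicts a keyword can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a one-feature, majority-vote rule (in the spirit of the OneR learner), scores each rule by plain training accuracy rather than by misclassification cost, and uses invented motif identifiers (`PFAM:A`, `PROSITE:B`, `PRINTS:C`), an invented keyword, and invented toy data.

```python
from collections import Counter

# Hypothetical toy data: each protein is a dict of binary motif hits
# (identifiers are invented) plus a flag saying whether the protein
# carries the keyword of interest (here an imaginary "Kinase" keyword).
PROTEINS = [
    ({"PFAM:A": 1, "PROSITE:B": 1, "PRINTS:C": 0}, True),
    ({"PFAM:A": 1, "PROSITE:B": 0, "PRINTS:C": 0}, True),
    ({"PFAM:A": 0, "PROSITE:B": 1, "PRINTS:C": 1}, False),
    ({"PFAM:A": 0, "PROSITE:B": 0, "PRINTS:C": 1}, False),
    ({"PFAM:A": 1, "PROSITE:B": 1, "PRINTS:C": 1}, True),
    ({"PFAM:A": 0, "PROSITE:B": 1, "PRINTS:C": 0}, False),
]


def one_rule(data, feature):
    """Learn a one-feature rule: each feature value predicts its majority label.

    Returns the rule (value -> predicted label) and its training accuracy.
    """
    counts = {}
    for hits, label in data:
        counts.setdefault(hits[feature], Counter())[label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    correct = sum(rule[hits[feature]] == label for hits, label in data)
    return rule, correct / len(data)


def rank_features(data):
    """Rank motif features by the training accuracy of their one-feature rule."""
    features = sorted(data[0][0])
    scored = [(one_rule(data, f)[1], f) for f in features]
    # Highest accuracy first; ties broken alphabetically by feature name.
    return sorted(scored, key=lambda s: (-s[0], s[1]))


ranking = rank_features(PROTEINS)
for accuracy, feature in ranking:
    print(f"{feature}: {accuracy:.2f}")
```

On this toy set the rule built on `PFAM:A` separates the two classes perfectly, while `PROSITE:B` is uninformative. A realistic study would evaluate rules on held-out proteins and weigh false positives and false negatives differently.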


Keywords: Decision Rule; Feature Subset Selection; Misclassification Cost; Relative Informativeness; Supervised Learning Approach





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Lev Soinov (1)
  • Alexander Kanapin (2)
  • Misha Kapushesky (3)
  1. Algorithms and methods, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
  2. InterPro and SwissProt data retrieval and encoding, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
  3. Calculations and programming, EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
