Enzyme Function Prediction with Interpretable Models

  • Umar Syed
  • Golan Yona
Part of the Methods in Molecular Biology book series (MIMB, volume 541)


Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism.

Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
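The idea of a mixture of stochastic decision trees can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes binary labels, splits sampled in proportion to information gain, and a uniform average over trees, and the two features and their values are made-up toy data (hypothetical stand-ins for the protein properties mentioned above).

```python
import math
import random

def entropy(labels):
    """Shannon entropy (bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def candidate_splits(X, y):
    """Return (gain, feature, threshold) for every midpoint split with positive gain."""
    base = entropy(y)
    out = []
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if gain > 1e-12:
                out.append((gain, f, t))
    return out

def grow_tree(X, y, rng, depth=0, max_depth=3):
    """Grow one stochastic tree: each split is sampled with probability
    proportional to its information gain (an assumption of this sketch)."""
    splits = candidate_splits(X, y) if depth < max_depth else []
    if not splits:
        # Leaf: Laplace-smoothed probability of the positive class.
        return {"p": (sum(y) + 1.0) / (len(y) + 2.0)}
    _, f, t = rng.choices(splits, weights=[s[0] for s in splits], k=1)[0]
    left = [(row, label) for row, label in zip(X, y) if row[f] <= t]
    right = [(row, label) for row, label in zip(X, y) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow_tree([r for r, _ in left], [l for _, l in left], rng, depth + 1, max_depth),
        "right": grow_tree([r for r, _ in right], [l for _, l in right], rng, depth + 1, max_depth),
    }

def tree_prob(tree, row):
    """Follow the splits down to a leaf and return its class probability."""
    while "p" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["p"]

def mixture_prob(trees, row):
    """Uniform mixture: average the leaf probabilities of all trees."""
    return sum(tree_prob(t, row) for t in trees) / len(trees)

# Made-up toy data: two hypothetical numeric features per protein,
# label 1 = member of the family of interest, label 0 = non-member.
X = [[0.1, 0.4], [0.2, 0.3], [0.3, 0.2], [0.4, 0.1],
     [0.6, 0.9], [0.7, 0.8], [0.8, 0.7], [0.9, 0.6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
trees = [grow_tree(X, y, rng) for _ in range(25)]
low = mixture_prob(trees, [0.05, 0.05])
high = mixture_prob(trees, [0.95, 0.95])
print(low, high)
```

Because each tree is a small set of threshold tests on named features, the splits that recur across the mixture can be read off directly, which is the sense in which such a model is interpretable. A weighted mixture (for example, weighting each tree by its held-out performance) would be a natural variation on the uniform average used here.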

Key words

Sequence–function relationships, functional prediction, decision trees, enzyme classification



This work is supported by the National Science Foundation under Grant No. 0133311 to Golan Yona.



Copyright information

© Humana Press, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Umar Syed (1)
  • Golan Yona (2, 3)
  1. Department of Computer Science, Princeton University, Princeton, USA
  2. Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA
  3. Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel
