
Computational Protein Function Prediction: Framework and Challenges

Protein Function Prediction for Omics Era

Abstract

Large-scale genome sequencing technologies are producing ever more experimental data that requires functional characterization. The gap between the number of available genomes and the completeness of their annotations keeps widening, which makes it impractical to curate genomes manually for functional information. To meet this growing challenge, we need computational techniques that can accurately predict functions for newly sequenced genomes. In this chapter we focus on the framework required for computational function annotation and on the challenges involved. Controlled vocabularies of functional terms, e.g. the Gene Ontology, the MIPS Functional Catalogue, and Enzyme Commission numbers, form the basis of prediction methods by capturing the available biological knowledge in a form suitable for computational processing. We review functional vocabularies in detail, along with the methods developed for quantitatively gauging the functional similarity between vocabulary terms. We also discuss two challenges in this area: first, erroneous annotations that persist and propagate in sequence databases, and second, limitations of the functional term vocabularies used for protein annotation. Lastly, we introduce community efforts to objectively assess the accuracy of function prediction.
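
To make the idea of gauging functional similarity between vocabulary terms concrete, the following is a minimal sketch of two widely used information-content measures, Resnik's and Lin's, computed over a small term hierarchy. The term names, the hierarchy, and the annotation counts are all hypothetical values chosen for illustration; they are not real Gene Ontology data, and the chapter itself may present these measures differently.

```python
import math

# Hypothetical toy vocabulary: each term maps to its parent terms (is_a edges).
# These are illustrative names, not real Gene Ontology identifiers.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}

# Hypothetical number of proteins annotated directly with each term.
DIRECT_COUNTS = {
    "root": 0, "binding": 2, "catalysis": 1,
    "dna_binding": 4, "rna_binding": 3, "kinase": 6,
}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t); p(t) is the share of annotations at t or below it."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        # An annotation to a term also counts for every ancestor of that term.
        for anc in ancestors(term):
            cumulative[anc] += count
    return {term: -math.log(c / total) for term, c in cumulative.items()}


IC = information_content()


def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC[t] for t in common)


def lin(t1, t2):
    """Lin similarity: shared IC normalized by the IC of both terms."""
    denom = IC[t1] + IC[t2]
    # Undefined when both terms carry zero IC; 0.0 is an arbitrary convention here.
    return 2 * resnik(t1, t2) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(resnik("dna_binding", "rna_binding"))
    print(lin("dna_binding", "rna_binding"))
    print(resnik("dna_binding", "kinase"))
```

In this sketch, two terms that share only the root receive a Resnik score of zero, because the root covers every annotation and therefore carries no information. Lin's variant rescales the shared information to the range 0 to 1, so a term compared with itself scores 1.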



Acknowledgements

MC is supported by grants from the Purdue Research Foundation and the Showalter Trust. DK also acknowledges grants from the National Institutes of Health (GM075004) and the National Science Foundation (DMS800568, EF0850009, IIS0915801).

Corresponding author

Correspondence to Daisuke Kihara.

Copyright information

© 2011 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Chitale, M., Kihara, D. (2011). Computational Protein Function Prediction: Framework and Challenges. In: Kihara, D. (eds) Protein Function Prediction for Omics Era. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0881-5_1
