Abstract
Large-scale genome sequencing technologies are rapidly increasing the volume of experimental data that requires functional characterization. The gap between the growing number of available genomes and the completeness of their annotations continues to widen, which makes manual curation of function information for every genome impractical. To address this challenge, we need computational techniques that can accurately predict functions for newly sequenced genomes. In this chapter we focus on the framework required for computational function annotation and the challenges involved. Controlled vocabularies of functional terms, e.g. the Gene Ontology, the MIPS Functional Catalogue, and Enzyme Commission numbers, form the basis of prediction methods by capturing the available biological knowledge in a form suitable for computational processing. We review functional vocabularies in detail, along with the methods developed for quantitatively gauging the functional similarity between vocabulary terms. We also discuss two challenges in this area: first, the erroneous annotations present in sequence databases, and second, the limitations of the functional term vocabularies used for protein annotation. Lastly, we introduce community efforts to objectively assess the accuracy of function prediction.
Acknowledgements
MC is supported by grants from the Purdue Research Foundation and the Showalter Trust. DK also acknowledges grants from the National Institutes of Health (GM075004) and the National Science Foundation (DMS800568, EF0850009, IIS0915801).
Copyright information
© 2011 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Chitale, M., Kihara, D. (2011). Computational Protein Function Prediction: Framework and Challenges. In: Kihara, D. (eds) Protein Function Prediction for Omics Era. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-0881-5_1
DOI: https://doi.org/10.1007/978-94-007-0881-5_1
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-0880-8
Online ISBN: 978-94-007-0881-5
eBook Packages: Biomedical and Life Sciences (R0)