1 Introduction

The Gene Ontology (GO) offers experimental and computational biology researchers an accessible range of controlled vocabulary annotations to describe protein function, allowing both detailed and large-scale analyses to be conducted. There is, however, a range of other sources of functional annotation which, in combination with GO, provide enhanced functional descriptions. Examples of such complementary resources include the Enzyme Commission’s classification of enzyme reactions [1], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [2], BRENDA [3], CSA [4], MACiE [5] and the MetaCyc database of enzymes and pathways [6], amongst many others. Most of these resources include GO terms within their own annotations, or their definitions are included within the Gene Ontology. Mapping terms between resources offers enhanced descriptions, and relationships between them, that are not readily captured solely within GO. The Gene Ontology provides many of these mappings through its website (http://geneontology.org/page/download-mappings); they are updated automatically, at intervals that depend on how often the corresponding resource is updated. This chapter describes some of these complementary resources, focusing mainly on enzymes.

2 Annotating Enzymes

With over 100 years of experimental biochemical data available, enzymes are one of the richest areas for complementary functional annotation. Historically, naming conventions for enzymes have been confused and haphazard, with several names being given to one enzyme and one name being given to several enzymes. Often the names convey little information about the reaction the enzyme catalyzes. This led to the development of the Enzyme Commission (E.C.) classification system by the International Commission on Enzymes, founded in 1956 by the International Union of Biochemistry [1]. The E.C. number is a hierarchical system consisting of four levels. The first level has six divisions giving a broad description of the overall chemical transformation (enzyme class): Oxidoreductases, Transferases, Hydrolases, Lyases, Isomerases and Ligases. The next two levels (subclass and sub-subclass) generally describe the reactive species and the type of bond being acted upon; the meaning of these numbers is class dependent. The final level is a serial number for the overall reaction within that sub-subclass. The overall reactions described are mass-balanced as far as possible, though they are not necessarily charge-balanced. Nor are they meant to represent the equilibrium position or the reaction direction: by convention, all reactions within a given sub-subclass are written in the same direction, even if their physiological directions differ. General reactions, where the enzyme has broad specificity, are given as single generic reactions, and alternative reactions with specific metabolites are also given. Some reactions are incomplete, while others are combinations of successive reactions [7]. Thus it is possible for one E.C. number to have multiple reactions associated with it and for one reaction to be assigned to several E.C. numbers (see Fig. 1a).
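The four-level hierarchy described above lends itself to a simple data structure. The following is a minimal Python sketch and not part of any E.C. resource: the names `ECNumber`, `parse_ec` and `shared_levels` are hypothetical, and treating `-` as an unspecified level is an assumed convention for partial E.C. numbers.

```python
from dataclasses import astuple, dataclass
from typing import Optional

# The six first-level enzyme classes listed in the text.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


@dataclass(frozen=True)
class ECNumber:
    enzyme_class: int
    subclass: Optional[int]
    sub_subclass: Optional[int]
    serial: Optional[str]  # kept as text; None when unspecified

    @property
    def class_name(self) -> str:
        return EC_CLASSES.get(self.enzyme_class, "unknown class")


def parse_ec(text: str) -> ECNumber:
    """Split 'a.b.c.d' into its four levels ('-' = unspecified, an assumption)."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {text!r}")

    def level(part: str) -> Optional[int]:
        return None if part == "-" else int(part)

    serial = None if parts[3] == "-" else parts[3]
    return ECNumber(int(parts[0]), level(parts[1]), level(parts[2]), serial)


def shared_levels(a: ECNumber, b: ECNumber) -> int:
    """Count the leading levels on which two E.C. numbers agree (0 to 4)."""
    count = 0
    for x, y in zip(astuple(a), astuple(b)):
        if x is None or x != y:
            break
        count += 1
    return count
```

Under these assumptions, `parse_ec("3.1.4.11").class_name` is `"Hydrolases"`, and `shared_levels(parse_ec("2.1.2.9"), parse_ec("2.1.2.11"))` is 3. The two entries share a sub-subclass, and the hierarchy alone says nothing further about how similar the underlying reactions are.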

Fig. 1 (a) Examples of ambiguity in the E.C. classification, where one E.C. number can represent many reactions and where many E.C. numbers are describing one reaction. (b) The two representations of the same enzyme (phosphoinositide phospholipase C) in E.C. and GO, with the overall chemical reaction also shown. The reaction diagram is highlighted to show sub-structures across the reaction used in the determination of bond changes and reaction centers in EC-Blast

Currently there are 6510 approved E.C. numbers, with 5560 of them in active use. Of these active E.C. numbers, only 3924 (70 %) have an equivalent GO term. A full list of E.C. to GO cross-references can be found on the GO website (http://geneontology.org/external2go/ec2go). There are a number of reasons why a mapping between E.C. and GO cannot be made. The most likely is that GO does not yet have a term that covers the E.C. number, e.g. E.C. 1.1.1.287 (D-arabinitol dehydrogenase); an automatic pipeline updates the cross-reference file after each GO release with any new terms that are created. Other reasons are that E.C. entries have been transferred from one number to another, or that the E.C. number has yet to be associated with a gene product (termed orphaned E.C. terms). Additionally, there are “pseudo” E.C. terms created by UniProt that describe an overall reaction derived from the literature but have yet to be included in the E.C. classification. These are easily identifiable as they have a letter n in the fourth level of the hierarchy, e.g. 1.1.1.n5 (3-methylmalate dehydrogenase).
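As an illustration of how such a cross-reference file might be consumed, the sketch below parses lines in what is assumed to be the external2go layout (`EC:<number> > GO:<term name> ; GO:<id>`, with comment lines starting with `!`) and flags the UniProt "pseudo" E.C. numbers by the letter n in the fourth level. The function names and the two sample lines are illustrative only; the real file should be checked before relying on this layout.

```python
import re
from collections import defaultdict

# Assumed line layout: "EC:1.1.1.1 > GO:term name ; GO:0000000"
LINE_RE = re.compile(
    r"^EC:(?P<ec>\S+)\s*>\s*GO:(?P<name>.+?)\s*;\s*(?P<go>GO:\d{7})\s*$"
)
# A "pseudo" E.C. number has 'n' followed by digits in the fourth level.
PROVISIONAL_RE = re.compile(r"^\d+\.\d+\.\d+\.n\d+$")

# Illustrative sample only, written to mimic the assumed layout.
SAMPLE = """\
! comment lines start with an exclamation mark
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
EC:3.1.4.11 > GO:phosphatidylinositol phospholipase C activity ; GO:0004435
"""


def parse_ec2go(lines):
    """Map each E.C. number to a list of (GO id, GO term name) pairs."""
    mapping = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blanks and comments
        match = LINE_RE.match(line)
        if match:
            mapping[match.group("ec")].append(
                (match.group("go"), match.group("name"))
            )
    return dict(mapping)


def is_provisional(ec_number: str) -> bool:
    """True for UniProt-style pseudo E.C. numbers such as 1.1.1.n5."""
    return bool(PROVISIONAL_RE.match(ec_number))


def coverage(mapped: int, active: int) -> float:
    """Fraction of active E.C. numbers that have a GO equivalent."""
    return mapped / active


ec_to_go = parse_ec2go(SAMPLE.splitlines())
```

With the counts quoted above, `coverage(3924, 5560)` evaluates to roughly 0.706, in line with the figure of about 70 % given in the text.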

Databases such as KEGG and BRENDA hold details of alternative reactions and data relating to physiological function. Other resources hold more specific functional annotations: the Catalytic Site Atlas (CSA) catalogs the catalytic residues and how they function in the overall reaction, while MACiE annotates the steps in an enzyme’s reaction, the order in which bonds are broken and formed, the role of cofactors and the function of protein residues at each step. To bridge the gap between these more chemical descriptors and the biological descriptors associated with a protein, a new ontology, the Enzyme Mechanism Ontology (EMO), has been developed [4]. Though not directly linked to GO, EMO terms can be determined through links with the GOA terms of the UniProtKB record for a particular enzyme.

3 Comparing Enzyme Annotations

Unlike GO, the E.C. number cannot be used to make automated quantitative comparisons between annotations. A number of measures of annotation similarity can be derived from the GO ontological graph. The most basic measure is based on the length of the path that two terms share to the ontology root. Because the depth of a term within the ontology is not necessarily indicative of its specificity, this has been enhanced with a measure of specificity termed information content (IC). Further enhancements normalize the IC measure (Lin score) and use semantic similarity (Wang score) [8, 9]. To overcome the deficiencies of E.C. as a means of measuring functional similarity, and to capture detailed reaction information not encapsulated in GO, new methods have been developed. Efforts to compare reactions based on their overall reaction chemistry have met with only moderate success, limited by their reliance upon the consistency and reliability of the underlying reaction data and by the ability of the algorithm used to process a diverse range of reactions. The latest method, EC-Blast [10], has proven more successful. It uses an atom-atom mapping approach to automatically assign bond changes and reaction centers (the atom and bond types in the immediate region of the metabolite where bonds are broken or formed). This allows the reaction to be described by a set of fingerprints that, in combination, can be used to compare reactions. Taking all E.C. numbers and equivalent GO terms that can be compared to each other, the difference between the two ways of measuring functional similarity is shown in Fig. 2. Though many comparisons result in similar scores, a substantial number diverge significantly. For example, when E.C. 2.1.2.9 is compared to E.C. 2.1.2.11 on the basis of bond order changes, the similarity score calculated by EC-Blast is 0.22, whereas the semantic similarity between the equivalent GO terms is 0.73. The low similarity from EC-Blast reflects the differences in bonds cleaved (two C-N bonds and two H-N bonds for E.C. 2.1.2.9, compared to one C-C, one H-O and one C-H for E.C. 2.1.2.11), as well as differences in stereochemistry changes and bond order rearrangements. Thus, care needs to be taken in choosing the best measure of functional similarity, a widely used technique in functional inference (see Chap. 12 [26]).
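To make the information content idea concrete, the following sketch computes IC-based similarities on a small invented ontology. It uses the standard definitions IC(t) = -log p(t), with p(t) the fraction of annotations at or below t, and the Lin score 2·IC(MICA) / (IC(a) + IC(b)), where MICA is the most informative common ancestor. The term names and annotation counts are made up for illustration, and the Wang method is not shown.

```python
import math

# Toy ontology (hypothetical): each term lists its direct parents.
PARENTS = {
    "root": [],
    "catalytic": ["root"],
    "binding": ["root"],
    "transferase": ["catalytic"],
    "hydrolase": ["catalytic"],
}
# Invented numbers of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0, "catalytic": 10, "binding": 50, "transferase": 20, "hydrolase": 20,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / count(t)); assumes every non-root term is annotated."""
    totals = dict.fromkeys(PARENTS, 0)
    for term, n in DIRECT_COUNTS.items():
        for anc in ancestors(term):  # propagate counts up the graph
            totals[anc] += n
    return {t: math.log(totals["root"] / totals[t]) for t in PARENTS}


def resnik_similarity(a, b, ic):
    """IC of the most informative common ancestor of a and b."""
    return max(ic[t] for t in ancestors(a) & ancestors(b))


def lin_similarity(a, b, ic):
    """Resnik score normalized by the IC of the two terms (range 0 to 1)."""
    denominator = ic[a] + ic[b]
    if denominator == 0:  # both terms carry no information (e.g. the root)
        return 1.0 if a == b else 0.0
    return 2 * resnik_similarity(a, b, ic) / denominator


IC = information_content()
```

In this toy graph `transferase` and `hydrolase` share the parent `catalytic`, so their Lin score is greater than zero, whereas `transferase` and `binding` share only the root and score 0. Depth alone would not distinguish these cases as cleanly, which is the motivation for IC given above.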

Fig. 2 Differences between GO and EC measures of functional similarity. A frequency histogram showing the difference between the similarity scores of all-by-all pairs of E.C. numbers calculated using the EC-Blast bond similarity measure and those of the equivalent GO terms. GO similarity scores are calculated using the Wang semantic similarity method. Not all E.C. numbers are used, because EC-Blast requires fully balanced reactions and not all E.C. numbers have a GO term equivalent
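The comparison summarized in Fig. 2 can be sketched as follows, assuming each method yields a score per unordered pair of E.C. numbers. Only pairs scored by both methods are kept, mirroring the restriction noted in the caption. The function names, the sign convention (GO minus EC-Blast) and the bin width are arbitrary choices made here; the first pair and its two scores are the example quoted in the text, and the second pair is invented.

```python
import math
from collections import Counter


def score_differences(ec_blast_scores, go_scores):
    """Return GO score minus EC-Blast score for pairs scored by both methods."""
    shared_pairs = ec_blast_scores.keys() & go_scores.keys()
    return {pair: go_scores[pair] - ec_blast_scores[pair] for pair in shared_pairs}


def histogram(values, bin_width=0.1):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter()
    for value in values:
        # Rounding first guards against float noise just below a bin edge.
        counts[math.floor(round(value / bin_width, 9))] += 1
    return {round(index * bin_width, 9): n for index, n in sorted(counts.items())}


# Unordered E.C. pairs as keys. The invented pair has no GO score,
# so it is dropped from the comparison.
pair_from_text = frozenset({"2.1.2.9", "2.1.2.11"})
invented_pair = frozenset({"X.X.X.1", "X.X.X.2"})
ec_blast = {pair_from_text: 0.22, invented_pair: 0.40}
go_wang = {pair_from_text: 0.73}

differences = score_differences(ec_blast, go_wang)
bins = histogram(differences.values())
```

Keying the dictionaries by `frozenset` makes the pair (a, b) identical to (b, a), which avoids counting each comparison twice in an all-by-all analysis.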

4 Annotating Domains

One of the challenges of functional annotation is the granularity at which an annotation can be attached. Most genomic annotations are assigned to whole protein translations, i.e. the gene, but for many functions it is a protein domain that can be considered the functional unit. Of course, functions are not solely confined to a single domain, and many functions are a product of multiple domains in combination. Many domains are combined with others in increasingly complex combinations and arrangements (see Fig. 3). This biological complexity adds considerable complexity to functional annotation, where one function can be assigned to the complete gene product and other functional annotations to just one component domain or to a multi-domain combination. There are a number of domain and motif databases that provide functional annotations, many of which are mapped to GO via the InterPro [11] protein family database, which integrates predictive models from a range of different protein family databases. One of the main sequence-based protein domain family databases is PFam [12], which has the goal of creating a collection of functionally annotated families that is as representative as possible of protein-sequence space. PFam curators provide functional annotations, but in recent releases these annotations have been outsourced to the community via Wikipedia, allowing anyone to freely edit and improve the content, with the original curator annotations maintained. By their very nature these annotations do not conform to a controlled vocabulary, but PFam annotations can be mapped back to GO terms; this mapping is provided by the InterPro group and is available via the GO website.

Fig. 3 Biological complexity generated by multi-domain architectures. A force-directed graph of the multi-domain architectures associated with a domain superfamily (“winged helix” repressor DNA binding domain). The graph is centered on architecture containing just the single domain with nodes (red boxes) radiating from this representing ever-increasing multi-domain architecture (shown to the right of the node). A key to the domains in these multi-domain architectures is shown on the left identified by PFam codes (starting PF or PB) or CATH codes. Functions are associated with the whole gene product as well as for single domains within the multi-domain architecture. An interactive version of this graph can be found at http://www.funtree.info/templates/showArch.php?cathcode=00001.00010.00010.00010&cathmethod=&cathcluster=&type=AS

The CATH [13] resource uses protein structures to define domains, both within known protein structures and within sequences for which there is no structural information. It uses the GO terms associated with a sequence to define functionally coherent clusters (termed FunFams) within the superfamily division of the classification. The functional annotation provided is derived from the predominant GO term found within the FunFam. These terms, though, are assigned to the whole sequence and not the domain, and therefore may not directly relate to the specific function the domain is participating in. In the SFLD [14], domains that are critical for function are determined (often being used to define the superfamily), thereby linking the functional annotation to a domain or combination of domains within a multi-domain architecture (see Chap. 9 [27]). SUPERFAMILY [15], a domain-centric resource that uses an alternative structure-based domain classification called SCOP, attempts to assign functional annotations specifically to a domain. Using the GO semantic structure and the protein’s multi-domain architecture, domain-centric functional annotations are statistically inferred, based on the assumption that if a GO term is annotated to proteins that contain a shared domain then that term should also confer functional indicators for that domain. The SUPERFAMILY developers have generated a reduced version of GO for annotating domains, which forms part of a structural domain functional ontology (SDFO) [16]. The approach of linking ontological terms to a domain can be generalized to other ontologies, most notably for phenotypic annotations. For example, SUPERFAMILY integrates the Mammalian Phenotype Ontology (MPO) [17] from Mouse Genome Informatics (MGI) and the Human Phenotype Ontology (HPO) from the Online Mendelian Inheritance in Man (OMIM) [18] resource.
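One simple reading of "predominant GO term" is the term annotated to the largest number of member sequences in a cluster. The sketch below implements that reading only; it is an assumption about the idea, not the CATH procedure, and the function name, sequence identifiers and tie-breaking rule are hypothetical.

```python
from collections import Counter


def predominant_term(cluster):
    """Return (GO id, fraction of sequences carrying it), or None if unannotated.

    `cluster` maps a sequence identifier to the GO ids annotated to the
    whole sequence. Ties are broken by GO id so the result is deterministic.
    """
    counts = Counter(
        term for terms in cluster.values() for term in set(terms)
    )
    if not counts:
        return None
    term, n = max(counts.items(), key=lambda item: (item[1], item[0]))
    return term, n / len(cluster)


# Hypothetical cluster: three sequences with whole-sequence GO annotations.
toy_cluster = {
    "seq_A": {"GO:0003677", "GO:0003700"},
    "seq_B": {"GO:0003677"},
    "seq_C": {"GO:0016301"},
}
best = predominant_term(toy_cluster)
```

Here `best` is `("GO:0003677", 2/3)`. Because the counted terms belong to whole sequences, the winning term may describe a different domain of a multi-domain protein, which is the caveat raised in the text.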

5 Pathways and Interactions

Individual components of a pathway, or groups of interacting proteins, are described by the molecular function set of GO terms, while the pathways and interactions these components participate in are captured in the biological process GO terms. These provide overall descriptions of a biological process, such as signal transduction, or more specific terms such as thiamine metabolism. GO does not try to represent the dynamics or dependencies that are equivalent to a signaling or metabolic pathway, though the GO consortium has recognized the importance of contextualizing gene product annotations and has begun to add some directional information (see Chap. 17 [28]). To put the components into the context of a metabolic pathway, for example, specialist databases such as KEGG, BioCarta, MetaCyc, the Pathway Interaction Database [19] and Reactome [20] are required (see Table 1). These provide curated and computationally derived descriptions of overall topologies and interactions, often displayed as pathway diagrams and maps. Many of these data resources are able to map terms back to GO. IntAct [21], a molecular interaction database whose data are curated from the literature or supplied by data depositors, scores and filters interaction evidence to generate a high-confidence subset of molecular interactions that is exported to GO.

Table 1 A summary of the data resources mentioned

Combinations of GO terms and pathway/interaction databases can be used in the analysis of proteomics data for functional annotation. This can be achieved either by using methods for GO enrichment analysis and subsequently linking the results to external pathway resources [22], or by dynamically constructing the pathway/interaction network from the gene list of interest to create a functionally organized GO/pathway term network [23]. Additionally, participation in common biological processes, or sharing of molecular functions, is predictive of interaction between proteins [24]. Many methods that combine semantic similarity and machine learning techniques have been developed to use GO to predict protein–protein interactions (PPIs) (see ref. 25 and references therein).
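GO enrichment analysis is commonly framed as a one-sided hypergeometric test, although the text does not specify which statistic the cited tools use. As a minimal sketch under that assumption, the function below (a hypothetical name) gives the probability of seeing at least k genes annotated with a term in a list of n genes, when K of the N background genes carry the term.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """One-sided hypergeometric tail probability P(X >= k).

    k: genes in the list of interest annotated with the GO term
    n: size of the list of interest
    K: genes in the background annotated with the GO term
    N: size of the background
    """
    if not (0 <= k <= n <= N and 0 <= K <= N):
        raise ValueError("counts are inconsistent")
    upper = min(n, K)
    # Sum the probability of every outcome at least as extreme as k.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)
```

For instance, with a background of 10 genes of which 4 carry the term, drawing 3 genes and finding at least 2 annotated has probability `enrichment_p_value(2, 3, 4, 10)`, which is 40/120. In practice many terms are tested at once, so the resulting p-values would normally need a multiple-testing correction before terms are linked to external pathway resources.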

6 Conclusions

The Gene Ontology provides a rich set of ontological terms to describe many aspects of a protein’s function. Many of these terms have equivalents in more specialist resources that, like the Gene Ontology, collate primary data derived from the literature. Often these resources include functional annotations that are not directly captured in GO, or allow annotations to be collated around a different functional unit, as in the case of protein domain-centered functional annotations. Other types of functional descriptor, such as the dependencies in metabolic pathways and protein–protein interactions, are not explicitly captured in GO (though this is currently being addressed through GO annotation extensions); in combination with other resources, however, GO can be used to provide and enhance the functional annotation of proteins.