1 Introduction

Functional annotation of gene products using the GO has greatly simplified the task of finding the functional roles of individual gene products and of groups of gene products. It has enabled a multitude of analyses that were previously not possible. For example, GO annotations are invaluable for analyzing a list of genes identified as differentially expressed in a microarray experiment, using one of the many freely available functional enrichment programs [1, 2] (see also Chap. 13 [3]).

The original, simple GO annotation pairs a gene product with a GO term (one of biological process, molecular function or cellular component). Because these pair-wise associations are treated independently, vast amounts of correlated functional data are omitted from the basic GO annotation and are therefore inaccessible to network and pathway analyses. This contextual information is essential for understanding the physiological roles of gene products. Without it, bioinformatics analyses cannot identify gene products that perform a role only under certain conditions or in the presence of specific factors, and therefore present an incomplete view of the available data [4]. A given gene product will often have different biological roles in different cells or tissues, because these roles depend on the available interacting partners; tissue-specific network analyses already demonstrate the importance of the cellular environment. For example, Greene et al. [5] analyzed the GO and pathway annotations of the available interaction partners of the transcription factor LEF1 in different tissue types. They demonstrated that LEF1 was significantly associated with biological processes that were relevant to each tissue type. For instance, in blood vessels the LEF1 interacting partners were associated with angiogenesis, whereas in the hypothalamus they were associated with hypothalamus development.

Here we describe an incremental extension of the GO annotation format to allow more detailed statements about gene product function, which will benefit all types of functional analyses [5].

2 Extending the Core GO Annotation Model

In practical terms, the newly introduced annotation extensions field enables curators to provide appropriate experimentally evidenced contextual information for manually curated annotations (extant software pipelines for electronically inferred annotations (IEA) do not yet support population of this field).

Generating a comprehensive annotation, one that includes its context, involves refining the core pair-wise association with additional relationships to other ontology classes [5]. This dynamic approach is logically equivalent to creating a new term for the subtype in the ontology, but offers advantages in terms of both flexibility and efficiency.

In essence this approach allows curators to dynamically create “virtual” terms. It enables curators to combine all of the specific terms needed to fully describe a gene product in a way that can be reproducibly and computationally interpreted. For example, “core RNA polymerase binding transcription factor in hypothalamus” associates a gene product with that activity occurring in that specific location. From the perspective of computational logic, this effectively creates a subclass of “core RNA polymerase binding transcription factor activity” (GO:0000990). The flexibility of expression thus supports the virtual creation of complex, compound child terms on an as-needed basis. Additionally, this approach to virtual term creation is immediate. Because the parent can be automatically inferred from the primary term of the association, and because the additional relationships to other terms provide the refinements needed to create a more specific term, the previously independent processes of annotating gene products and creating ontology terms are now fully integrated. With annotation extensions, curators can make the required biological statement immediately, rather than waiting for a new term to become available in the ontology and then returning to update the annotation, which makes the overall process more efficient. Although these virtual terms are not added to the ontology as a matter of course (they could be if required), the extended annotations can be “folded” to create the logical equivalent of a GO term [5]. The GO Consortium (GOC) is in the process of incorporating these inferred annotations into the files it provides, so this contextual information will be included by default for use by anyone, or any analysis tool, that utilizes the annotation files.
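The idea of folding can be illustrated with a minimal sketch that renders a primary GO term and its extensions as a single class expression. The textual form used here ("term and (relation some entity)") is an assumption chosen for readability, not the representation used by the GOC, and the function name fold is hypothetical. The identifiers are taken from the Dhfr example in Subheading 4.2.

```python
def fold(primary_term, extensions):
    """Render a primary GO term plus (relation, entity) pairs as one expression.

    The output syntax is an illustrative assumption, loosely modeled on
    "genus and differentia" class expressions; it is not an official format.
    """
    parts = [primary_term]
    for relation, entity in extensions:
        parts.append(f"({relation} some {entity})")
    return " and ".join(parts)


# Dihydrofolate reductase activity that occurs in a neuron.
virtual_term = fold("GO:0004146", [("occurs_in", "CL:0000540")])
print(virtual_term)  # GO:0004146 and (occurs_in some CL:0000540)
```

The primary term acts as the parent of the virtual term, and each extension narrows it further.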

3 Annotation Extension Format

Annotation extensions refine the GO term used in the basic annotation by adding one or more relational expressions (extensions). Each extension is written as Relation(Entity), where Relation is a label describing the relationship between the GO term and the entity, and Entity is an identifier for a database object or ontology term, for example part_of(GO:0005634), where GO:0005634 is the Gene Ontology identifier for “nucleus”.
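As a minimal sketch, a single Relation(Entity) expression can be split into its two parts with a regular expression. The pattern below assumes that relation labels consist of letters, digits, and underscores and that entity identifiers contain no whitespace or parentheses; the names Extension and parse_extension are illustrative only.

```python
import re
from typing import NamedTuple


class Extension(NamedTuple):
    relation: str
    entity: str


# A relation label followed by one identifier in parentheses,
# for example part_of(GO:0005634).
_EXTENSION_RE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\(\s*([^()\s]+)\s*\)\s*$")


def parse_extension(text: str) -> Extension:
    """Split one Relation(Entity) expression into relation and entity."""
    match = _EXTENSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a Relation(Entity) expression: {text!r}")
    return Extension(match.group(1), match.group(2))


ext = parse_extension("part_of(GO:0005634)")
print(ext.relation, ext.entity)  # part_of GO:0005634
```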

Relations can be one of two types: “molecular relations” that are used with entities such as a gene, gene product, complex, or chemical and “contextual relations” that are used with entities such as a cell type, anatomy term, developmental stage, or a GO term.

In order to clearly define the semantics of the extensions, rules have been implemented defining what types of entity identifiers may be used with each relation. Generally, curators may only use contextual relations (e.g., where and when) with terms from the Cell Type Ontology (CL) [6], Uber Anatomy Ontology (Uberon) [7], Plant Ontology (PO) [8], nematode life stages (WBls) [9] and certain GO terms, and molecular target relations may only apply to a physical entity such as a gene product (e.g., UniProtKB [10] or PomBase [11]), a macromolecular complex (e.g., Intact Complex Portal [12]), or a chemical using a ChEBI [13] identifier. Curation tools can incorporate these rules to prevent invalid annotations from being created. Table 1 shows the most commonly used relations with examples of their usage.

Table 1 Most commonly used relationships for annotation extension statements and examples of their usage

4 Improved Expressiveness of GO Annotations: Examples

4.1 Targets of an Enzyme

One means of adding value to a GO annotation using annotation extensions is to specify the molecular target of an enzyme activity. The inability to add effector–target relationships has been a major limitation of the core GO annotation model; with this addition we can now begin to provide directional information that can be used for network and pathway analyses. Take as an example the annotation of human mitogen-activated protein kinase-activated protein kinase 2 (MAPKAP-K2), which was shown to phosphorylate the CapZ-interacting protein (CapZIP) [14]. A basic GO annotation would describe MAPKAP-K2 as a protein serine/threonine kinase:

Gene product:

UniProtKB:P49137 (human MAPKAP-K2)

GO term:

GO:0004674 (protein serine/threonine kinase activity)

Using an annotation extension, a curator can add more detail as follows:

Gene product:

UniProtKB:P49137 (human MAPKAP-K2)

GO term:

GO:0004674 (protein serine/threonine kinase activity)

Extension:

has_direct_input(UniProtKB:Q6JBY9) (human CapZIP)

N.B. The trailing descriptive phrases in parentheses, such as (human CapZIP), are not part of the syntax but are added to aid interpretation by the reader.

The extended GO annotation describes MAPKAP-K2 as a protein serine/threonine kinase that can phosphorylate CapZIP. This is vital information that can be used to link together the processes and pathways in which MAPKAP-K2 and CapZIP, and any further targets of these proteins, are involved. The rules of usage for has_direct_input are that the primary GO term should be a Biological Process or Molecular Function (in this example it is a Molecular Function) and that the entity in the extension should be a gene product, macromolecular complex, or chemical (in this example it is a gene product, i.e., a protein). Note that has_direct_input was used here instead of has_input because the paper provided evidence that MAPKAP-K2 acted directly on the substrate CapZIP; had there been a possibility of an intermediate molecule in this reaction, has_input would have been used.

4.2 Anatomical Location of a Gene Product’s Function

An annotation can be extended to specify the locational context in which a gene product performs its roles. It is important to note that we intend only to capture those locations that are physiologically relevant to the organism and not the experimental conditions under which the observation was made.

The rat protein dihydrofolate reductase (Dhfr) was shown to reduce dihydrofolic acid to tetrahydrofolic acid in rat neurons [15]. From this evidence a basic GO annotation could be made as follows:

Gene product:

UniProtKB:Q920D2 (rat Dhfr)

GO term:

GO:0004146 (dihydrofolate reductase activity)

By extending the annotation the curator can also specify in which cell type this activity occurs:

Gene product:

UniProtKB:Q920D2 (rat Dhfr)

GO term:

GO:0004146 (dihydrofolate reductase activity)

Extension:

occurs_in(CL:0000540) (neuron)

This annotation now provides the physiologically relevant information that Dhfr is active in neurons. The rules for occurs_in are that the primary GO term used must be a Biological Process or Molecular Function (in this example it is a Molecular Function); additionally the entity in the extension must be a cell type, anatomical feature, or GO Cellular Component (in this example it is an identifier from the Cell Type Ontology).

4.3 Timing-Specific Location of a Gene Product

A gene product’s annotation may be made more specific by including the appropriate developmental stage. An example is the location of the C. elegans PAXT-1 protein, which is located in the nucleus during the embryo stage [16]. Using the basic GO annotation format, a curator might indicate that PAXT-1 is located in the nucleus:

Gene product:

UniProtKB:Q21738 (C. elegans PAXT-1)

GO term:

GO:0005634 (nucleus)

By extending the annotation the curator can also specify when this localization occurs:

Gene product:

UniProtKB:Q21738 (C. elegans PAXT-1)

GO term:

GO:0005634 (nucleus)

Extension:

exists_during(WBls:0000003) (embryo)

This annotation means that PAXT-1 is located in the nucleus during the C. elegans embryo stage. The rules for exists_during are that the primary GO term should be a Cellular Component and that the entity in the extension should be a developmental stage or a GO Biological Process; in this case the entity is from the C. elegans life stage ontology.
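The usage rules stated for has_direct_input, occurs_in, and exists_during in the three examples above can be written down as a small lookup table, which is roughly how a curation tool might reject invalid combinations. This is a sketch under stated assumptions: the table covers only these three relations, the entity-kind labels are hypothetical strings that the caller must supply (a real tool would derive them from the identifier and its source ontology), and check_extension is an illustrative name.

```python
BP, MF, CC = "biological_process", "molecular_function", "cellular_component"

# Rules transcribed from the text: allowed aspects of the primary GO term
# and allowed kinds of entity for each relation.
RULES = {
    "has_direct_input": {
        "aspects": {BP, MF},
        "entities": {"gene_product", "macromolecular_complex", "chemical"},
    },
    "occurs_in": {
        "aspects": {BP, MF},
        "entities": {"cell_type", "anatomical_feature", "go_cellular_component"},
    },
    "exists_during": {
        "aspects": {CC},
        "entities": {"developmental_stage", "go_biological_process"},
    },
}


def check_extension(relation, term_aspect, entity_kind):
    """Return a list of rule violations; an empty list means the usage is valid."""
    rule = RULES.get(relation)
    if rule is None:
        return [f"no rule recorded for relation {relation!r}"]
    problems = []
    if term_aspect not in rule["aspects"]:
        problems.append(f"{relation} cannot extend a {term_aspect} term")
    if entity_kind not in rule["entities"]:
        problems.append(f"{relation} cannot take a {entity_kind} entity")
    return problems


# PAXT-1: nucleus (Cellular Component) extended with a life-stage term.
print(check_extension("exists_during", CC, "developmental_stage"))  # []
# The same relation on a Molecular Function term breaks the rule.
print(check_extension("exists_during", MF, "developmental_stage"))
```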

4.4 Multiple Relational Expressions

If several contextual statements can be made for the gene product, it is possible to combine relational expressions to make even more complex statements. Relational expressions can be separated by commas “,” (meaning AND) or by pipes, “|” (meaning OR), depending on whether the conditions in the statement are co-occurring (AND) or independent (OR).

The human microRNA miR-145 provides an example of the application of multiple annotation extensions. MiR-145 was shown to directly bind and silence the POU5F1 transcription factor, among others, causing inhibition of embryonic stem cell division [17]. This evidence could therefore be represented by two basic GO annotations as follows:

Gene product:

RNACentral:URS0000527F89_9606 (human miR-145)

GO term:

GO:1903231 (mRNA binding involved in posttranscriptional gene silencing)

Gene product:

RNACentral:URS0000527F89_9606 (human miR-145)

GO term:

GO:1904676 (negative regulation of somatic stem cell division)

Using relational expressions, separated by commas, we can make one extended annotation as follows:

Gene product:

RNACentral:URS0000527F89_9606 (human miR-145)

GO term:

GO:1903231 (mRNA binding involved in posttranscriptional gene silencing)

Extension:

has_direct_input(Ensembl:ENSG00000204531), occurs_in(CL:0002322), part_of(GO:1904676) (human POU5F1, embryonic stem cell, negative regulation of somatic stem cell division)

The extended annotation signifies that miR-145 directly binds POU5F1 mRNA and silences its expression as part of the inhibition of somatic stem cell division in embryonic stem cells. Again, this contextual information is essential when analyzing the physiological relevance of the role of a gene product in a pathway.

Although the use of a pipe (|) to indicate independent contextual statements does not provide any additional expressivity to the statements already made, it allows a curator to capture several statements from the same evidence within a paper. An example is when specifying the multiple substrates of an enzyme—the enzyme may act on each of the substrates independently, but not all at the same time; therefore, the substrates can be listed in the extension separated by pipe symbols:

Gene product:

UniProtKB:O14522 (human PTPRT)

GO term:

GO:0004725 (protein tyrosine phosphatase activity)

Extension:

has_direct_input(UniProtKB:P12830) | has_direct_input(UniProtKB:O60716) (E-cadherin|CTNND1)

This annotation indicates that the receptor protein tyrosine phosphatase rho (PTPRT) dephosphorylates E-cadherin and CTNND1, but not necessarily both simultaneously. It would be equally correct to create two separate annotations each with a single substrate in the extension.
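The comma and pipe conventions can be sketched as a two-level split: pipes separate independent alternatives (OR), and commas separate expressions that hold together (AND). The code assumes that identifiers never contain commas, pipes, or parentheses, and the function name parse_extension_field is illustrative.

```python
import re

_EXPRESSION_RE = re.compile(r"^([A-Za-z_]\w*)\(([^()\s]+)\)$")


def parse_extension_field(field):
    """Parse an extension field into a list of alternatives (OR).

    Each alternative is a list of (relation, entity) pairs that apply
    together (AND).
    """
    if not field.strip():
        return []
    alternatives = []
    for group in field.split("|"):
        conjunction = []
        for expression in group.split(","):
            match = _EXPRESSION_RE.match(expression.strip())
            if match is None:
                raise ValueError(f"malformed expression: {expression!r}")
            conjunction.append((match.group(1), match.group(2)))
        alternatives.append(conjunction)
    return alternatives


mir145 = parse_extension_field(
    "has_direct_input(Ensembl:ENSG00000204531), occurs_in(CL:0002322), part_of(GO:1904676)"
)
ptprt = parse_extension_field(
    "has_direct_input(UniProtKB:P12830) | has_direct_input(UniProtKB:O60716)"
)
print(len(mir145), len(mir145[0]))  # 1 3: one alternative with three conditions
print(len(ptprt), len(ptprt[0]))    # 2 1: two alternatives with one condition each
```

Under this reading, splitting the PTPRT annotation into two separate annotations simply produces one single-alternative field per substrate.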

5 Practical Use of Extended Annotations

There are likely to be many use cases for extended annotations, including some we have not yet envisioned. Users will be able to perform more advanced queries with the available functional data, such as filtering on the subcellular, cellular or anatomical locations in which a gene product performs its roles, or finding which genes a transcription factor regulates in a specified cell type. Annotation extensions can also help create functional networks through the use of directional relationships such as has_input and has_direct_input, which allow specification of the target of an effector, for example in a signaling pathway or the substrates of a metabolic enzyme activity.
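One way such a functional network might be assembled is to turn every has_input or has_direct_input expression into a directed edge from the annotated gene product to its target. The sketch below uses the extensions from the MAPKAP-K2, PTPRT, and Dhfr examples in Subheading 4; the function name effector_target_edges and the (gene product, extension) input layout are assumptions made for illustration.

```python
import re

TARGET_RELATIONS = {"has_input", "has_direct_input"}
_EXPRESSION_RE = re.compile(r"([A-Za-z_]\w*)\(([^()\s]+)\)")


def effector_target_edges(annotations):
    """Return sorted (effector, relation, target) edges from extension fields.

    Comma (AND) and pipe (OR) separators are ignored here because every
    listed target yields an edge either way.
    """
    edges = set()
    for gene_product, extension_field in annotations:
        for relation, entity in _EXPRESSION_RE.findall(extension_field):
            if relation in TARGET_RELATIONS:
                edges.add((gene_product, relation, entity))
    return sorted(edges)


annotations = [
    ("UniProtKB:P49137", "has_direct_input(UniProtKB:Q6JBY9)"),
    ("UniProtKB:O14522",
     "has_direct_input(UniProtKB:P12830)|has_direct_input(UniProtKB:O60716)"),
    ("UniProtKB:Q920D2", "occurs_in(CL:0000540)"),  # contextual only, no edge
]
for effector, relation, target in effector_target_edges(annotations):
    print(f"{effector} -[{relation}]-> {target}")
```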

Without contextual detail, bioinformatics analyses of gene products involved in a specified process cannot distinguish, for example, between those gene products that are active only in a particular cell type and those that are inactive or absent from that cell type, therefore creating a bias in the interpretation of the data. With extended annotations any differences in the active components of a process or pathway between various cell or tissue types can be determined.
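A context filter of this kind can be sketched as follows. The code selects gene products annotated to a given GO term whose extension states that the activity occurs in a given cell type. It matches identifiers exactly and does not consider child terms in either ontology, which a real analysis would need to do; the function name active_in and the record layout are assumptions.

```python
def active_in(annotations, go_id, cell_type):
    """Return gene products annotated to go_id with occurs_in(cell_type)."""
    wanted = f"occurs_in({cell_type})"
    hits = []
    for gene_product, term, extension_field in annotations:
        if term != go_id:
            continue
        # Any OR-alternative that contains the wanted expression counts.
        for alternative in extension_field.split("|"):
            expressions = [e.strip() for e in alternative.split(",")]
            if wanted in expressions:
                hits.append(gene_product)
                break
    return hits


annotations = [
    ("UniProtKB:Q920D2", "GO:0004146", "occurs_in(CL:0000540)"),
    ("RNACentral:URS0000527F89_9606", "GO:1903231",
     "has_direct_input(Ensembl:ENSG00000204531), occurs_in(CL:0002322), part_of(GO:1904676)"),
]
print(active_in(annotations, "GO:0004146", "CL:0000540"))  # ['UniProtKB:Q920D2']
print(active_in(annotations, "GO:0004146", "CL:0002322"))  # []
```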

5.1 Access

Extended annotations are available for download in the current GO annotation files, both in the GAF2.0 format (column 16; http://www.geneontology.org/GO.format.gaf-2_0.shtml) and in the Gene Product Association Data format (GPAD column 11; http://www.geneontology.org/GO.format.gpad.shtml). These files can be accessed from the GOC website (http://geneontology.org/GO.downloads.annotations.shtml) and the GOA website (http://www.ebi.ac.uk/GOA/downloads).
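As a sketch of how the extension column might be read from a downloaded GAF 2.0 file, the code below skips header lines beginning with "!" and returns rows that have a value in column 16. The positions of the database, object identifier, and GO identifier (columns 1, 2, and 5) follow the GAF 2.0 layout as we understand it and should be checked against the format documentation. The sample rows are illustrative and leave most columns empty, which a real file would not.

```python
import csv
import io

EXTENSION_COLUMN = 16  # 1-based GAF 2.0 column holding the annotation extension


def iter_extended_annotations(handle):
    """Yield (gene product, GO ID, extension) for rows with an extension."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue  # blank line or header/comment line
        if len(row) < EXTENSION_COLUMN:
            continue  # too few columns to carry an extension
        extension = row[EXTENSION_COLUMN - 1].strip()
        if extension:
            # Assumed positions: column 1 database, column 2 object ID, column 5 GO ID.
            yield f"{row[0]}:{row[1]}", row[4], extension


# Illustrative rows only: unused columns are left empty for brevity.
with_extension = [""] * 17
with_extension[0], with_extension[1], with_extension[4] = "UniProtKB", "P49137", "GO:0004674"
with_extension[15] = "has_direct_input(UniProtKB:Q6JBY9)"
without_extension = list(with_extension)
without_extension[15] = ""
sample = (
    "!gaf-version: 2.0\n"
    + "\t".join(with_extension) + "\n"
    + "\t".join(without_extension) + "\n"
)

for record in iter_extended_annotations(io.StringIO(sample)):
    print(record)
```

For a GPAD file the same approach would apply with EXTENSION_COLUMN set to 11, although the other column positions differ and would need to be adjusted.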

Extended annotations can be accessed on the web via the GO browsers QuickGO ([18]; www.ebi.ac.uk/QuickGO-Beta) and AmiGO 2 ([19]; http://amigo.geneontology.org/amigo/). Both browsers allow users to filter annotation sets based on the contents of the annotation extension. The display of extended annotations may differ depending on the resource (Fig. 1), but the GO annotation files contain the plain-text extension, since this is better suited to computational analysis (see also Chap. 11 [20]). Any questions on how to access or use extended annotations should be directed to the GOC helpdesk (http://geneontology.org/form/contact-go).

Fig. 1 Display of extended annotations in (a) the beta version of the EBI GO browser QuickGO, (b) AmiGO 2, and (c) PomBase (http://www.pombase.org/)

5.2 Exercise

The addition of extended annotations to Gene Ontology datasets enables users to perform sophisticated queries. This exercise will demonstrate how to build such a query in the GOC browser AmiGO 2, namely, to provide all of the gene products from S. pombe that are located in the spindle midzone during mitotic anaphase.

  1. Open the AmiGO 2 browser (http://amigo.geneontology.org/amigo/).

  2. Click on the Advanced Search button and select “Annotations” from the drop-down list.

  3. In the free-text filtering box on the left (Fig. 2a) type in GO:0051233, the GO identifier for the Cellular Component term “spindle midzone”.

  4. Now open the Taxon menu on the left and click on the “more” button at the bottom. A pop-up menu will open; in the top filter box start typing “pombe”. “Schizosaccharomyces pombe” should be the only option that appears. Click on the + next to the species name to add this to the filter.

  5. Now open the Annotation Extension menu on the left and click on the + button next to the term “mitotic anaphase” to add this to the filter.

  6. AmiGO 2 will display all of the annotations that use the “mitotic anaphase” term (or one of its child terms) in the annotation extension of a primary annotation to “spindle midzone” (or one of its child terms) (Fig. 2b).

Fig. 2 Finding annotations in AmiGO 2 based on annotation extension data. (a) Filters applied in the AmiGO 2 browser: GO:ID (GO:0051233 “spindle midzone”), annotation extension (mitotic anaphase), taxon (Schizosaccharomyces pombe). (b) Results of the search using the filters applied in (a). Six unique gene products are located to the spindle midzone during mitotic anaphase

6 Summary

The Gene Ontology has proven a vital resource for researchers, enabling them to easily find and use functional data. GO is continually evolving to reflect both accumulating biological knowledge and the computational techniques that researchers need for analysis of a list of gene products. One of the major limitations of using the original simple GO functional annotation has been the lack of contextual information linking together gene products and the roles and pathways they are involved in [4]. Inclusion of this type of data within GO annotations can advance pathway and network analyses substantially, allowing more sophisticated queries and analyses to be performed.

As with all other aspects of GO, annotation extensions continue to evolve—through discussion involving all GOC members and the community—to allow representation and ultimately simple access to a wide variety of contextual data.

Funding Open Access charges were funded by the University College London Library, the Swiss Institute of Bioinformatics, the Agassiz Foundation, and the Foundation for the University of Lausanne.