Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt
- 7.5k Downloads
The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.
KeywordsGene Ontology Annotation Function prediction Misinterpretation
Gene Ontology Consortium
Microtubule organizing centers
Universal Protein Resource
UniProt Gene Ontology Annotation project.
Since its initial publication in 2000 over 5,000 peer-reviewed articles have cited the Gene Ontology (GO)a, and each year an increasing number of researchers are using GO to assist them in informing or validating their hypotheses. GO is used both for small- and large-scale applications, for example, providing functional data for a single protein or a GO term enrichment analysis for an entire proteome, as well as by researchers analyzing the changes in GO itself over time and measuring the reliability and comprehensiveness of GO annotations.
With this increasing usage, it is imperative that users of GO are well informed as to how GO and its associations to gene products (GO annotations) are created and maintained. Several helpful articles have been published that describe the work of the GO Consortium (GOC)[3, 4, 5, 6, 7, 8], but there are certain changes to the ontology and annotation sets that are less widely known among users of the GO.
GO provides almost 40,000 terms across three ontologies describing biological attributes of gene products (October 2013). This includes the 1) molecular functions a gene product performs, 2) the biological processes it is involved in and 3) the cellular components it is located in; each one of these terms may be associated with any number of gene products. These associations are known as 'annotations’ and can be created either manually or automatically. Manual annotations are made by a curator reading full-text primary literature and capturing functional data. Manual annotations provide detailed and specific information and are critical for creating learning sets for automated pipelines. Automatic annotations are made using algorithms that consider gene product properties, such as orthology, domains and sequence similarity[5, 9], and they provide a broad coverage of annotation and cover a significantly larger taxonomic range than manual annotations. This difference in coverage is illustrated by the annotation statistics from the database of the UniProt GO Annotation project (UniProt-GOA) that includes GO annotation from all of the GOC members; as of November 2013, GOC provides over 200 million annotations, with around 99% of these being automatically created[10, 11].
Many changes are made to both the ontology and annotation sets over time - some of these changes are planned and announced by GOC or its members via mailing lists or release notes[10, 12], whereas others are not and reflect ongoing improvements, such as user requests for updates to the ontology or annotations, as well as revisions in response to quality assurance checks.
Here we will cover some of the types of changes that can occur on a regular basis and give examples of unplanned changes that have happened in the past. We will also describe the quality assurance mechanisms we have in place, which are available to any group generating GO annotation, and which can be used to improve the accuracy of both manual and automatic annotations, inevitably changing the composition of GO annotation datasets.
Changes to ontologies and annotations
One of the major misconceptions about GO is that the ontologies and annotations give a complete coverage of biological knowledge and are therefore stable and unchanging. This is not the case, partly because biological knowledge itself is incomplete and partly because of the large volume of experimental evidence that has yet to be captured by functional annotation. Changes to both the ontology and annotations are frequent; the revisions and additions that are made to the ontology are publicly released by the GOC each day and those to the UniProt GO annotation dataset every week. At its most complete, GO can only ever reflect what is currently known and there are parts of biology that are not represented in GO as well as they could be. In the case of the ontologies, these parts are being identified and progressively improved by collaborations between expert scientists and the GOC to accurately represent specific areas of biology. These changes to the GO also involve a subsequent effort to assign the new terms to gene products, thereby affecting the composition of annotation sets, in terms of both adding and removing annotations. It should be said, however, that the fact that an association between a gene product and a particular GO term may be removed does not necessarily imply the annotation was incorrect. Here we will describe several reasons why GO terms and annotations may change over time.
Development of the ontologies
The ontologies need to be refined constantly in order to keep up with the latest biological knowledge and to intersect appropriately with other ontologies. The priorities for development are decided based on the particular interests and expertise of GOC members, funding from external bodies to develop a certain area of the ontology, as well as addressing inconsistencies in the GO. It must be noted that any change to the ontology is carefully considered, and discussed with experts if necessary, to avoid introducing inconsistencies or incorrect information. The ontology request tracker is a publicly available tool for users to request and follow updates to the ontology, and terms affected by a current ontology development project are highlighted in the “GO Discussions” section of the term page in the UniProt-developed GO browser QuickGO.
The alterations may involve only small-scale changes to update a definition or add parent or child terms, or it may be a more comprehensive project involving experts in the scientific community to assist a larger restructuring of specific parts of the ontologies. Some examples of this large-scale ontology development that have been done recently include the restructuring and supplementation of the GO terms referring to heart and kidney development, apoptosis [Paola Roncaglia et al., personal communication] and the cell cycle [Valerie Wood et al., personal communication].
Changes to relationships between ontology terms
In addition to updating the ontology terms themselves, changes to relationships between the terms can also have a significant impact on annotations. In November 2011, there was a decrease of ~2,500 manually and automatically assigned annotations to the Biological Process term 'transcription, DNA-dependent’ (GO:0006351; 18% of total GO Consortium inferred annotations: data from UniProt-GOA releases 101 and 100) due to the removal of an inter-ontology link between this term and the Molecular Function term 'sequence-specific DNA binding transcription factor activity’ (GO:0003700). Inter-ontology links are provided when a function is always found to be part of a process, or a process always occurs in a specific subcellular location; these are used to create inferred annotations to Biological Process or Cellular Component terms when a linked Molecular Function or Biological Process term has been annotated, respectively. In the example given, it was determined that gene products involved in sequence-specific DNA binding transcription factor activity are not always directly involved in DNA-dependent transcription and so the link was removed, resulting in the removal of annotations inferred from this ontology link.
Obsoleting terms from the ontologies
In addition to adding new GO terms and refining existing ones, there is occasionally the need to make obsolete terms that are now considered out-of-scope for GO, for example, those that describe gene products or temporal events. When terms have been identified for obsoletion, quite often a replacement or suggested term is given, for example the Molecular Function term 'apoptosis inhibitor activity’ (GO:0008189) was obsoleted because it was actually describing a Biological Process, and it was therefore replaced by the term 'negative regulation of apoptotic process’ (GO:0043066), meaning any gene products that were associated with the former term are also appropriately described by the latter term. When providing replacement terms, the existing annotations are carefully reviewed to ensure that incorrect annotations will not be introduced. Curators are alerted to any change that may cause problems and are asked to re-annotate to a more appropriate term. Consequently, these refinements can have a significant impact on annotations in terms of numbers, that is, annotations using a term that is now obsolete will be removed and potentially replaced by a more appropriate term, but also an increase in accuracy of the resulting annotations.
Identifying ontology terms unsuitable for direct annotation
Since 2012, GO has started designating some GO terms as not acceptable for direct annotation (direct annotation here means the use of the GO term as the primary annotation for a gene product). These terms remain in the GO as they are still in scope and are useful for other applications.
Those that are inappropriate for both manual and automatic annotation, for example, 'nuclear part’ (GO:0044428) or 'S phase’ (GO:0051320) . The 'Cellular Component-part’ terms are present only for ontology completeness, whereas the cell-cycle phase terms describe a time period rather than a specific process, but remain in the Biological Process ontology as they are used in other parts of an annotation, such as annotation extensions , but cannot be used to directly associate to a gene product. For annotations that are associated with a term from this subset, it should always be possible to associate the gene product with an alternative term, for example, gene products associated with 'nuclear part’ (GO:0044428) are more appropriately associated with the 'nucleus’ term (GO:0005634).
Those that are inappropriate for manual annotation, but automatic annotation is acceptable (e.g. 'response to stress’ [GO:0006950]) . These terms are considered too general to be useful. In all cases, there should be a more specific term that the gene product should be associated with, for example, a paper describing a gene product’s involvement in a stress response should always detail the type of stress applied, allowing the curator to choose a more specific child term of 'response to stress’ (GO:0006950). Predictions made by automatic annotation methods use these terms as it may not be possible to choose a more specific term that will always be true for all gene products annotated by the prediction method.
Terms in these subsets are identified by a “Restriction” comment in the UniProt-developed GO browser QuickGO alerting curators not to use them.
Adding annotations using the ontologies
Subsequent to the ontology developments for heart and kidney development, and apoptosis that were described earlier, curators made use of the revised and expanded ontologies to improve the annotation of gene products. In the case of the heart development annotation project, this has so far resulted in almost 30,000 annotations being provided for 4,000 proteins (data from UniProt-GOA release 124). The added value of these new annotations has previously been demonstrated by performing GO term enrichment analyses on annotation sets from before and after the curation project. The analysis of the annotations after the curation effort provided terms with an increased depth and specificity compared with the analysis before the curation effort, thereby improving the interpretation of analyses of large datasets[21, 22].
Curators not only provide annotation following ontology development, but also take part in focused annotation of gene products independent of ontology development. As increasing amounts of published experimental evidence has yet to be captured by functional annotation, prioritization of gene products for curation is important to consider. UniProt prioritizes annotation based on the expected benefit to the larger scientific community and has had proven success when curating proteins involved in specific organ development[17, 21] or location in a specific organelle. This latter project involved the curation of all known roles and locations of human proteins that are found in the peroxisome. During the course of the curation, 49 new peroxisome-related terms were identified for addition to the ontologies, highlighting that an important role of the curator is to identify and request the creation of terms missing from GO. The project resulted in 1,551 annotations being created for 88 peroxisomal proteins, as well as 296 non-peroxisomal proteins where functional data was available in the same publications as the peroxisomal proteins. This work additionally demonstrated an increased depth and specificity of enriched GO terms in a term enrichment analysis.
Changes to the reference of annotations
All GO annotations require a reference or authority describing where the evidence for the annotation originated, such as a PubMed identifier or an abstract where a description of how the annotation was made is provided. The latter are termed “GO references”. In the process of refining annotation sets, it has very occasionally become necessary to change the reference for a set of annotations in order to better describe the origin of the annotations.
There were two such cases at the beginning of 2013 when it became necessary to change the reference associated with a large set of manually created Cellular Component annotations provided by the Human Protein Atlas and LifeDB projects[25, 26]. Previously these annotations were referenced by publications describing the experimental methods used in pilot studies for obtaining the annotations. As such, these publications did not actually contain the experimental data for all of the gene product subcellular localizations that were referenced by them. As this left the annotations open for misinterpretation, it was decided that they would be more correctly described using a GO reference. This kind of change is important for users to be aware of since any analysis that makes use of the reference field of an annotation will have registered a large decrease in the number of annotations, and has prompted some researchers to mistakenly assume these annotations were incorrect. When substantial changes such as these are made in the UniProt annotation files, an announcement is always provided with the accompanying file release notes, and we encourage users and researchers to read them to ensure that their subsequent analyses benefit.
Changes in submitted annotation sets
In addition to the planned changes described above, there may be problems with the data supplied by contributing annotation groups that may result in large changes in the number of annotations we import from these external databases. This can be due to a variety of reasons, for example in April 2013 Ensembl Plants suspended GO annotation for some species whilst they introduced taxon constraints into their annotation pipeline. This resulted in the omission of approximately 125,000 automatic annotations from release 119 of the UniProt GO annotation file, and the GO annotations for these species were not available again until release 122 (July 2013), after Ensembl Plants resumed their GO annotation.
Large changes in annotation numbers can also occur when annotating groups update their files that map between their gene product identifiers and UniProtKB accessions. In January 2012, the Zebrafish Model Organism Database updated their identifier mapping file resulting in a decrease of approximately 15,000 manual annotations to zebrafish proteins in release 104 of the UniProt GO annotation file; in November 2011 a similar update to the Rat Genome Database identifier mapping file was responsible for a loss of approximately 20,000 manual annotations to rat proteins (release 100 of the UniProt file).
Given the many changes and updates to the ontologies and annotations described here, no single annotation file can be considered as definitive. It is therefore recommended that if a user wishes to use specific annotation files for analyses, it is first worth checking with the provider of the file to determine if there are any significant changes to the datasets that should be considered.
Examples of taxonomic groups whose only source of annotation is from automatic prediction methods
Simian immunodeficiency virus
Dengue virus type 1
West Indian ocean coelacanth
Black flying fox
In addition to the ongoing work to improve existing annotations as described in previous sections, specific taxon-related quality control mechanisms have been developed that can prevent unsuitable annotations from being created in the first place. A description of two of these follows.
GO terms are generally defined to be taxon neutral, but some are applicable only to certain taxa. In 2010, the GOC started applying taxon restrictions to certain GO terms. These restrictions improve the accuracy of annotations as well as identifying errors in the ontologies when applied as an automated check of GO term:taxon combinations. In the initial implementation of the taxon restrictions, approximately 1.6 million erroneous annotations were found and corrected. One example of an improvement to GO, resulting from taxon restrictions, was the refinement of the definitions for the terms concerning microtubule organizing centers (MTOC). In fungi, the MTOC is called the spindle pole body, whilst in mammals it is called the centrosome. In GO we have terms for 'centrosome organization’ (GO:0051297) and for 'spindle pole body organization’ (GO:0051300); only fungal gene products should be annotated to the 'spindle pole body organization’ (GO:0051300) class, therefore the definitions of these terms were clarified so that the meaning is more apparent for curators and users.
The taxon restrictions are publicly available[29, 30] and here we explain how to use these effectively. There are currently two types of taxon restrictions; 'only_in_taxon’ or 'never_in_taxon’ and a term can have more than one taxon constraint. It is important to understand that the taxon restrictions are inherited by any child terms of the term they are applied to. For this reason, the taxon restrictions must be used in conjunction with the GO and a taxonomy hierarchy.
Groups providing UniProt with annotations (e.g., InterPro, Ensembl and Ensembl Genomes) are starting to use these taxon restrictions when creating their annotation sets, so providing us with an improved annotation set already from source. The taxon restrictions are also used by the UniProt-developed GO annotation curation interface, Protein2GO, to prevent inappropriate manual annotation from being created. Increasing numbers of GO curators from the GOC, as well as some groups external to the GOC, use Protein2GO, thus having quality control mechanisms in place at the time an annotation is created is important to these groups as it significantly reduces annotation errors.
As this will be an ongoing quality assurance process with further annotation providers implementing taxon constraints in their resource, this is expected to result in increasing numbers of annotations being removed from datasets giving a reduced, but more accurate, set of annotation predictions.
Post-processing of annotations using taxon restrictions
In some cases, however, it is not always possible for the annotation-providing group to refine their mappings between GO and another vocabulary to remove all of the incorrect automatic predictions without an unacceptably large loss of correct annotations. In these cases, UniProt has introduced additional measures to prevent certain gene product-GO term combinations from being created and these are described below.
The annotation blacklist allows us to specify which protein:GO term combinations should not exist as annotations and it is populated from three sources: 1) curator-review of automatic annotation pipelines, 2) manually curated “caution” comments from UniProt that indicate when a protein does not have the expected function, process or location and 3) NOT annotations provided to UniProtKB entries (annotations that explicitly state a gene product is experimentally shown not to be associated with the annotated Biological Process, Molecular Function or Cellular Component). The blacklist is used by UniProt to prevent these annotations being created not only by automatic annotation, but also by manual annotation through the use of on-the-fly checks in our curation interface, Protein2GO. The annotation blacklist is publicly available as a webservice and we welcome suggestions for further additions.
We have described here some examples of how the Gene Ontology and its associated annotations can change over time, using examples of how UniProt manages these changes. It is of particular importance to those researchers who make use of GO data for analysis to understand why these changes occur in order to make the appropriate conclusions for their interpretations. Additionally it is important for those who are generating GO annotation to ensure they are using the most up-to-date and appropriate terms. The GOC and the groups who provide GO annotation are continually looking for ways to enhance both the content of the GO and the GO annotations in order to prevent this data from becoming stale, as well as to assist researchers in forming hypotheses based on current and accurate information.
aSearch of PubMed using the phrase “Gene Ontology”.
RPH is Project Leader of the UniProt-Gene Ontology Annotation project and an Annotation Manager for the GO Consortium since 2012.
TS is Software Engineer of the UniProt-Gene Ontology Annotation project since 2009.
CO’D is Team Leader of UniProt Content since 2009.
MJ-M is Team Leader of UniProt Development since 2009.
We would like to thank the GO Consortium, the UniProt Consortium and from InterPro: Alex Mitchell, Amaia Sangrador, Hsin-Yu Chang and Siew-Yit Yong.
Funding is provided by the National Institutes of Health (NIH) [UniProt 4U41HG006104-04; GO 5U41HG002273-13]; British Heart Foundation [SP/07/007/23671] and EMBL core funds.
- 3.Blake JA, Dolan M, Drabkin H, Hill DP, Li N, Sitnikov D, Bridges S, Burgess S, Buza T, McCarthy F, Peddinti D, Pillai L, Carbon S, Dietze H, Ireland A, Lewis SE, Mungall CJ, Gaudet P, Chrisholm RL, Fey P, Kibbe WA, Basu S, Siegele DA, McIntosh BK, Renfro DP, Zweifel AE, Hu JC, Brown NH, Tweedie S, Alam-Faruque Y: Gene Ontology annotations and resources. Nucleic Acids Res. 2013, 41 (Database issue): D530-D535.PubMedGoogle Scholar
- 5.Dimmer EC, Huntley RP, Alam-Faruque Y, Sawford T, O’Donovan C, Martin MJ, Bely B, Browne P, Mun Chan W, Eberhardt R, Gardner M, Laiho K, Legge D, Magrane M, Pichler K, Poggioli D, Sehra H, Auchincloss A, Axelsen K, Blatter M-C, Boutet E, Braconi-Quintaje S, Breuza L, Bridge A, Coudert E, Estreicher A, Famiglietti L, Ferro-Rojas S, Feuermann M, Gos A: The UniProt-GO annotation database in 2011. Nucleic Acids Res. 2011, 40 (Database issue): D565-D570.PubMedPubMedCentralGoogle Scholar
- 10.: UniProt-GOA Latest Annotation File Release Notes. [http://www.ebi.ac.uk/GOA]
- 11.: UniProt-GOA Annotation File Downloads. [http://www.ebi.ac.uk/GOA/downloads]
- 12.Ontology Changes Mailing Lists. http://www.geneontology.org/GO.mailing.lists.shtml#no-post]
- 13.Request Tracker for Ontology Updates. http://sourceforge.net/p/geneontology/ontology-requests/]
- 14.Request Tracker for Annotation Updates. http://sourceforge.net/p/geneontology/annotation-issues/]
- 15.Example of GO Discussions in QuickGO. http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0023052#info=2]
- 18.Subset of GO Terms Unsuitable for Direct Annotation. http://purl.obolibrary.org/obo/go/subsets/gocheck_do_not_annotate.owl]
- 19.Subset of GO Terms Unsuitable for Direct Manual Annotation. http://purl.obolibrary.org/obo/go/subsets/gocheck_do_not_manually_annotate.owl]
- 20.UniProt-GOA Gene Ontology browser QuickGO. http://www.ebi.ac.uk/QuickGO]
- 21.Alam-Faruque Y, Huntley RP, Khodiyar VK, Camon EB, Dimmer EC, Sawford T, Martin MJ, O’Donovan C, Talmud PJ, Scambler P, Apweiler R, Lovering RC: The impact of focused gene ontology curation of specific mammalian systems. PLoS One. 2011, 6: e27541-10.1371/journal.pone.0027541.CrossRefPubMedPubMedCentralGoogle Scholar
- 24.GO Reference Collection. http://www.geneontology.org/cgi-bin/references.cgi]
- 29.GO Taxon Constraints File Importer. http://purl.obolibrary.org/obo/go/extensions/x-taxon-importer.owl]
- 30.UniProt-GOA Taxon Constraints Webservice. http://www.ebi.ac.uk/QuickGO/GValidate?service=taxon&action=getConstraints]
- 32.UniProt-GOA Annotation Blacklist Webservice. http://www.ebi.ac.uk/QuickGO/GValidate?service=taxon&action=getBlacklist&taxon=<tax id>
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.