GO functional similarity clustering depends on similarity measure, clustering method, and annotation completeness
Biological knowledge, and therefore Gene Ontology annotation sets, for human genes is incomplete. Recent studies have reported that biases in available GO annotations result in biased estimates of functional similarities of genes, but it is still unclear what the effect of incompleteness itself may be, even in the absence of bias. Pairwise gene similarities are used in a number of contexts, including gene “functional similarity” clustering and the related problem of functional ontology structure inference, but it is not known how different similarity measures or clustering methods perform on this task, and how the clusters are affected by annotation completeness.
We developed representations of both “complete” and “incomplete” GO annotation datasets based on experimentally-supported annotations from the GO database, specifically designed to model the incompleteness of human gene annotations, and computed semantic similarities for each set using a variety of published measures. We then assessed the clusters derived from these measures using two different clustering methods: hierarchical clustering and the CliXO algorithm. We find that the CliXO algorithm, combined with appropriate measures, performs better than hierarchical clustering in reconstructing GO for both complete and incomplete data. Some measures, particularly those that create a pairwise gene similarity by averaging over all pairwise annotation similarities, had consistently poor performance, while a few measures, such as Lin best-match average and Relevance maximum, were generally among the best performers across a broad range of annotation completeness and types of GO classes. Finally, we show that for semantic similarity-based clustering, the multicellular organism process branch of the GO biological process ontology is more challenging to represent than the cellular process branch.
We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and subsequent effects on the clusters derived from these scores. Our results suggest combinations of semantic similarity measures, gene-level scoring methods and clustering methods that perform best for functional gene clustering using annotation sets of varying completeness. Overall, our results underscore the importance of increasing the completeness of GO annotations for supporting computational analyses of gene function.
Keywords: Gene Ontology, Semantic similarity, Annotation completeness, Directed acyclic graph clustering, Hierarchical clustering, Least-diverged human orthologs, Information content
AUC: Area under the curve
AVG: Gene-gene similarity score derived by averaging over all annotation similarities
BMA: Gene-gene similarity score derived by best-match-average
CliXO: Clique extracted ontology
DAG: Directed acyclic graph
HAC: Hierarchical agglomerative clustering
MAX: Gene-gene similarity score derived by maximum score match
ROC: Receiver operating characteristic
The Gene Ontology (GO), a standardized vocabulary of biological function and process terms, is one of the most frequently used resources for gene function annotations. It consists of three domains: molecular function (how a gene functions at the molecular level, e.g. a protein kinase), cellular component (location relative to cell compartments and structures where the gene product is active, e.g. the plasma membrane) and biological process (what larger processes a gene product helps to carry out). Within each domain, the ontology is structured as a directed acyclic graph (DAG) and consists of GO terms that represent different biological properties. Terms low in the DAG are more specific and can have several types of defined relationships to one or more “parent” terms. For the purposes of this paper (grouping genes into biological process classes), we consider two relationship types: “is-a” indicating a child term is a sub-class of its parent term, and “part-of” indicating it is a component of its parent term. It is now common to use the GO in many applications, including gene set enrichment [2, 3, 4, 5], gene network [6, 7] and pathway analysis [8, 9].
A GO annotation associates a specific gene (more precisely a gene product, a protein or noncoding RNA, though we use the term “gene” for simplicity) with a specific class (or “term”) in the Gene Ontology, identifying some aspect of its function. Genes annotated to the same molecular function term have a common molecular mechanism of action, e.g. protein kinase activity; genes annotated to the same cellular component term perform their activities in the same cellular compartment or structure; and genes annotated to the same biological process class are involved in the given biological process. All GO annotations also refer to the evidence underlying them, which can be either from a published experiment, or inferred using a computational method. In this paper, we consider only GO annotations supported by experimental evidence.
GO annotations are commonly used in measures that seek to quantify the functional similarity between genes. As each gene is typically annotated with multiple GO terms, functional similarity involves both a measure for the “semantic” similarity between two GO terms, and a method for combining multiple pairwise GO term similarities into an overall gene function similarity score. Several GO semantic similarity measures have been published and applied in numerous subsequent studies. Most measures quantify pairwise GO term semantic similarity by the amount of information shared between two terms, i.e. the information content (IC) of the most informative (usually also the nearest) common ancestor of the two terms. The most highly-cited measures for computing IC-based GO term similarity are Lin’s, Jiang and Conrath’s, Resnik’s and Schlicker’s scores. Overall pairwise gene-level similarities are computed from the pairwise semantic similarity scores in three distinct ways: 1) using the maximal GO term semantic similarity (MAX), 2) averaging over the best-matched pairwise term semantic similarities (best-match average, or BMA), or 3) averaging over all pairwise term semantic similarities (AVG) [13, 14, 15]. In addition to IC-based measures, other measures include graph-based approaches and vector-based approaches, e.g. Cosine/vector dot product, and the Jaccard index. An additional file introduces each similarity measure in more detail [see Additional file 1].
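To make the distinction between term-level measures and gene-level scoring concrete, the following is a minimal Python sketch, not the implementation used in this study. The toy ontology `PARENTS`, the term probabilities `PROB` and all function names are hypothetical. The BMA variant shown (mean of the row-wise and column-wise best-match averages) is one common definition and is assumed here; Relevance is taken to be Lin's score weighted by one minus the probability of the most informative common ancestor.

```python
import math

# Hypothetical toy ontology (illustrative only): term -> set of direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}
# Hypothetical annotation probabilities p(t); the root has p = 1.
PROB = {"root": 1.0, "A": 0.4, "B": 0.5, "A1": 0.1, "A2": 0.2, "B1": 0.25}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(t) = -log p(t)."""
    return -math.log(PROB[term])


def mica(t1, t2):
    """Most informative common ancestor of two terms."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


def resnik(t1, t2):
    return ic(mica(t1, t2))


def lin(t1, t2):
    denom = ic(t1) + ic(t2)
    # Both terms are the root (IC = 0): treat the score as 0 in this sketch.
    return 2 * resnik(t1, t2) / denom if denom > 0 else 0.0


def relevance(t1, t2):
    return lin(t1, t2) * (1 - PROB[mica(t1, t2)])


def gene_similarity(terms1, terms2, term_sim, method):
    """Combine all pairwise term similarities into one gene-gene score."""
    scores = [[term_sim(a, b) for b in terms2] for a in terms1]
    if method == "MAX":
        return max(max(row) for row in scores)
    if method == "AVG":
        flat = [s for row in scores for s in row]
        return sum(flat) / len(flat)
    if method == "BMA":
        row_best = [max(row) for row in scores]
        col_best = [max(col) for col in zip(*scores)]
        return (sum(row_best) / len(row_best) + sum(col_best) / len(col_best)) / 2
    raise ValueError(f"unknown method: {method}")
```

In this toy setting, a gene annotated to A1 and B1 compared with a gene annotated only to A1 receives a Lin MAX score of 1, a Lin BMA score of 0.75 and a Lin AVG score of 0.5, which illustrates how averaging over all pairs lets unrelated annotations dilute a shared one.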
Different studies have evaluated and compared those measures. For example, Resnik’s method has been reported to have the highest correlation with sequence similarity [13, 18], as well as performing best in stratifying protein-protein interactions, and the best-match average method of combining GO term semantic similarities was found to perform best overall. More recently, Mazandu and Mulder assessed the performance of different measures in different applications, and found that while BMA approaches (except using Resnik’s measure) correlate best with sequence similarity and functional similarity measures, AVG-based approaches correlate best with protein-protein interaction networks.
Pairwise gene semantic similarities are used in a number of contexts, such as for summarizing and visualizing lists of GO terms obtained in enrichment analysis, for constructing functional gene modules, and perhaps most commonly, for gene ‘functional similarity’ clustering [12, 18]. For functional similarity clustering, most of the published methods create hierarchical clusters. Two types of strategies are generally considered for hierarchical clustering: the agglomerative approach (“bottom up”), and the divisive approach (“top down”). In addition, different linkage criteria are used to determine the distance between objects to be clustered [23, 24]. The major limitation of hierarchical clustering is that it only allows each gene to belong to one cluster. To overcome this limitation, Kramer et al. developed the Clique Extracted Ontology (CliXO) algorithm for Directed Acyclic Graph (DAG) clustering, which allows each gene to belong to different clusters, and each cluster to have multiple parent clusters. Kramer et al. showed that for at least one similarity measure, CliXO can reconstruct the Gene Ontology (cellular component aspect) to a high degree of accuracy, using the annotations for yeast genes. However, we note that cellular component annotations for yeast genes are relatively complete. It remains unclear how clustering approaches perform in the more common scenario of incomplete annotations. Annotation incompleteness has been shown to be an important confounder in recent efforts to evaluate gene function prediction accuracy.
Quantifying the incompleteness of knowledge of human gene function
The change of pairwise gene semantic similarities due to incomplete annotations
Accuracy of gene clustering methods for “complete” annotation sets
Overall, for cellular processes, the performance tends to be better when two conditions hold: 1) the semantic similarity measure uses either maximal functional similarity between genes or the average-best-matched functional similarities between genes, and 2) the DAG clustering (CliXO) was applied. According to a one-directional paired t test, the combinations of Relevance MAX, JiangConrath BMA and Lin BMA utilizing DAG clustering, and combination of JiangConrath MAX utilizing HAC clustering, have significantly higher AUC than other combinations. The poor overall performance of similarity measures that average all pairwise annotation scores is not surprising given its dependence on the number of annotations, which varies across different genes as described above. The better overall performance of DAG clustering results from allowing genes to be grouped into multiple clusters, which is a key element of the Gene Ontology structure.
By contrast, the overall performance of gene clustering based on multicellular organism-level processes is quite poor (the overall median AUC value across all measures is below 0.7). This may be due to the fact that this annotation set has, on average, a much larger number of distinct annotations per gene than does the cellular process set (Fig. 2). If two genes work together in one or a few processes but not in others, their overall similarities will be low and they will not tend to be clustered together. In other words, information about conditional similarity in functions can be lost in the overall score, and therefore in the gene clusters constructed from these scores. According to one-directional paired t test, Lin BMA utilizing DAG clustering, and Resnik MAX, Weighted Jaccard and Weighted Cosine utilizing HAC clustering have significantly higher AUC than other combinations. In addition, the performance of DAG clustering decreases substantially for clustering using multicellular process annotations: three out of the top four combinations with significantly higher AUC for reconstructing cellular GO classes utilized DAG clustering (Fig. 4a); only one out of the top four combinations with significantly higher AUC for reconstructing multicellular GO classes utilized DAG clustering (Fig. 4b). This result is consistent with our interpretation that conditional similarities can be effectively lost in the overall pairwise score, so that the DAG clustering property of allowing multiple clusters for each gene is no longer an advantage when the diverse annotations are summarized by a single similarity score.
Accuracy of gene clustering with incomplete annotations
For “incomplete” multicellular organism process annotation sets, while the median AUC value across all combinations is the same as for the complete set, the best combinations perform substantially worse on incomplete data, e.g. the Lin-BMA-DAG combination has an average AUC of 0.76 on complete data (Fig. 4b) with a maximum of 1 (perfect performance), while on incomplete data the average AUC is 0.72 with a maximum of 0.8 (Fig. 5b). The average performance of different combinations on multicellular processes is much worse than on cellular processes. Given the poor clustering results on even the complete multicellular process annotations as described above, this is not surprising.
Table 1 Best combinations of similarity and clustering methods for recovering the known structure of GO classes
Type of GO classes
Semantic similarity measure
Relevance MAX, JiangConrath BMA, Lin BMA
Resnik MAX, Weighted Jaccard, Weighted Cosine
JiangConrath MAX, Weighted Cosine
Lin BMA, Lin MAX, Relevance MAX, Resnik MAX
Lin BMA, Relevance MAX
Only one combination, Lin BMA utilizing DAG clustering (CliXO), is among the top performing combinations in all cases, and JiangConrath MAX tends to perform best when utilizing hierarchical clustering. The top performing combinations never use the AVG method for combining similarity scores. Overall, a larger number of top performing combinations utilize DAG clustering.
To assess whether the accuracy calculations for our incomplete data sets were consistent between different simulated sets, we calculated the coefficient of variation (CV) of all AUC values (each simulated set has a corresponding AUC value) for each GO class. The distribution of CV values for each measure/algorithm combination was then plotted as shown in Additional file 2: Figure S4. Overall, there is a high degree of consistency: the grand median of CV is around 10%, i.e. on average the AUC value for a simulated set deviates from the mean AUC value by about 10%. Specifically, for simulated cellular process sets, for most combinations of measure and algorithm, CV values are narrowly distributed around 10% (except for Resnik-BMA-DAG, Resnik-AVG-DAG and Relevance-AVG-DAG). For simulated multicellular process sets, quite a few combinations gave a more dispersed distribution of CV values, with the 75th percentile close to 20%. This indicates a smaller degree of consistency for simulated multicellular process sets than for the cellular process sets, though still showing overall consistency. This result is expected given the much higher degree of incompleteness of the multicellular process annotation sets compared to cellular processes (Fig. 2).
Measure the change of clusters due to incomplete annotations
The preceding sections compared the clusters obtained for either complete or incomplete annotation sets, to the actual GO classes. We used this as a proxy for clustering accuracy. In this section, we compare the clusters obtained for a given method (combination of similarity measure and clustering algorithm) on the incomplete annotation sets, to those obtained on the complete annotation sets. Thus we are assessing the robustness of each method to incompleteness.
We assessed the effects of annotation completeness on the distribution of pairwise gene semantic similarity scores, and subsequent effects on the clusters derived from these scores. We performed our assessments on all combinations of similarity measure and clustering method for recovering the known GO classes, using both “complete” and “incomplete” annotations. Specifically, we considered 14 previously published similarity measures, and two types of clustering, hierarchical and CliXO. For both complete and incomplete annotation sets, measures which create a pairwise gene similarity by using the maximum or best-match average over all pairwise annotation similarities tend to perform best. In addition, the CliXO clustering method, combined with appropriate similarity measures, tends to perform better than hierarchical clustering. A few particular methods, such as Lin BMA and Relevance MAX utilizing CliXO, are generally among the most accurate for both complete and incomplete annotation sets, and both cellular and multicellular organism processes (Table 1). The best-match-average method of deriving gene-level scores, however, generally shows greater robustness to incompleteness than the maximum method, meaning that the cluster identities are more similar to those obtained for “complete” annotations. Therefore this method might be preferable for many clustering applications. The averaging method at the gene level, while the most robust, has much lower clustering accuracy than any other method. This is at least in part because the signal of similar annotations (shared between two genes) is diluted to varying degrees by the noise of dissimilar annotations, an effect that depends on the number of annotations.
We find that hierarchical agglomerative clustering approaches (which yield only strict hierarchies, i.e. a cluster can have only one parent cluster) have higher accuracy with similarity measures that utilize the maximum pairwise annotation score, or with the WeightedJaccard or WeightedCosine measures; the WeightedJaccard or WeightedCosine measures are more robust to incompleteness. The CliXO clustering method, because it can allow multiple parent clusters, is able to utilize information from multiple different annotations captured in the best-match average scores (which average over the best match between each annotation of one gene, and an annotation of the other gene). This is consistent with the testing of CliXO with best-match-average scores by Kramer et al. (though they utilized the Resnik measure, which we find to be less accurate, and less robust to incompleteness than some other measures). However, when the number of distinct annotations for each gene is too large, as in our multicellular process annotation sets, this advantage disappears.
We find that while several combinations of similarity measure and clustering algorithm perform well for representing GO cellular processes, all combinations perform much worse for representing multicellular organism-level processes. This likely reflects the greater complexity of this branch of the GO biological process ontology, and the larger number of annotations in both the complete and incomplete sets (and therefore the greater loss of information when reducing into one dimension of a similarity score).
Our study has attempted to estimate a lower bound on the incompleteness of experimental GO annotations of human genes, by comparing with experimental annotations of orthologous genes in highly studied model organisms (yeast and mouse). We find that human annotations are highly incomplete, and much more incomplete for multicellular organism level processes than for cellular level processes. We also find, not surprisingly, that genes tend to be more highly pleiotropic (more distinct annotations per gene) at the multicellular level than at the cellular level. We used this estimate to simulate incomplete annotation sets, and assess how this incompleteness can affect downstream GO-based analyses, specifically pairwise semantic similarity scores and gene similarity clusters derived from them. To make this comparison, we also needed to assess the clusters derived from “complete” annotation sets. We find that for cellular-level process annotations, which are moderately incomplete and show less functional pleiotropy, the DAG-based CliXO clustering method performs well with several different GO term semantic similarity measures. However, because genes are generally annotated to multiple, distinct terms, it is critical that the overall gene pairwise similarity is derived from a method that attempts to first match up each GO annotation for one gene with its cognate for the other gene (either using the maximum method, or best-match-average method), rather than taking a simple average over all possible matches (the average method). For multicellular processes, for which genes display much greater pleiotropy, nearly all combinations of similarity measures and clustering methods perform relatively poorly, on both complete and incomplete annotation sets, at least in part due to the difficulty in reducing a comparison over a large number of distinct functional annotations to a single gene-gene similarity score.
However, in all cases, we find a substantial decrease in both clustering accuracy and robustness when annotations are incomplete, underscoring the importance of increasing the completeness of GO annotations for supporting computational analyses of gene function.
Creating a representation of a set of “completely annotated” human genes
We first created an approximation to a set of human genes that are “completely annotated” with respect to GO biological process terms. Recognizing that yeast is the best-studied eukaryotic cellular system, and mouse the best-studied vertebrate animal, we determined two sets separately: cellular processes, and multicellular organism-level processes. For cellular processes, we considered all yeast genes that are associated with at least 75 distinct publications in PubMed to be “well studied” experimentally. Similarly, for multicellular processes, we considered all mouse genes that are associated with at least 75 distinct publications. Associations between PubMed IDs and yeast genes and mouse genes were obtained from the Saccharomyces Genome Database (SGD) and the Mouse Genome Informatics (MGI) Data and Statistical Reports, respectively. This resulted in a set of 866 “well-annotated” yeast genes, and a set of 850 “well-annotated” mouse genes. All GO annotations having experimental evidence codes (EXP, IDA, IPI, IGI, IEP, IMP) were extracted for each yeast/mouse gene in the final sets, from the GO database (AmiGO 2.0, Mar 12, 2014). We used the Bioconductor package GO.db 2.8.0 to remove “redundant” annotations; this included annotations to the same term using a different piece of evidence, and annotations to less specific terms that were already covered by a more specific annotation.
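The evidence-code filter and the removal of redundant annotations described above can be illustrated with a short Python sketch. It is not the GO.db-based procedure used in the study: the toy ontology `TOY_PARENTS`, the record layout `(gene, term, evidence)` and the function names are hypothetical, chosen only to show the logic of dropping duplicate terms and terms implied by a more specific annotation.

```python
# Experimental evidence codes listed in the text.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IGI", "IEP", "IMP"}

# Hypothetical toy ontology (illustrative only): term -> set of direct parents.
TOY_PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "B1": {"B"},
}


def strict_ancestors(term, parents):
    """Return all ancestors of a term, excluding the term itself."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def clean_annotations(records, parents):
    """Keep experimental annotations and drop redundant ones per gene.

    records: iterable of (gene, term, evidence_code) tuples (assumed layout).
    Returns a dict mapping each gene to its set of non-redundant terms.
    """
    by_gene = {}
    for gene, term, evidence in records:
        if evidence in EXPERIMENTAL_CODES:
            # Using a set merges the same term supported by different evidence.
            by_gene.setdefault(gene, set()).add(term)

    cleaned = {}
    for gene, terms in by_gene.items():
        implied = set()
        for term in terms:
            implied |= strict_ancestors(term, parents)
        # A term is redundant if a more specific annotated term implies it.
        cleaned[gene] = terms - implied
    return cleaned
```

For example, a gene with experimental annotations to both A1 and its parent A keeps only A1, and an annotation supported solely by a non-experimental code such as IEA is discarded.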
To convert these model organism annotations to an approximation of “complete” human gene annotations, each yeast (cellular processes) or mouse (multicellular organism processes) gene was mapped to the corresponding least-diverged human ortholog, as defined by PANTHER. Of the 866 well-annotated yeast genes and 850 well-annotated mouse genes, 434 and 813, respectively, could be mapped to least-diverged orthologs in humans. We took these sets as our approximated “completely annotated” human gene sets for cellular processes (434 human genes) and multicellular organism processes (813 human genes).
Quantify incompleteness of experimental biological process annotations among human genes
Simulating “incompletely annotated” genes
Step 1: Select the number of annotations to remove from a completely annotated gene; begin with the largest remaining number of missing annotations, i.e. choose n = max (N).
Step 2: from all remaining genes (i.e. with unmodified annotations) in S, randomly select gn of them, denoted by s (genes in s must have at least n annotations)
Step 3: randomly remove n annotations from each gene in s
Step 4: exclude s from S and exclude n from N
Step 5: repeat Steps 1-4 until complete.
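Steps 1 to 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: we assume that S is the set of completely annotated genes, that N is the set of distinct numbers of missing annotations, and that g_n is the number of genes from which n annotations are to be removed. The function name `simulate_incomplete` and the `removal_plan` mapping (n to g_n) are hypothetical.

```python
import random


def simulate_incomplete(complete, removal_plan, rng):
    """Simulate one incomplete annotation set.

    complete:     dict gene -> set of GO terms (the "complete" set, S).
    removal_plan: dict n -> g_n, the number of genes that lose n annotations
                  (assumed representation of N and g_n).
    rng:          random.Random instance, so that runs are reproducible.
    """
    incomplete = {gene: set(terms) for gene, terms in complete.items()}
    remaining = set(complete)  # genes whose annotations are still unmodified

    # Step 1: process the largest number of missing annotations first.
    for n in sorted(removal_plan, reverse=True):
        # Step 2: candidates must be unmodified and have at least n annotations.
        eligible = sorted(g for g in remaining if len(complete[g]) >= n)
        # rng.sample raises ValueError if fewer than g_n genes are eligible.
        chosen = rng.sample(eligible, removal_plan[n])
        # Step 3: randomly remove n annotations from each selected gene.
        for gene in chosen:
            for term in rng.sample(sorted(incomplete[gene]), n):
                incomplete[gene].discard(term)
        # Step 4: exclude the modified genes; n is consumed by the loop.
        remaining -= set(chosen)
    # Step 5: the loop ends once every n in the plan has been processed.
    return incomplete
```

Starting with the largest n makes it more likely that enough genes with at least n annotations are still unmodified when they are needed. Calling the function repeatedly with differently seeded generators yields independent simulated sets.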
The simulation was repeated 100 times to generate 100 different incomplete sets each of cellular and multicellular annotations. Gene-cluster analyses using “complete” and “incomplete” annotation sets were compared with each other to evaluate the impact of annotation incompleteness.
Calculation of GO-based similarities and clustering of genes using complete and incomplete annotation sets
For both complete and incomplete annotation sets, only those genes with at least one annotation were included in the analysis. GO-based gene-gene similarity scores were calculated first. Information content (IC) based semantic similarity measures included Resnik's, Lin's, Jiang and Conrath's and Schlicker’s measures, and non-IC based measures included the weighted Cosine and weighted Jaccard measures. For each IC-based semantic similarity measure, three different methods were used to combine pairwise annotation similarity scores into pairwise gene similarity scores: the average of all pairwise annotation similarities, the maximum of the annotation similarities, and the best-match average of the annotation similarities. Several R-based tools have been developed recently for computing both IC-based and non-IC-based semantic similarities. In this study, we used the R-based tool csbl.go to calculate all similarities listed above. With csbl.go, a similarity score between genes can be computed automatically. For IC-based similarities, the tool calculates the probability of each GO term occurring in the set of annotations for all genes of a given species. This probability can be directly transformed to an IC value for each GO term. Thus, the only parameters that need to be specified for calculating similarities are the name of the species, the ontology domain and the name of the similarity measure. For each similarity measure, both hierarchical agglomerative clustering (HAC) and Directed Acyclic Graph (DAG) clustering were performed to cluster genes, separately for each annotation set (2 complete sets and 200 incomplete sets). The agglomerative nesting algorithm implemented in the R function ‘agnes’ (package ‘cluster’) was used for hierarchical clustering; the Clique Extracted Ontology (CliXO) algorithm developed by Kramer et al. was used for DAG clustering.
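The transformation from term probabilities to IC values mentioned above can be sketched as follows. This Python example does not reproduce csbl.go; it assumes that the probability of a term is the fraction of annotated genes annotated to that term or to any of its descendants, which is one common convention. The toy ontology `ONTOLOGY_PARENTS` and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy ontology (illustrative only): term -> set of direct parents.
ONTOLOGY_PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}


def with_ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_ic(gene_terms, parents):
    """Estimate IC(t) = -log p(t) from a gene -> terms annotation table.

    Each gene counts once for every term it is annotated to, directly or
    through a more specific term (assumed propagation rule).
    """
    counts = Counter()
    for terms in gene_terms.values():
        propagated = set()
        for term in terms:
            propagated |= with_ancestors(term, parents)
        counts.update(propagated)
    n_genes = len(gene_terms)
    return {term: -math.log(count / n_genes) for term, count in counts.items()}
```

Under this convention the root term always has probability 1 and therefore an IC of 0, and a term can never be less informative than any of its ancestors.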
Measuring the accuracy of clustering
Measuring the consistency of clustering from different simulated “incomplete” gene sets
We assessed the consistency of analysis from different simulated sets by calculating the CV(%) of AUC for clustering genes from the same GO class using different simulated sets. For each GO class with a meaningful size (containing between 5 and 50 genes in the given “complete annotation” set), we calculated the AUC as described above for each of the simulated data sets. We then calculated the CV(%) of AUC (i.e. the standard deviation of AUC values across all simulated sets divided by the mean of the AUC values) for each GO class, given each similarity measure and clustering algorithm. The distribution of CV(%) across all GO classes was then calculated. A small CV(%) indicates a small deviation from the mean AUC, which means a high degree of consistency of results between different simulated data sets.
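As a worked illustration of this consistency measure, the following Python sketch computes the CV(%) of AUC per GO class and the median across classes. The function names and the input layout (a mapping from GO class to the list of AUC values, one per simulated set) are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def cv_percent(auc_values):
    """CV(%) = 100 * standard deviation / mean of the AUC values."""
    mean = statistics.mean(auc_values)
    # Sample standard deviation (n - 1 denominator); an assumption here.
    return 100 * statistics.stdev(auc_values) / mean


def cv_by_class(auc_by_class):
    """Compute CV(%) for every GO class.

    auc_by_class: dict GO class -> list of AUC values, one per simulated set.
    """
    return {go_class: cv_percent(values) for go_class, values in auc_by_class.items()}


def median_cv(auc_by_class):
    """Median CV(%) across GO classes for one measure/algorithm combination."""
    return statistics.median(cv_by_class(auc_by_class).values())
```

For instance, AUC values of 0.7, 0.8 and 0.9 across three simulated sets give a mean of 0.8 and a sample standard deviation of 0.1, hence a CV of 12.5%.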
Measuring the robustness of clustering to incompleteness
Step 1: Each cluster of “incompletely annotated” genes was matched to the cluster(s) of “completely annotated” genes with the largest number of overlapping genes.
Step 2: Following step 1, if a cluster of “completely annotated” genes has multiple matched clusters of “incompletely annotated” genes, the one with the largest number of overlapping genes was considered as the best match.
Step 3: For each cluster from the completely annotated set, we calculated the proportion of genes that were found in the best-matched cluster from an incompletely annotated set. For unclustered singletons from the completely annotated set, we calculated the proportion of genes that remained as singletons in the clustered incompletely annotated set.
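The matching procedure above can be sketched in Python as follows. This is an illustrative reading of the steps, not the code used in the study: clusters are represented as sets of gene identifiers, the function names are hypothetical, and we assume that a complete cluster with no matched incomplete cluster receives a proportion of 0.

```python
def cluster_robustness(complete_clusters, incomplete_clusters):
    """Proportion of each complete cluster recovered by its best match.

    Both arguments are lists of sets of gene identifiers.
    Returns one proportion per complete cluster, in input order.
    """
    # Match each incomplete cluster to the complete cluster(s) with which
    # it shares the largest number of genes.
    matches = {i: [] for i in range(len(complete_clusters))}
    for inc in incomplete_clusters:
        overlaps = [len(inc & comp) for comp in complete_clusters]
        best = max(overlaps, default=0)
        if best == 0:
            continue  # no shared genes with any complete cluster
        for i, overlap in enumerate(overlaps):
            if overlap == best:
                matches[i].append(inc)

    proportions = []
    for i, comp in enumerate(complete_clusters):
        if not matches[i]:
            proportions.append(0.0)  # assumption: unmatched clusters score 0
            continue
        # Among several matched clusters, keep the one with the largest overlap.
        best_inc = max(matches[i], key=lambda inc: len(inc & comp))
        # Proportion of the complete cluster found in its best match.
        proportions.append(len(best_inc & comp) / len(comp))
    return proportions


def singleton_retention(complete_singletons, incomplete_singletons):
    """Proportion of complete-set singletons that remain singletons."""
    if not complete_singletons:
        return 0.0
    return len(complete_singletons & incomplete_singletons) / len(complete_singletons)
```

A proportion close to 1 means that the cluster obtained from complete annotations is largely preserved when annotations are incomplete.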
We thank Dr. Huaiyu Mi for helpful comments on the manuscript.
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award number U41HG002273. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Availability of data and materials
The datasets used and analyzed during the current study are available on reasonable request.
ML developed the methods under the supervision of PDT. ML and PDT evaluated and interpreted the results. Both authors contributed to the final version of the paper. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
4. Nota B. Gogadget: an R Package for Interpretation and Visualization of GO Enrichment Results. Mol Inform. 2016;36(5–6). https://doi.org/10.1002/minf.201600132.
5. Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–50. https://doi.org/10.1073/pnas.0506580102.
7. Oliveira G, Santos A. Using the Gene Ontology tool to produce de novo protein-protein interaction networks with IS_A relationship. Genet Mol Res. 2016;15(4). https://doi.org/10.4238/gmr15049273.
9. Hill D, D’Eustachio P, Berardini T, Mungall C, Renedo N, Blake J. Modeling biochemical pathways in the gene ontology. Database (Oxford). 2016. https://doi.org/10.1093/database/baw126.
10. Lin D. An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann; 1998. p. 296–304.
11. Jiang J, Conrath DW. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proc of 10th International Conference on Research in Computational Linguistics, ROCLING’97. Taiwan; 1997.
12. Resnik P. Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence; 1995. p. 448–453.
17. Bodenreider O, Aubry M, Burgun A. Non-lexical approaches to identifying associative relations in the gene ontology. Pac Symp Biocomput. 2005;10:91–102. https://psb.stanford.edu/psb-online/proceedings/psb05/bodenreider.pdf.
20. Mazandu G, Mulder N. Information content-based Gene Ontology functional similarity measures: which one to use for a given biological data type? PLoS One. 2014;9(12):e113859. https://doi.org/10.1371/journal.pone.0113859.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.