OrthoClust: an orthology-based network framework for clustering data across multiple species
- 4.6k Downloads
Increasingly, high-dimensional genomics data are becoming available for many organisms.Here, we develop OrthoClust for simultaneously clustering data across multiple species. OrthoClust is a computational framework that integrates the co-association networks of individual species by utilizing the orthology relationships of genes between species. It outputs optimized modules that are fundamentally cross-species, which can either be conserved or species-specific. We demonstrate the application of OrthoClust using the RNA-Seq expression profiles of Caenorhabditis elegans and Drosophila melanogaster from the modENCODE consortium. A potential application of cross-species modules is to infer putative analogous functions of uncharacterized elements like non-coding RNAs based on guilt-by-association.
KeywordsGene Ontology Cost Function Orthology Relationship Orthologous Pair Network Alignment
long non-coding RNA
small nucleolar RNA.
Over the past decade, we have witnessed the burgeoning of comparative genomics. With the advancement of sequencing and other high-throughput techniques, ‘omics’-scale data have been generated in many species [1, 2]. Apart from genomic sequences, one can now compare two or more species in terms of their epigenome, regulome, transcriptome, interactome, and so on. As a result, computational frameworks that integrate such system-level data from different species are of particular interest. While different kinds of ‘omics’ scale data reflect different facets of a biological system, many of these high-dimensional data can be projected onto a network. For instance, the expression profiles of genes or the histone modification patterns in their upstream regions can be used to connect genes to form various co-association networks. Data from different species thus form species-specific networks that can, in principle, be integrated by incorporating evolutionary relationships.
For a set of genes, features associated with the topological properties of networks open additional windows to interpret their genomics features and annotation, among which the concept of network modules is particularly important from a systems biology perspective. Through identifying modules, one can reduce the complexity of a biological system by collapsing the large number of interconnections amongst its constituents into a smaller number of interactions between the modules [3, 4]. While different ‘omic’ data result in different networks, genes clustered together to form modules are likely to have a common biological role; for instance, being regulated by a common transcription factor, being part of a protein complex, or being presented in the same pathway. One of the most widely studied types of ‘omic’-scale data is genome-wide expression data. To analyze genome-wide expression profiles, network-based algorithms  together with approaches like hierarchical clustering , self-organized maps , spectral techniques  and superparamagnetic clustering  have been developed and extensively used since the dawn of the microarray era. While these methods have provided valuable biological insights, they were aimed at clustering within individual species only. To utilize the evolutionary information between species, a natural generalization that performs clustering across multiple species will be instructive, especially because the recent advancements in transcriptome profiling techniques like RNA-Seq have generated tremendous amounts of genome-wide expression data across many different species [10, 11].
Here we present OrthoClust, a novel network-based framework for clustering data across multiple species. OrthoClust integrates the networks of individual species using orthology relationships of genes between species. As connected genes within a species and orthologous pairs across species connect genes with the same function within and across species, respectively, OrthoClust naturally extends the idea of functional modules into a cross-species dimension. The essence of OrthoClust is a cost function for the detection of modules across species. We present a solution of the optimization problem by using simulated annealing. As expression data comprise one of the most important classes of 'omic-' data, we demonstrate OrthoClust using the genome-wide expression data of worm and fly generated by the modENCODE consortium, arriving at co-expression modules that range from being highly conserved to species-specific. We then compare the results with traditional single-species clustering and demonstrate the advantage of our approach. We further compare the conserved modules with results obtained from IsoRank , one of the state-of-the-art methods for network alignment. As more and more system-wide data are generated across different species, the concept of orthology-based meta-clustering demonstrated by OrthoClust can serve as a general computational framework for integration of other ‘omic’-scale data, such as protein-protein interactions.
Cross-species modules in a multi-layer network
Overview of OrthoClust
Construction of the multi-layer network
The inputs of OrthoClust are the co-association networks of two or more species, and the orthology relationships between genes of the species of interest. Of course, co-association networks are derived from raw data, and there are various ways to arrive at the networks depending on the specific data and biological purpose. OrthoClust combines individual layers of co-association networks by connecting genes in different species via their orthology relationships. To account for the fact that many orthologous pairs are not one-to-one but many-to-many, orthologous links are weighted such that the weights are normalized by the number of orthologs of each node (Materials and methods).
Defining the cost function in the multi-layer network
Here, S1 and S2 are the sets of genes for the two species, respectively. Λ ij = A ij - k i k j /2 m, with k i = ∑ j A ij and m = ∑ i k i /2. As A is a network adjacency matrix, the subtracted term is the expected number of links between nodes i and j in an ensemble of random graphs with the same degree distribution [14, 15]. Its presence in H is to reduce the contribution of links between nodes with higher degree (that is, hubs). The superscripts (1 or 2) correspond to the networks of two species. The value of the Kronecker delta equals one if nodes i and j have the same label and zero otherwise. The first two terms of the cost function H are essentially the modularity functions of two individual networks . In the standard modularity function, a network with high modularity means there is a high number of links between nodes in the same module, and a low number of links between nodes in different modules. The novelty of OrthoClust is the last term regarding the orthologous links between nodes in different layers of the co-association networks. It sums over O(S1, S2), that is, all the orthologous pairs between S1 and S2. As mentioned above, each pair of orthologs is weighted by w ij to take into account the many-to-many orthology relationships (Materials and methods). Configurations in which orthologs have the same label will lower the cost function. The relative contribution between co-association edges and orthologous edges is controlled by a coupling constant κ (for determination of the constant, see below). In the language of statistical physics, the entire framework can be interpreted as a spin system called a q-state Potts model , which is a generalization of the Ising model. The cost function characterizes the energy of the spin (label) system and the optimal assignment of nodes to different modules is equivalent to the ground state of the Potts model.
Assigning nodes to modules by multiple runs of simulated annealing
To optimize the cost function, OrthoClust employs a standard simulated annealing procedure similar to the one used in . Labels are randomly assigned initially, and updated via a heat bath algorithm. The temperature of the system is gradually lowered until the flipping rate of labels is lower than a certain threshold (Materials and methods). Although the labels have divided nodes of the network into modules, we do not directly use the resultant configuration due to the probabilistic nature of simulated annealing, but perform the annealing process R times. By summarizing the results using a co-appearance matrix (a matrix whose elements (i,j) represents how often the two nodes i and j co-appear in the same module), OrthoClust arrives at a set of modules by thresholding the co-appearance frequency and looking for nodes that co-appear often (Materials and methods). Often the sizes of the modules follow a power law distribution; tiny modules are therefore neglected (Materials and methods). OrthoClust is generally not very sensitive to the value of q. This is because, even though the system starts with many different labels (a high value of q), the large range of states will coalesce into a few modules and only a few labels will remain to cover the appropriate number of modules as the system cools down. In other words, the exact value of q is not very important as long as q is chosen to be large.
Using OrthoClust for integrating expression profiles across species
The minus sign in front of the negative links means the effects of the negative links are opposite to those of the positive links, meaning that, in the favorable configurations, nodes in the same module are likely to be connected by positive links and nodes from different modules tend to be connected by negative links .
Simultaneous clustering of expression profiles in worm and fly via OrthoClust
Separation of modules in the Gene Ontology space
Benchmarking modules based on co-regulation patterns
Apart from GO analysis in Figure 4, we further tested whether genes inside a module are indeed more functionally relevant by examining the number of common regulators they possess. We identified the binding targets of a set of worm and fly transcription factors based on ChIP-Seq experiments also generated by the modENCODE consortium  (Materials and methods). These ChIP-Seq experiments and the RNA-Seq experiments for expression profiles were performed in the same developmental stages. For all pairwise combinations of modules, we then counted the number of common transcription factors for each pair of genes (Figure S3 in Additional file 1). We found that pairs of genes within a module, on average, have more common transcription factors than pairs of genes in different modules (a 2.6-fold increase in worm and 1.6-fold increase in fly, P < 0.001 under permutation test). This result is consistent with general observations that a transcription factor tends to regulate targets sharing similar biological functions.
Effects and the determination of the coupling constant κ
Another method we used to examine the effects of κ is the similarity measure between genes based on their GO annotation as described in the previous section. The similarity scores between each pair of the 34,000 worm and fly genes define a weighted network W, where the nodes are the genes and the edges are weighted by the pair-wise scores. Since the weighted network is not a multi-layer network but a single-layer network, its modularity can be quantified by a more traditional modularity function for weighted network defined as: , where k i = ∑ j W ij , m = ∑ i k i /2 and σ i is the module label of node i. A high modularity score means highly similar genes (in terms of GO annotation) are grouped in a module whereas distant genes are separated. In principle, this weighted network based on GO annotation serves as a benchmark for the multi-layer network defined by OrthoClust. A favorable way of assigning nodes to modules by OrthoClust therefore should also be a favorable way to divide the weighted network into modules. As shown in Figure 5B, for each value of κ, we found the way to assign nodes to the modules by OrthoClust and then calculated the corresponding modularity score of the weighted network. When the value of κ is too high or too low, the modularity score of the weighted network is low. The κ that maximizes the modularity score of the weighted network should therefore be the optimal κ for OrthoClust. Combing Figure 5A and Figure 5B, we picked κ = 3 as our optimal value.
Weights associated with the orthology relationships
Orthology relationships between species connect layers of co-association networks. While the coupling parameter κ defines the overall relative contribution between intra-species and inter-species connections, the weight of each orthology connection could, in principle, be adjusted. It is very common that in eukaryotes many orthologs are many-to-many instead of one-to-one, mathematically forming various bipartite cliques in the multi-layer network. We tested OrthoClust by treating all the orthologous pairs equally in the cost function. We found that most of the cliques cannot be resolved, and their nodes got assigned to a single module (Figure S4 in Additional file 1). This implies the cost function favors very large cliques and is biased against the conserved clusters that are linked by one-to-one orthologs. To account for this effect, OrthoClust therefore weights down the orthologous link of a node by the number of orthologs it possesses. As shown in Figure S4 in Additional file 1, the weighted approach works better in resolving the huge cliques.
Comparison with single-species clustering
Comparison with network alignment
The concatenation of networks using orthology relationships resembles the problem of cross-species network alignment . To compare OrthoClust with network alignment, we applied IsoRank  to align the worm and fly co-expression networks (Materials and methods). Again, using the metagenes obtained from  as a gold-standard, we found that 88% of metagenes were aligned by IsoRank (Figure S5 in Additional file 1), compared with 81% by OrthoClust. Although IsoRank slightly outperformed OrthoClust in identifying the corresponding functional genes between two species, it does not immediately report how these pairs form clusters. Motivated by , we looked for co-expression edges in two networks whose nodes are aligned by IsoRank. By connecting such edges in the network, we generated aligned subgraphs that could potentially be interpreted as modules conserved across two species. Among the gene-pairs that are predicted to be in the same module, we found that 43% are consistent with OrthoClust. The percentage is probably reasonable because aligned subgraphs do not really possess the properties of clusters signified by the dense connections between genes within a species.
Simulated annealing was employed to optimize the cost function defined by OrthoClust. To reduce the effects of the stochastic nature of simulated annealing, we constructed the co-appearance matrix by repeating the annealing process R times. To determine R, we ran independent trials of R runs, resulting at different co-appearance matrices and thus different sets of clusters. We then compared the consistency between two sets of clusters by considering if two genes have been assigned to the same module by trial 1, whether or not they are assigned to a common module in trial 2. This is essentially done by calculating a confusion matrix (Materials and methods). By pairwise comparison of independent trials, we found that the overlap between trials increases as R increases (Figure S6 in Additional file 1). More specifically, the overlap increases from 46% for R = 8 to 65% for R = 32, and 75% for R = 64. The statistically significant results for R = 32 in the previous analysis show that the value offers a reasonable compromise between computational cost and robustness. We then further superposed different trials to construct a co-appearance matrix with 128 runs, and thus a set of 'most accurate' clusters. We then calculated the consistency between the ultimate set with sets constructed with smaller values of R (Figure S6 in Additional file 1). We found that the average consistency between clusters generated with R = 32 and the ultimate set is 76%.
Mapping uncharacterized elements to modules
We then mapped worm and fly ncRNAs to the 21 modules based on their expression profiles (Materials and methods). Though there is no gold standard available to evaluate systematically the performance of the mapping, we found examples suggesting that ncRNAs from different species could be linked together in terms of their potential functions. For instance, sphinx, the fly long non-coding RNA (lncRNA) expressed in the brain, was shown to be involved in regulation of male courtship behavior . In our analysis, this lncRNA was mapped to module 1, which is characterized by neurological system process and behavior. On the other hand, linc-10 and linc-104, worm large intergenic non-coding RNAs (lincRNAs) that are highly expressed in male-related stages , were mapped to the same module.
In addition to the mapping, we also found that some modules are enriched with different classes of ncRNAs (Figure S8 in Additional file 1). The list of worm and fly ncRNAs we tested and the modules they mapped to can be found in Additional file 3.
Generalization to N species
Here, S 1 , S 2 , S 3 stand for three different species. The inner summation is the modularity function for the network of a single species. The outer summation sums the three networks together. The extra terms represent the coupling (with coupling constant κ) between three pairwise combinations, namely the orthologous pairs represented by O(S 1 , S 2 ), O(S1, S3) and O(S2, S3). The coupling constant κ can be determined by the same approach we explained in the example of two species. Of particular interest is the third-order coupling term for the 1-1-1 triplets, O(S1, S2, S3). A triplet consists of three genes from three species that are orthologous to one another in a one-to-one fashion; that is, for triplet , apart from , no other gene in S 1 is orthologous to and vice versa. In this cost function, the third-order term favors a 1-1-1 triplet to have the same label. The 1-1-1 triplets are of particular importance among all the orthologous triplets because they correspond to particularly conserved biological function. Genes performing less conserved functions are more likely to undergo gene duplication and end up with many-to-many orthologs.
Discussion and conclusions
In this study, we present OrthoClust, an orthology-based network framework that performs data clustering across multiple species. Due to the rapid increase of data from many species, a novel meta-clustering framework that integrates data from different species will be highly useful for comparative genomics. In OrthoClust, a module is defined based on evolutionary information as well as co-association information. A conserved module groups genes from multiple species corresponding to a common biological function, whereas a species-specific module consists of genes that are responsible for a specific function novel to a species. Though we have focused on expression data for illustration, OrthoClust can be readily applied to other high-dimensional data like histone modification patterns or protein-protein interactions by appropriately modifying the cost function. For instance, in the context of interactome, OrthoClust can be used to detect modules in protein-protein interaction networks in a cross-species context and examine the evolution of protein complexes.
In single-species clustering, a conserved gene can be grouped in a species-specific module simply because of their strong tie. By incorporating orthology relationships between species, OrthoClust detects better the conserved modules. The concept of cross-species modules complements the principle of 'guilt by association' because it may potentially lead to functionally analogous elements across species. This is of fundamental interest for elements like ncRNAs because only short regions of ncRNAs are constrained by structure or sequence-specific interactions and, compared to protein-coding genes, this discrepancy in selection pressure makes it very hard to find orthologous ncRNAs by their sequence . RNA-Seq experiments have found that over 60% of the genome is transcribed, and many transcriptionally active regions (TARs) have been identified . Therefore, mapping onto OrthoClust modules based on expression profiles serves at least as an intermediate step to inferring the putative functions of the vast amount of transcriptionally active regions.
Like many clustering algorithms, OrthoClust is unsupervised. Generalization to supervised clustering based on various gold standards are possible in principle. Nevertheless, the essence of OrthoClust is a global optimization process. The exact mathematical form of the cost function depends on the number of species and the types of data. In our application of OrthoClust to expression networks, we have separated the effects of positive and negative edges. The procedure is analogous to the idea of structural balance in the context of social networks . Nevertheless, the cost function of such a signed network has a more complicated energy landscape and therefore makes the optimization process trickier. In this analysis, we did not find a significant improvement compared with the unsigned case. Another aspect concerning the mathematical form of the cost function is worth mentioning. As pointed out by several studies, finding network communities or modules using a global objective function is subject to resolution limits [40, 41]. The limit explains the existence of giant modules detected by OrthoClust. In principle, this could be complemented by a recursive scheme. Nevertheless, an extra coupling parameter λ can be added to the definition of Λ such that Λ ij = A ij - λ k i k j /2 m. The parameter can control the size of resultant modules  but it is commonly chosen to be 1 by convention. In principle, the coupling parameter can be tuned to obtain a better resolution, or to obtain sizes that are more biologically relevant. In terms of minimization of the cost function, we used simulated annealing as a conceptual demonstration. Though it is theoretically possible to obtain the optimal solution, it is computationally expensive. Indeed, there is no simple theory to address the convergence time, which may present problems especially for multiple species scenarios. In principle, one can replace simulated annealing with another heuristic ; there are other faster approaches for module identification for single-layered networks, for instance, spectral techniques . A recent study combining various co-expression networks from the same species based on tensor computation point in a similar direction . As simulated annealing does not scale very well as the number of species increases, the other approaches described may present more efficient solutions.
Efforts have been spent on the comparison of networks from different species (for example, [12, 44, 45, 46, 47]). Earlier work focused on whether individual interactions are conserved across species , and, more recently, attempts have been made to quantify the rate of rewiring . In particular, network alignment is a well-established analysis for cross-species network comparison [44, 45, 46]. Though OrthoClust resembles network alignment, the two problems are quite different. The essence of network alignment is to understand how individual nodes and edges in one network have their counterparts in another network, whereas OrthoClust focuses on whether genes working together in one species (signified by the dense connections between genes) would preserve the collaboration in another species (another dense region). Network alignment thus involves greater topological details, and to a certain extent is a harder problem. As many of the networks constructed are rather incomplete and there are possibly false positives, detecting modules is generally less sensitive to these errors compared with network alignment. It is worthwhile to point out that while the original motivation of some network alignment algorithms like IsoRank is to improve orthology prediction , conserved modules in OrthoClust could be interpreted as potential homologs and analogs that perform similar functions in different organisms. The incorporation of corresponding co-association networks is actually important because common orthology detection approaches focus on the sequence level . Because of the resemblance of network alignment and cross-species clustering, one could also modify OrthoClust by replacing the orthology relationships using aligned gene-pairs. As OrthoClust is a flexible computational framework, such modification would be technically straightforward but conceptually interesting to explore.
Materials and methods
Datasets of transcriptome and orthologous pairs
Transcriptome profiling data from worm and fly were generated by the modENCODE consortia using RNA-Seq . The expression values of worm and fly were measured across 33 and 30 developmental stages, respectively. The total 10,031 worm-fly orthologous pairs (including one-to-one, one-to-many, many-to-many relationships from 5,769 unique worm orthologous genes and from 5,507 unique fly orthologous genes) between worm and fly were downloaded from the modENCODE website as they were compiled by the consortium . At the genome-wide level, there are 20,377 worm genes and 13,623 fly genes. For each species, expression values in different developmental stages or cell lines were log-transformed and standardized and Pearson correlation coefficients were calculated for each pair of genes. The list of ncRNAs in worm and fly were obtained from wormbase  and flybase , including lncRNAs (228 in worm, 852 in fly), microRNAs (211 in worm, 215 in fly), small nucleolar RNAs (snoRNAs) (141 in worm, 287 in fly) and tRNAs (236 in worm, 238 in fly); more than 100 RNAs of each type are present in both worm and fly. The accession numbers for raw data, the processed expression values from the RNA-Seq experiments, and the orthologous pairs between worm and fly can be found in Additional file 4.
More details on OrthoClust
The cost function
To take into account the fact that many orthologous pairs are not one-to-one but many-to-many, the contribution of a pair of orthologs to the generalized modularity function is not 1, but normalized by the number of orthologs. For example, if gene i from species 1 is orthologous to genes in species 2 including gene j’ whereas gene j’ in species 2 is orthologous to genes in species 1 including gene i, the weight w ij’ is defined as . For simplicity, this modification is not displayed in the main equation.
Standard simulated annealing was employed. Spin values were randomly assigned initially, and updated via a heat bath algorithm. The initial temperature was chosen in a way such that the flipping rate (the probability that a node changes its spin state) was higher than 1 - 1/q. The temperature was gradually decreased with a cooling factor 0.9, until the flipping rate was less than 1%.
More details on applying OrthoClust to cluster expression profiles
Construction of individual co-expression networks
Many algorithms have been proposed to transform raw expression profiles into individual co-expression networks based on calculating the N-by-N Pearson correlation matrix [19, 20, 21, 22]. There are two classes of algorithms: those imposing a global threshold on the values of the correlation coefficients for all genes (value-based), and those that locally allow each gene to connect to the top d most correlated genes (rank-based). Networks constructed by global value-based algorithms were found to be more difficult to resolve into smaller modules. Therefore, a local rank-based algorithm in which each gene is connected to the top d genes with the highest (absolute) Pearson correlation was employed . The value of d was chosen in order to keep the sparsity of networks. More specifically, d is the smallest value such that all genes from individual species independently form giant connected networks In general, if d is very small, the resultant network by definition cannot form a giant connected graph. On the other hand, if d is very large, the network would not be sparse. In the worm fly analysis, d was chosen to be 5. Even though the numbers of nodes and edges in the two co-expression networks vary, the average number of links per node is quite similar (6.29 for worm and 6.56 for fly).
Decomposition of modules in worm and fly
In the genome-wide worm fly analysis, a stringent threshold (0.95) for co-appearance was employed for the co-appearance matrix shown in Figure 3A. Nodes that ended up with the same spin value for more than 95% of trials were assigned to the same module. Tiny clusters were neglected, arriving at a set of about 150 modules covering about 80% of nodes. Proper GO terms were found in the remaining modules.
GO similarity between pairs of genes
There are many metrics for quantifying gene functional similarity based on GO terms [28, 50, 51, 52, 53]. The metric used in this study was adopted from . The relationship where gene i is annotated with GO term j was represented by an adjacency matrix B, and further a matrix G was defined such that where n is the number of genes. In matrix G, the contribution of a GO term j to a gene is weighted by its inverse document frequency, a quantity commonly used in text-mining . High-level GO terms present in many genes were therefore weighted down. The similarity score between two genes k 1 and k 2 was defined as the cosine of the two corresponding vectors (the row and the row in the G matrix).
Regulatory patterns of modules
ChIP-Seq data of 26 fly transcription factors and 79 worm transcription factors across various developmental stages (together 220 experiments in worm and 93 experiments in fly) were downloaded from the modENCODE consortium. For each ChIP-Seq experiment, binding targets of the transcription factors were identified by TIP  with a q-value cutoff of 0.01. The results of these experiments were superposed together to form the transcriptional regulatory networks for worm and fly (12,648 edges for worm and 1,187 edges for fly). The number of common transcription factors for a pair of genes was determined based on the resultant networks.
Comparison with single-species clustering
Standard clustering procedures, including k-means, hierarchical clustering and PAM (Partition Around Medoids), were applied for transcriptome profiling data from worm and fly. Resultant modules of size less than five genes were neglected. In summary, for all three methods, about 200 worm modules and 200 fly modules remained. For each combination of these modules, the number of orthologous pairs between worm and fly genes was counted. The number of orthologous pairs was then compared to the expected number , where n w and n f are the number of genes in the worm and fly modules, N w and N f are the total number of worm and fly genes, and O wf is the number of orthologous pairs between worm and fly. Only combinations with enrichment of orthologous pairs were kept for the display in Figure 6 (P < 0.05, hypergeometric test). For OrthoClust with κ = 0, modules of size less than 5 genes were also neglected in the comparison, resulting in 314 worm modules and 227 fly modules.
Comparison with network alignment
We applied IsoRank to align the worm and fly co-expression networks. The sequence identity between pairs of worm and fly proteins were downloaded from . We tuned the intrinsic parameter α but we did not find systematic trends. We then used α = 0.5 and looked for co-expression edges in two networks whose nodes are aligned. Disconnected components formed by these aligned edges were used as potential seeds of conserved modules because they consist of sets of worm genes and fly genes that are perfectly aligned.
To compare two sets of clusters A and B, all possible N(N - 1)/2 pairs of genes were divided into four categories: (I) assigned to the same module by both A and B; (II) assigned to the same module by A but not by B; (III) assigned to the same module by B but not by A; (IV) assigned to different modules by both A and B. Because the number of pairs in category IV (true negative to a certain extent) is orders of magnitude higher than the others, the overlap between A and B was defined as I/(I + II + III). The number of pairs in category I can be viewed as the true positive. The method is motivated by .
More details on inferring the functions of worm fly ncRNAs
Modules based on worm fly core set
OrthoClust was applied to the set of orthologs between worm and fly, consisting of 5,059 worm genes and 4,863 fly genes. The coupling constant was determined using the same scheme illustrated in the main text. A set of 21 modules that each had more than 15 genes and with enriched GO terms was derived. As expected, the similarity between genes within modules was higher than the similarity between genes across modules (P = 1 × 10-83, Wilcoxon test). To annotate the functions of a module, the enriched GO terms among genes were obtained using the tool REVIGO . The enriched GO terms were clustered into groups labeled by representative keywords given by REVIGO. The list of keywords was displayed using tag clouds (using the software TagCrowd ) in which the size of a keyword is proportional to the number of GO terms in the group.
Mapping ncRNAs to modules
Using RNA-Seq data generated by the modENCODE consortium, the expression profiles of ncRNAs were calculated under the same set of developmental stages as for the protein-coding genes. Annotation of ncRNAs was based on the latest version of wormbase  and flybase . The ncRNAs were then mapped to the 21 modules based on the correlation between expression levels. More specifically, the correlation between the expression of the ncRNAs and the expression of all protein-coding genes was calculated. A null distribution was constructed by randomly shuffling the expression values of the ncRNAs 10 times, calculating the correlation coefficients between the randomized expression profile with expression profiles of all the protein coding genes, and pooling all values together. A set of protein-coding neighbors of the ncRNAs was identified as the set of most correlated protein-coding genes with a false discovery rate of less than 5% with respect to the null distribution. The enrichment of the set of neighbors in every module was calculated by a hypergeometric test. The ncRNAs would be mapped to a given module if P < 0.01. An ncRNA could be mapped to multiple modules.
Enrichment of different classes of ncRNAs in modules
To obtain the enrichment of a particular class of ncRNAs (microRNAs, tRNAs, snoRNAs, lncRNAs) with respect to the set of all ncRNAs in a given module, a hypergeometric test was employed to calculate the significance of the fraction of mapped ncRNAs of this class to four classes in the module against the fraction of total mapped ncRNAs of this class to four classes across all modules.
The code used for optimizing cost function (in MATLAB and R) is available online .
We acknowledge Arif Harmanci, Michael Rutenberg Schoenberg and Gang Fang for valuable discussions. We acknowledge Wyatt Clark for critically reading the manuscripts. We acknowledge support from NIH and the AL Williams Professorship funds. This work was supported in part by the facilities and staff of the Yale University Faculty of Arts and Sciences High Performance Computing Center.
- 24.Kang HJ, Kawasawa YI, Cheng F, Zhu Y, Xu X, Li M, Sousa AMM, Pletikos M, Meyer KA, Sedmak G, Guennel T, Shin Y, Johnson MB, Krsnik Ž, Mayer S, Fertuzinhos S, Umlauf S, Lisgo SN, Vortmeyer A, Weinberger DR, Mane S, Hyde TM, Huttner A, Reimers M, Kleinman JE, Šestan N: Spatio-temporal transcriptome of the human brain. Nature. 2011, 478: 483-489.PubMedPubMedCentralCrossRefGoogle Scholar
- 27.Gerstein MB, Rozowsky J, Yan K-K, Wang D, Cheng C, Brown JB, Davis CA, Hillier L, Sisu C, Li JJ, Pei B, Harmanci AO, Duff MO, Djebali S, Alexander RP, Alver B, Auerbach R, Bell K, Bickel PJ, Boeck ME, Boley NP, Booth BW, Cherbas L, Cherbas P, Di C, Dobin A, Drenkow J, Ewing B, Fang G, Fastuca M, et al: Comparative analysis of the transcriptome across distant species. Nature. doi:10.1038/nature13424Google Scholar
- 29.Boyle AP, Araya CL, Brdlik C, Cayting P, Cheng C, Cheng Y, Gardner K, Hillier L, Janette J, Jiang L, Kasper D, Kawli T, Kheradpour P, Kundaje A, Li JJ, Ma L, Niu W, Rehm EJ, Rozowsky J, Slattery M, Spokony R, Terrell R, Vafeados D, Wang D, Weisdepp P, Wu Y-C, Xie D, Yan K-K, Feingold EA, Good PJ, et al: Comparative analysis of regulatory information and circuits across diverse species. Nature. doi:10.1038/nature13668Google Scholar
- 33.Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, Cabili MN, Jaenisch R, Mikkelsen TS, Jacks T, Hacohen N, Bernstein BE, Kellis M, Regev A, Rinn JL, Lander ES: Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009, 458: 223-227.PubMedPubMedCentralCrossRefGoogle Scholar
- 38.Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, et al: Landscape of transcription in human cells. Nature. 2012, 489: 101-108.PubMedPubMedCentralCrossRefGoogle Scholar
- 42.Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E: Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008, 2008: P10008Google Scholar
- 49.Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, De La Cruz N, Davis P, Duesbury M, Fang R, Fernandes J, Han M, Kishore R, Lee R, Müller H-M, Nakamura C, Ozersky P, Petcherski A, Rangarajan A, Rogers A, Schindelman G, Schwarz EM, Tuli MA, Van Auken K, Wang D, Wang X, Williams G, Yook K, Durbin R, Stein LD, et al: WormBase: a comprehensive resource for nematode research. Nucleic Acids Res. 2010, 38 (Database issue): D463-D467.PubMedPubMedCentralCrossRefGoogle Scholar
- 51.Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA: The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007, 8: R183-PubMedPubMedCentralCrossRefGoogle Scholar
- 58.TagCrowd. [http://tagcrowd.com/]
- 60.OrthoClust. [https://github.com/gersteinlab/OrthoClust]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.