Graph ranking for exploratory gene data analysis
Abstract
Background
Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes in a single experiment. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address this challenge. Gene selection has been investigated extensively over the last decade. Most selection procedures, however, are not sufficient for accurate inference of the underlying biology, because biological significance does not necessarily have to be statistically significant. Additional biological knowledge needs to be integrated into the gene selection procedure.
Results
We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. Under a species condition, edge weights of the graph are assigned to be gene expression level. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of genes and molecular functions can be simultaneously ranked by a real-valued measure, KSD, which incorporates the global and local structure of the graph. Over-expressed and under-regulated genes can also be ranked separately.
Conclusion
The gene-function bigraph integrates molecular function annotations into gene expression data. The relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.
Keywords
Gene Ontology, Gene Expression Data, Bipartite Graph, Depth Function, Differentially Express
Background
Introduction
Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address the challenge for two primary reasons. First, multivariate methods are prone to overfitting. This problem is aggravated when the number of variables is large compared to the number of examples, and even worse for gene expression data which usually has ten or twenty thousand genes but with only a very limited number of samples. It is not uncommon to use a variable ranking method to filter out the least promising variables before using a multivariate method. The second reason for ranking the importance of genes is that identifying important genes is, in and of itself, interesting. For example, to answer the question of what genes are important for distinguishing between cancerous and normal tissue may lead to new medical practices.
Gene selection has been investigated extensively over the last decade by researchers from the statistics, data mining and bioinformatics communities. There are basically two approaches. One approach treats gene selection as a preprocessing step. It usually comes with a measure to rank genes. Fold change is a simple measure used in [1]. Dudoit, et al. [2] performed a selection of genes based on the between-group and within-group variance ratios. Golub, et al. [3] used a different method for standardizing the data for selecting genes. Pepe, et al. [4] considered two measures related to the Receiver Operating Characteristic (ROC) curve for ranking genes. Strength of statistical evidence, such as p-values of hypothesis testing [5], is also a commonly used measure for gene selection. Storey and Tibshirani [6] proposed a measure of significance called the q-value, based on the concept of false discovery rate. The other common approach embeds gene selection into a specific learning procedure. Fan and Li [7] proposed penalized likelihood methods for regression to select variables and estimate coefficients simultaneously. Lee, et al. [8] proposed a hierarchical Bayesian model for gene selection. They employed latent variables to specialize the model to a regression setting and used a Bayesian mixture prior to perform the variable selection. Recursive feature elimination (RFE) methods with support vector machines (SVM), e.g. [9, 10, 11, 12], have been shown to be successful for gene selection and classification. L_{1} SVMs perform variable selection automatically by solving a quadratic optimization problem, e.g. [13, 14, 15]. Díaz, et al. [16] applied a random forest algorithm for classification and, at the same time, for selecting genes based on the permuted importance score.
Mukherjee and Roberts [17] provided a theoretical analysis of gene selection, in which the probability of successfully selecting relevant genes, using a given gene ranking function, is explicitly calculated in terms of population parameters. For a more comprehensive survey of this subject, the reader is referred to [18, 19], and [20].
In most cases, genes selected by the aforementioned procedures are not sufficient for accurate inference of the underlying biology, because biological significance does not necessarily have to be statistically significant [21]. For example, suppose a gene with low differential expression is a transcription factor that controls the expression of some other genes. The transcription factor itself may be activated by the treatment but its expression may not be significantly changed. Hence, an ideal selection procedure should be able to highlight the transcription factor. To do so, additional biological knowledge must be integrated into it. With the development of biological knowledge databases, biologically interesting sets of genes, for example genes that belong to a pathway or genes known to have the same molecular function, can be compiled, for example from Gene Ontology [22], see GO Consortium (2008). There have been many publications combining gene expression with GO lately. One common approach is to find enriched gene sets annotated by GO terms which are overrepresented among the differentially expressed genes in the analysis of microarray data. See [23, 24, 25, 26], and [27] for details of enrichment. The other approach is to use a GO graph to improve identification of differentially expressed genes. Morrison, et al. [28] constructed a gene-gene graph derived from GO and used GeneRank, which is a modification of PageRank (the ranking algorithm used in the Google search engine), for prioritizing the importance of genes. Gene expression data was cleverly used to specify "the personalization vector" in PageRank. Ma et al. [29] first computed an individual score for each gene from gene expression profiles, then combined the scores of a gene and its direct and indirect neighbors in the gene-gene graph derived from GO or a protein-protein interaction network to obtain a more accurate gene ranking.
Daigle and Altman [30] developed a probabilistic model that integrates biological knowledge with microarray data to identify differentially expressed (DE) genes. They introduced a latent binary variable (DE/not DE) and used a learning algorithm on a stochastic, binary state network to estimate a ranking score. Srivastava, et al. [31] used the GO structure to compute the similarity between genes and combined gene expression data in a ridge regression for gene selection. Clearly, an approach integrating GO and gene data captures the dependence structure of genes without sacrificing gene-level resolution. It provides more reliable results than methods relying on gene expression data alone, as demonstrated later.
In this paper, we propose an exploratory framework of gene ranking that utilizes gene expression profiles and GO annotations. The contributions of this paper are described as follows.
Our contributions

Bigraph representation of biological information of genes. We extract biological information from the GO database. One of the three GO ontologies (molecular function) is used; the other two types of annotation, biological process and cellular component, can be used similarly. A bipartite graph is constructed with one partition being genes and the other molecular functions. If a gene is associated with a particular function, the gene and the function are joined by an edge. Such a graph structure represents species-independent biological knowledge among genes indirectly (through common functions). Furthermore, using gene expression studies, the weight of the edge is assigned to be the expression level of the gene associated with the edge. This integrates the species-dependent information into the graph. The weighted graph thus conveys the gene dependency structure.

A new graph ranking algorithm. We introduce a new measure, kernelized spatial depth (KSD), to rank the nodes of a graph. Spatial depth (SD) provides a center-outward ordering of a data set in a Euclidean space ℝ^{ d }. It is a global concept. KSD generalizes the notion of spatial depth by incorporating the local perspective of the data set. Applying KSD to a graph provides the ranking of nodes, which takes into consideration both global and local structures of the graph. For sparse graphs, the algorithm is efficient with computational complexity O(n^{2}), where n is the number of nodes of the graph. The algorithm can be easily modified to handle dynamic data sets. It can also be parallelized to scale up for large data sets.

Better interpretation. Under a specified condition, not only is the importance of genes ranked, but the importance of functions is also ranked. This provides us with a better understanding and insight into the roles of various genes and molecular functions by analyzing bigraphs with gene expression profiles under different conditions. We demonstrate the performance of the proposed procedure using gene data from Gene Expression Omnibus (GEO). The new methods exhibit a higher level of biological relevance than competing methods.
Unlike the gene-gene network construction used in GeneRank, the gene-function bigraph structure has several advantages. It combines the gene expression profiles easily and naturally by assigning them to be weights of the graph. In addition, the importance of genes and molecular functions can be simultaneously ranked. Bipartite graph modeling was also used by Dhillon [32] and Zha, et al. [33] to co-cluster documents and words due to those advantages. Tanay, et al. [34] formed a gene-condition bigraph to find gene clusters in gene expression data.
The rest of this paper is organized as follows. After a brief introduction of some preliminaries on graphs, we introduce the KSD measure to rank vertices of a graph, followed by a discussion of the choice of kernels and their comparison. In the application, gene-function bigraphs are constructed to combine species-independent biological knowledge extracted from GO and species-dependent information contained in gene expression profiles. We apply our KSD ranking method to real data sets. Our conclusions and discussion are given in the last section.
Methods
Preliminaries of graphs and a motivating example
A graph G consists of a set of vertices (nodes) V and a set of edges E that connect vertices. The vertices are entities of interest and the edges represent relationships between the entities. Edges can be assigned positive weights W to quantify how strong the relationships are. Such a graph is called a weighted graph. Unweighted graphs are just the special case with all the weights equally being 1.
A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets V_{1} and V_{2} such that every edge connects a vertex in V_{1} to one in V_{2}. In our application, a bipartite graph is constructed with one set of vertices being genes and the other set of vertices being one of the Gene Ontology (GO) molecular functions.
The degree of a vertex v ∈ V, denoted d_{v}, is defined as the sum of the weights of the edges incident to v, i.e. d_{v} = Σ_{u: (v, u) ∈ E} W(v, u). For an unweighted graph, the degree of v is simply the number of incident edges.
Spatial depth and kernelized spatial depth
We first introduce spatial depth in the Euclidean space ℝ^{ d }, then generalize it to kernelized spatial depth, which is the spatial depth on the feature space induced by a positive kernel. In order to extend the concept of KSD to a graph, the kernel on the graph must be specified. We define several graph kernels and present the KSD algorithm to obtain the depth of every vertex of the graph.
Spatial depth
From the definition, it is not difficult to see that points deep inside a data cloud receive high depth and those on the outskirts get lower depth. Each observation from a data set contributes equally, as a unit vector, to the value of the depth function. In this sense, spatial depth takes a global view of the data set. On the one hand, the spatial depth downplays the significance of distance and hence reduces the impact of extreme observations whose extremity is measured in (Euclidean) distance, so that it gains resistance against these extreme observations. Robustness is a favorable property of spatial depth [36]. Ding, et al. [37] constructed a robust clustering algorithm based on it. On the other hand, the robustness of the depth function trades off some distance measurement, resulting in a certain loss of the measurement of (dis)similarity of the data points. To overcome this limitation of spatial depth, Chen, et al. [38] proposed kernelized spatial depth (KSD), which incorporates into the depth function a distance metric (or a similarity measure) induced by a positive definite kernel function.
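As a rough illustration of the behaviour described above, the following Python sketch computes a sample spatial depth. It assumes the common form D(x) = 1 − ‖(1/n) Σ_{i} S(y_{i} − x)‖, where S(v) = v/‖v‖ is the unit vector in the direction of v and S(0) = 0; the normalization by n and the names `spatial_depth`, `cloud`, `center_depth` and `far_depth` are choices made for this sketch and may differ from the authors' exact definition.

```python
import math


def spatial_depth(x, data):
    """Sample spatial depth of point x with respect to the points in data.

    Assumed form: 1 - || mean of unit vectors pointing from x to each y ||.
    """
    total = [0.0] * len(x)
    for y in data:
        diff = [yi - xi for yi, xi in zip(y, x)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # a point coinciding with x contributes the zero vector
        for k, d in enumerate(diff):
            total[k] += d / norm  # every observation contributes a unit vector
    return 1.0 - math.sqrt(sum(t * t for t in total)) / len(data)


# Four points arranged symmetrically around the origin.
cloud = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center_depth = spatial_depth((0.0, 0.0), cloud)   # unit vectors cancel out
far_depth = spatial_depth((100.0, 0.0), cloud)    # unit vectors nearly align
```

Because each observation enters only through its direction, moving one of the four points much farther out along the same ray would leave the depth of the origin unchanged, which is the robustness property discussed above.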
Kernelized spatial depth
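The sample KSD of a point x with respect to a data set 𝒳 is the spatial depth of ϕ(x) among the mapped points {ϕ(y) : y ∈ 𝒳}, where ϕ is the feature map of a positive definite kernel κ. Written in terms of κ alone,
D_{κ}(x, 𝒳) = 1 − (1/(|𝒳 ∪ {x}| − 1)) [Σ_{y, z ∈ 𝒳} (κ(x, x) + κ(y, z) − κ(x, y) − κ(x, z)) / (δ_{κ}(x, y) δ_{κ}(x, z))]^{1/2},
where δ_{κ}(x, y) = [κ(x, x) + κ(y, y) − 2κ(x, y)]^{1/2}.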
The value of KSD depends only on κ; the map ϕ need not be known explicitly. In ℝ^{ d }, one of the popular positive definite kernels is the Gaussian kernel κ(x, y) = exp(−‖x − y‖^{2}/σ^{2}), which can be interpreted as a similarity between x and y. For a graph, we must consider what a good similarity measure will be, and how to construct an appropriate kernel matrix efficiently.
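The point that ϕ never has to be computed can be made concrete with a small Python sketch. It uses the identity ‖ϕ(x) − ϕ(y)‖^{2} = κ(x, x) + κ(y, y) − 2κ(x, y), which holds for any positive definite kernel; the function names and the value σ = 1 are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)


def feature_distance(kernel, x, y):
    """Distance between phi(x) and phi(y), computed from kernel values only."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative round-off


origin = (0.0, 0.0)
near = feature_distance(gaussian_kernel, origin, (0.1, 0.0))
far = feature_distance(gaussian_kernel, origin, (5.0, 0.0))
same = feature_distance(gaussian_kernel, origin, origin)
```

For the Gaussian kernel the feature-space distance is bounded by √2, so very distant points all look roughly equally far away, while nearby points are still distinguished. This is one way a kernel adds a local perspective to the depth.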
Choice of graph kernels
Various kernels on graphs can be found in recent literature, for example [39, 40, 41], and [42]. Ando and Zhang [43] provide some theoretical insights into the role of normalization of the graph Laplacian matrix. We consider five Laplacian kernels, including complement Laplacian kernel, which is proposed here. Each kernel is described, followed by a comparison and discussion of computational issues of these kernels.
Laplacian kernel
where D is a diagonal matrix whose diagonal entries are the vertex degrees, i.e. D_{ii} = d_{i} = Σ_{j} W(i, j).
From the above result, we can see that the distance between two adjacent vertices in the feature space is larger than that of two disconnected vertices. The mapping ϕ reverses the relationship between two vertices in the graph. In this sense, we can view the Laplacian kernel as a dissimilarity matrix. In other words, a vertex close to the center in the graph turns into a vertex far from the center in the feature space. Therefore, a smaller KSD value indicates a higher rank of the vertex in the graph when choosing the Laplacian as the kernel. It is interesting but not consistent with the usual kernels that describe the similarity between two vertices. Next we look at several alternatives to Laplacian kernel.
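The reversal described above can be checked numerically. The sketch below assumes the unnormalized Laplacian L = D − W as the kernel (an assumption of this sketch; a normalized variant would behave analogously) and computes squared feature-space distances as K_{ii} + K_{jj} − 2K_{ij}. The path graph and the function names are illustrative.

```python
def laplacian(weights):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    return [[(degrees[i] if i == j else 0) - weights[i][j] for j in range(n)]
            for i in range(n)]


def squared_feature_distance(kernel, i, j):
    """||phi(i) - phi(j)||^2 expressed through kernel matrix entries."""
    return kernel[i][i] + kernel[j][j] - 2 * kernel[i][j]


# Path graph 0 - 1 - 2 with unit weights.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
L = laplacian(path)
adjacent = squared_feature_distance(L, 0, 1)       # vertices 0 and 1 share an edge
disconnected = squared_feature_distance(L, 0, 2)   # vertices 0 and 2 do not
```

For an edge (i, j) the squared distance is d_{i} + d_{j} + 2W(i, j), while for a non-adjacent pair it is only d_{i} + d_{j}, so connected vertices are pushed apart in the feature space.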
Laplacian of complement graph kernel
There is no question that nI − E − L is symmetric and positive semidefinite. Notice that the Laplacian of the complement graph is defined in terms of the negative Laplacian of the original graph. Hence it reverses the dissimilarity measure of L_{ G }. In other words, the Laplacian of the complement graph is a similarity matrix. Therefore, a larger KSD value with the Laplacian of the complement as the kernel indicates a deeper vertex in the graph, as we expect. This kernel is especially useful for dense graphs. For a dense graph, the Laplacian of the complement graph may be a sparse matrix, which leads to an efficient implementation of the KSD algorithm.
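A small Python check of the expression nI − E − L for an unweighted graph, assuming E denotes the n × n all-ones matrix and L = D − W is the unnormalized Laplacian (both are assumptions of this sketch). It builds the complement graph explicitly and compares its Laplacian with nI − E − L.

```python
def laplacian(adj):
    """Unnormalized Laplacian D - W of a 0/1 adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def complement(adj):
    """Adjacency matrix of the complement graph (no self-loops)."""
    n = len(adj)
    return [[0 if i == j else 1 - adj[i][j] for j in range(n)]
            for i in range(n)]


def complement_laplacian_kernel(adj):
    """nI - E - L, with E taken to be the all-ones matrix."""
    n = len(adj)
    lap = laplacian(adj)
    return [[(n if i == j else 0) - 1 - lap[i][j] for j in range(n)]
            for i in range(n)]


# Path graph 0 - 1 - 2; its complement has the single edge 0 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
direct = laplacian(complement(path))
via_identity = complement_laplacian_kernel(path)
```

The two matrices agree because a vertex of degree d_{i} has degree n − 1 − d_{i} in the complement graph, and every missing edge becomes an edge.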
Diffusion Laplacian kernel
From Taylor expansion of exponential function, it is not difficult to show that K_{ D }is symmetric positive definite and all entries are nonnegative.
The diffusion Laplacian kernel behaves in the opposite way to the Laplacian kernel. Therefore, as with the Laplacian of the complement graph kernel, a larger KSD value under the diffusion Laplacian kernel indicates a more central vertex in the graph.
Pseudoinverse Laplacian kernel
where Λ^{+} is a diagonal matrix whose (i, i) diagonal element is 1/λ_{i}. For convenience, we define 1/λ_{i} = 0 if λ_{i} = 0. Clearly, K_{P} is also positive semidefinite, which means that it is indeed a valid kernel.
p-step random walk kernel
where p is a positive integer and a ≥ 2. The name of the kernel is based on the fact that (aI − ℒ)^{p} is, up to scaling terms, equivalent to a p-step random walk on the graph with random restarts. Since it involves −ℒ, it is a similarity kernel.
In particular, a p-step random walk kernel with a = 2 and p = 1, K_{R} = 2I − ℒ, converts the off-diagonal dissimilarities in a Laplacian kernel to off-diagonal similarities. It is simple in form, which makes it attractive for practical purposes.
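The kernel (aI − ℒ)^{p} can be sketched in a few lines of Python. The sketch assumes ℒ is the normalized Laplacian I − D^{−1/2} W D^{−1/2}, whose eigenvalues lie in [0, 2] so that a ≥ 2 keeps the kernel positive semidefinite; this reading of ℒ, and the helper names, are assumptions of the sketch.

```python
import math


def normalized_laplacian(weights):
    """I - D^{-1/2} W D^{-1/2} for a symmetric, non-negative weight matrix."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    # Isolated vertices (degree 0) receive no off-diagonal similarity.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * weights[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_walk_kernel(weights, a=2.0, p=1):
    """p-step random walk kernel (aI - L_norm)^p."""
    lap = normalized_laplacian(weights)
    n = len(lap)
    base = [[(a if i == j else 0.0) - lap[i][j] for j in range(n)]
            for i in range(n)]
    kernel = base
    for _ in range(p - 1):
        kernel = matmul(kernel, base)
    return kernel


path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
K1 = random_walk_kernel(path, a=2.0, p=1)
K2 = random_walk_kernel(path, a=2.0, p=2)
```

With a = 2 and p = 1 the off-diagonal entry for an edge (i, j) is W(i, j)/√(d_{i} d_{j}) ≥ 0, and the kernel is exactly as sparse as the weight matrix, which is what makes this choice cheap for sparse graphs.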
Ranking algorithm based on KSD for graphs
Given a graph G and a specified kernel, the following pseudocode describes the procedure to calculate the kernelized spatial depth values of all vertices.
Algorithm 1 KSD Algorithm
Get the Laplacian ℒ of the input graph G
Choose and compute the kernel matrix K
FOR (every vertex m in G)
    FOR (every vertex i in G)
        t = (K_{mm} + K_{ii} − 2K_{mi})^{1/2}
        IF t = 0
            α_{i} = 0
        ELSE
            α_{i} = 1/t
        END
    END
    FOR (every pair of vertices i, j in G)
        M_{ij} = K_{mm} + K_{ij} − K_{mi} − K_{mj}
    END
    D_{κ}(m) = 1 − (Σ_{i, j} α_{i} α_{j} M_{ij})^{1/2} / (n − 1)
END
OUTPUT D_{κ}
From the above algorithm, the computational cost of KSD for all vertices depends on the sparseness of the kernel matrix. For a sparse kernel matrix it is O(n^{2}); otherwise it is O(n^{3}). Because the depth of each vertex m is computed independently of the others, the algorithm can be sped up by running it on multiple CPUs or computers even without the help of parallel programming techniques.
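Algorithm 1 can be sketched in Python as follows. The sketch takes a precomputed kernel matrix and assumes the standard KSD form, in which t is the kernel-induced distance (K_{mm} + K_{ii} − 2K_{mi})^{1/2} and the depth of vertex m is 1 − (Σ α_{i} α_{j} M_{ij})^{1/2}/(n − 1); these two steps, the dense double loop, and the name `ksd_all` are assumptions of this illustration rather than the authors' implementation.

```python
import math


def ksd_all(kernel):
    """Kernelized spatial depth of every vertex, given an n x n kernel matrix."""
    n = len(kernel)
    depths = []
    for m in range(n):
        # alpha_i is the reciprocal feature-space distance from m to i (0 if they coincide).
        alpha = []
        for i in range(n):
            t_sq = kernel[m][m] + kernel[i][i] - 2.0 * kernel[m][i]
            t = math.sqrt(max(t_sq, 0.0))
            alpha.append(0.0 if t == 0.0 else 1.0 / t)
        # Accumulate the squared norm of the sum of unit vectors pointing away from m.
        total = 0.0
        for i in range(n):
            if alpha[i] == 0.0:
                continue
            for j in range(n):
                m_ij = kernel[m][m] + kernel[i][j] - kernel[m][i] - kernel[m][j]
                total += alpha[i] * alpha[j] * m_ij
        depths.append(1.0 - math.sqrt(max(total, 0.0)) / (n - 1))
    return depths


# Kernel 2I - L_norm for the path graph 0 - 1 - 2 (entries written out by hand).
s = 1.0 / math.sqrt(2.0)
K = [[1.0, s, 0.0],
     [s, 1.0, s],
     [0.0, s, 1.0]]
depths = ksd_all(K)
```

On this toy path graph the middle vertex receives the largest depth and the two end vertices tie, which matches the reading that a similarity kernel assigns larger KSD to more central vertices. The dense double loop is written for clarity; exploiting sparsity of K would be needed to approach the O(n^{2}) cost mentioned above.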
Comparison of kernels
In the real world, most networks (graphs), such as the World Wide Web and biological networks including the gene-function bipartite graphs we will construct later, are sparse, which means that the associated weight matrices are sparse. The complement Laplacian kernel is not suitable because of its expensive computation cost O(n^{3}). Since the diffusion kernel and the pseudoinverse kernel require a spectral decomposition of ℒ, which has O(n^{3}) complexity, and the resulting kernels are usually very dense, they are not attractive either. The Laplacian kernel is difficult to interpret, so we prefer the p-step random walk kernel.
In our application work in the next section, we rank the importance of genes by KSD using the p-step random walk kernel with a = 2 and p = 1.
Application to gene data
In our application, gene expression of budding yeast (Saccharomyces cerevisiae) cells treated with the DNA-reactive compounds cisplatin (CIS), methyl methanesulfonate (MMS), and bleomycin (BLE) to induce genotoxic stress is compared with gene expression of Saccharomyces treated with the DNA non-reactive compounds ethanol (EtOH) and sodium chloride (NaCl) to produce cytotoxic stress. Our goal is to identify a small number of biologically relevant genes capable of differentiating the mechanisms of toxicity of the known genotoxic compounds from those of the cytotoxic compounds. In order to do so, we use the following basic methodology:

Construct an unweighted gene-function bigraph based on GO with one partition representing genes and the other representing molecular function.

Preprocess and combine data from the gene expression samples into one set per compound.

For each compound, add weights to the bigraph using the gene expression data.

Run the KSD algorithm on each bigraph to develop a gene expression profile of ranked genes for each compound.

Compare the ranked gene sets.
Details of these steps are provided below.
General construction of gene-function bigraph
In order to integrate biological information and gene expression data, one of the gene ontologies, the molecular function descriptions of genes, is used. In the GO database, the ontologies are structured as rooted directed acyclic graphs (DAGs). The terms close to the root are more abstract than the terms far away from the root. We first extract the most specific functions associated with each gene to form the set of GO function terms. With one set of functions and the other set of genes, a bipartite graph is established. Consider Figure 3. Gene YGR098C is associated with the GO function term 0004197, which describes cysteine-type endopeptidase activity. Genes YMR154C and YNL223W also have the same function. So in the bipartite graph, gene YGR098C is more related to YMR154C and YNL223W than it is to YBL069W.
Algorithm 2 Gene-function bigraph construction algorithm
Input c, a user-specified parameter
Input gene data
Extract the associated GO function terms F
Form the weighted bigraph G = (V, E, W)
FOR each term f_{i} in F
    Obtain all ancestors m of f_{i} and their generation levels l_{im}
END
FOR every pair i, j in F
    Find the nearest common ancestor s
    k = max(l_{is}, l_{js})
    For each gene g_{t} with (g_{t}, f_{i}) ∈ E, add the edge (g_{t}, f_{j}) with weight W_{ti} × c^{k} to G
    For each gene g_{t} with (g_{t}, f_{j}) ∈ E, add the edge (g_{t}, f_{i}) with weight W_{tj} × c^{k} to G
END
OUTPUT G
The construction of the gene-function bigraph combines gene expression profiles and topological similarity in a single framework. Khatri and Drăghici [45] summarized three ways to determine the abstraction level of annotation in their section 2.7. Our approach is a variation of their second method. The user may decide k, the bottom-up level, for annotations. The difference is that we treat the children terms unequally, similar to the weight strategy presented in [24].
Figure 3 demonstrates how to build the structure of the gene-function bigraph. The yellow rectangles represent genes at the bottom level. The blue ellipses and arrows above them form a subgraph of the DAG in the GO database. Solid edges represent the association between gene and function. Dashed lines are added edges that reflect the semantic similarity of function annotations. The graph inside the red dashed box is the gene-function bipartite graph.
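The construction in Algorithm 2 can be sketched on a toy ontology in Python. Several choices here are assumptions made for illustration only: the DAG is given as a child-to-parents dictionary, a term counts as its own ancestor at level 0, the nearest common ancestor is the one minimizing max(l_{is}, l_{js}), derived edges never overwrite direct annotations, and when several derived weights compete the largest is kept. The term and gene names are hypothetical.

```python
from itertools import combinations


def ancestor_levels(term, parents):
    """Map each ancestor of term (term itself at level 0) to its smallest
    generation distance, using breadth-first search up the DAG."""
    levels = {term: 0}
    frontier = [term]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, ()):
                if parent not in levels:
                    levels[parent] = levels[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return levels


def build_bigraph(annotations, parents, c=0.2, max_level=1):
    """annotations: {(gene, term): weight} for direct gene-function edges.
    Returns the edge dictionary with semantic-similarity edges added."""
    edges = dict(annotations)
    terms = sorted({term for _, term in annotations})
    levels = {t: ancestor_levels(t, parents) for t in terms}
    for fi, fj in combinations(terms, 2):
        common = set(levels[fi]) & set(levels[fj])
        if not common:
            continue
        k = min(max(levels[fi][s], levels[fj][s]) for s in common)
        if k > max_level:
            continue  # weights c**k beyond max_level are truncated to zero
        for (gene, term), weight in annotations.items():
            other = fj if term == fi else fi if term == fj else None
            if other is None or (gene, other) in annotations:
                continue
            derived = weight * c ** k
            edges[(gene, other)] = max(edges.get((gene, other), 0.0), derived)
    return edges


# Toy DAG: F1 and F2 are siblings under P; F3 sits under Q; P and Q share root R.
parents = {"F1": ["P"], "F2": ["P"], "F3": ["Q"], "P": ["R"], "Q": ["R"], "R": []}
annotations = {("g1", "F1"): 2.0, ("g2", "F2"): 1.0, ("g3", "F3"): 4.0}
edges = build_bigraph(annotations, parents, c=0.2, max_level=1)
```

In this toy example the sibling terms F1 and F2 exchange down-weighted edges (k = 1), while F3, whose nearest common ancestor with either of them is two levels up, is left unconnected under the truncation.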
Preprocessing of gene expression data
Bigraphs for gene data under each treatment
In our application, we choose c = 1/5. Since r decreases rapidly with k for this choice of c, we truncate r to zero for k > 1 to reduce memory use and computation time. Under Algorithm 2, the bigraph for the MMS treatment has a total of 5232 vertices, including 4675 genes and 557 function terms. The number of edges is 22659. Hence the resulting bigraph is very sparse, with sparsity 0.0017 compared with 1 for the full graph (the graph with edges between all pairs of vertices). We use the p-step random walk kernel to analyze the graph. Since we take log2 expression differences with respect to the control agent, up-regulated genes have positive log2 expression differences and down-regulated genes have negative values. Because edge weights must be positive, these signed values cannot be assigned directly as edge weights in the bigraph. We therefore separate the bigraph into two subgraphs: one with all over-expressed genes and the other with all under-expressed genes. For the subgraph containing down-regulated genes, the weights are assigned to be the absolute values of the log2 expression differences. Then we rank the important genes in those two graphs separately. It is reasonable to do so because we are interested in important induced genes and also in repressed genes. All graph construction and algorithms are implemented using R and Bioconductor.
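The split into an up-regulated and a down-regulated subgraph can be sketched as follows. The data layout (a dictionary of log2 differences per gene and a list of gene-term pairs) and the decision to drop genes with a difference of exactly zero are assumptions of this Python illustration; the authors' implementation is in R and Bioconductor.

```python
def split_by_regulation(log2_diff, annotations):
    """Split gene-function edges into up- and down-regulated subgraphs.

    log2_diff: {gene: log2 expression difference versus control}
    annotations: iterable of (gene, term) pairs
    Returns two {(gene, term): positive weight} dictionaries.
    """
    up, down = {}, {}
    for gene, term in annotations:
        value = log2_diff.get(gene, 0.0)
        if value > 0:
            up[(gene, term)] = value
        elif value < 0:
            down[(gene, term)] = -value  # absolute value keeps weights positive
    return up, down


# Hypothetical genes and terms.
log2_diff = {"g1": 1.5, "g2": -2.0, "g3": 0.0}
annotations = [("g1", "F1"), ("g2", "F1"), ("g2", "F2"), ("g3", "F2")]
up_edges, down_edges = split_by_regulation(log2_diff, annotations)
```

Each subgraph can then be passed separately to the kernel construction and the KSD ranking.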
Validation of improvement using GO
Before we present the result on the genes that are able to potentially differentiate genotoxicity and cytotoxicity, we would like to demonstrate that integrating GO will provide more reliable results than methods only using gene expression data. We consider the three NaCl samples individually, ranking differentially expressed genes in each sample and comparing the degree of overlap of the top 100 gene lists.
For the simplest fold-change method, which ranks genes by the ratio of the expression level of a NaCl-treated sample to the mean expression in the control group, there are seven common genes appearing in the top 100 of the three samples, and only three overlapping in the top 50. When t-statistics are used for ranking genes, there are no genes in the overlap of the top 50 genes from the three samples, and only five genes in the overlap of the top 100 genes. Moreover, only one gene is identified in each sample by both methods. The reasons for such poor performance include the noise level and experimental variability of microarrays. Ranking each gene independently is another contributing factor. Incorporating gene expression profiles and biological knowledge can improve performance.
By integrating GO annotations, a gene-function bigraph is constructed with weights being fold changes or t-statistics for each sample. The KSD ranking on the fold-change weighted graph provides an overlap of 60 genes in the top 100 and 32 in the top 50. There are 45 common genes in all the top 100 lists and 24 in the top 50 if we rank the t-statistic weighted bigraph. Furthermore, 38 common genes are identified in every bigraph based on each sample using either a fold change or a t-statistic. For the other compounds, we obtained a similar result: a small overlap for methods on gene data alone, and a relatively larger overlap for our approach on the GO-derived graph. While our testing used GO function annotations, similar results are expected with the other two ontologies. Note that there is a complete overlap if only GO information is used. Gene-function bigraphs, which combine gene data with GO, enhance the experimental signal and capture the dependence structure of genes. Hence, ranking on bigraphs improves the results.
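The overlap statistic used in this validation is simply the size of the intersection of the top-k lists across replicate samples. A minimal Python sketch, with hypothetical scores and gene names:

```python
def top_k(scores, k):
    """Set of the k highest-scoring genes in a {gene: score} dictionary."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def common_top_k(score_lists, k):
    """Genes appearing in the top k of every sample."""
    return set.intersection(*(top_k(scores, k) for scores in score_lists))


# Three hypothetical replicate samples scored by some ranking method.
sample_a = {"g1": 5.0, "g2": 4.0, "g3": 1.0, "g4": 0.5}
sample_b = {"g1": 3.0, "g2": 1.0, "g3": 9.0, "g4": 0.5}
sample_c = {"g1": 2.0, "g2": 0.1, "g3": 0.2, "g4": 7.0}
shared = common_top_k([sample_a, sample_b, sample_c], k=2)
```

A larger shared set across replicates indicates a ranking that is less sensitive to sample-level noise, which is the sense in which the bigraph-based rankings are said to be more reliable.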
Results
Top 10 induced (up) and repressed (down) genes for each agent.
     | Genotoxic agents                    | Cytotoxic agents
     | MMS        | Bleomycin  | Cisplatin  | EtOH       | NaCl
Up   | YJL088W    | YMR090W    | YJL088W    | YPRWsigma4 | YDR256C
     | YNL241C    | YPR160W    | YMR090W    | YJL088W    | YLR343W
     | YER161C    | YGR180C    | YGR180C    | YML010WA   | YJR078W
     | YJL101C    | YNL202W    | YER142C    | YOL055C    | YJL153C
     | YDL142C    | YGR256W    | YNR019W    | YLR067C    | YNL275W
     | YOR349W    | YJR073C    | YJL101C    | YNL275W    | YJL088W
     | YDR019C    | YOR100C    | YJL026W    | YNL036W    | YER081W
     | YDR001C    | YDR018C    | YKR076W    | YER081W    | YNL071W
     | YBR045C    | YJL026W    | YLL060C    | YLR237W    | YBR221C
     | YJL153C    | YCR083W    | YDR001C    | YJL129C    | YLR142W
Down | YNL327W    | YNL327W    | YNL327W    | YJL178C    | YDR435C
     | YFL017C    | YGL028C    | YDR044W    | YDL142C    | YER009W
     | YNL141W    | YNR067C    | YHR128W    | YIL162W    | YNL327W
     | YOR095C    | YGR006W    | YHL028W    | YCR021C    | YGL143C
     | YOL152W    | YIL149C    | YMR006C    | YGR180C    | YCR021C
     | YOR356W    | YPL276W    | YOR095C    | YGR277C    | YDR071C
     | YLR456W    | YEL048C    | YKL029C    | YHR008C    | YOL143C
     | YHR128W    | YER052C    | YAR050W    | YER023W    | YOR121C
     | YBR038W    | YGL063W    | YLL061W    | YIR044C    | YMR034C
     | YGL143C    | YOR095C    | YNL148C    | YML120C    | YGL055W
Genes with similar responses under genotoxic or cytotoxic stress
Genotoxicity                     | Cytotoxicity
Induced (Up) | Repressed (Down)  | Induced (Up) | Repressed (Down)
YJL088W      | YNL327W           | YJL088W      | YCR021C
YDR001C      | YOR095C           | YNL275W      | YGR180C
YJL178C      | YGR036C           | YER081W      | YBR054W
YMR090W      | YHL028W           | YOR298W      | YNR001C
YOR100C      | YGL028C           | YLR343W      | YDR408C
YLR178C      | YIL149C           |              |
YGL156W      | YGL063W           |              |
YJR073C      | YGR006W           |              |
YGR180C      |                   |              |
We enlarge the search for differentially regulated genes between the two groups to the top 100 genes. Eight other genes are capable of discriminating between genotoxic and cytotoxic agents. They behave similarly within each group but very differently between groups. Genes over-expressed under genotoxic treatments but down-regulated under cytotoxic agents include TFS1, NTH1, ATG27 and the uncharacterized YMR090W. TFS1 is a carboxypeptidase Y inhibitor, which is targeted to vacuolar membranes during stationary phase and is involved in the protein kinase A signaling pathway. NTH1 is required for thermotolerance and may mediate resistance to other cellular stresses. The type I membrane protein ATG27 is involved in autophagy and the cytoplasm-to-vacuole targeting pathway. Gene YMR090W, whose function is unknown, should be treated with caution. GO term 0003674 is manually created for unknown molecular functions. Because our method utilizes the GO DAG structure, the identification of YMR090W may be caused by term 0003674 (unknown function) rather than by significant changes in mRNA levels. Further study of this gene is worthwhile.
Four genes, PUS2, CAX4, WSC4 and MLP2, are induced under cytotoxic stress but repressed under genotoxic stress. PUS2 is a mitochondrial tRNA:pseudouridine synthase, targeted to mitochondria and specifically dedicated to mitochondrial tRNA modification. In response to the decreased yeast viability and slow growth caused by cytotoxic stress, CAX4 is induced to increase the level of N-linked glycosylation. WSC4 is an ER membrane protein involved in the translocation of soluble secretory proteins and the insertion of membrane proteins into the ER, which plays an important role in the stress response. MLP2, a myosin-like protein associated with the nuclear envelope, connects the nuclear pore complex with the nuclear interior and is involved in the Tel1p pathway that controls telomere length.
Significant important genes distinguishing genotoxicity and cytotoxicity
Gene ID  | Gene Name        | Description
YGR180C* | RNR4             | Ribonucleotide-diphosphate reductase (RNR)
YLR178C* | TFS1             | Carboxypeptidase Y inhibitor
YDR001C* | NTH1             | Neutral trehalase, degrades trehalose
YJL178C* | ATG27            | Type I membrane protein
YMR090W* | unknown function |
YGL063W  | PUS2             | Mitochondrial tRNA:pseudouridine synthase
YGR036C  | CAX4             | Dolichyl pyrophosphate (DolPP) phosphatase
YHL028W  | WSC4             | ER membrane protein
YIL149C  | MLP2             | Myosin-like protein associated with the nuclear envelope
Comparison with PageRank
We also use PageRank to analyze each weighted bigraph under each treatment. It yields results very similar to those of KSD. Considering up-regulated genes for MMS, 85 of the top 100 genes ranked by PageRank coincide with the top 100 by KSD. For down-regulated genes in the MMS treatment, there are 77 common genes appearing in both top 100 lists by PageRank and KSD. The other compounds have a similar overlap in their top 100 lists. PageRank and KSD produce similar ranking lists for gene data, so why do we need KSD?
There are two major advantages of KSD over PageRank. First, PageRank needs a damping parameter to be specified. In empirical studies, a value of 0.85 (the default value in R) seems to give a good balance between convergence rate and stability in many applications. But there are some circumstances where 0.85 may be far from the "optimal" value. The choice of the damping parameter is a concern for PageRank and hence for GeneRank also. This is, however, not an issue for KSD if we use the Laplacian, complement Laplacian or pseudoinverse kernels. Second, since spatial depth is a robust measure of centrality, we expect that KSD will inherit this property and produce a more robust ranking. To demonstrate the robustness, we design the following experiment to compare the sensitivity of our approach and PageRank to incorrect annotations on artificial data.
Conclusion
The gene-function bigraph integrates molecular function annotations with gene expression data. The general relevance of genes is described in the graph (through a common function), and the edge weights of the graph are set to the gene expression responses. The resulting bigraph therefore carries more biological information than the gene data alone, so ranking on the bigraph may identify more biologically significant genes than ranking procedures based only on gene data. We also propose a new ranking algorithm for graphs based on the KSD measure. KSD balances the local and global topological structure of the graph and hence provides a meaningful ordering of its vertices. Experimental results on artificial data show that KSD is more robust than the well-known PageRank to incorrect annotations. The proposed method provides an exploratory framework for gene data analysis.
Notes
Acknowledgements
Support under National Science Foundation Grant DMS-0707074 is gratefully acknowledged by XD.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
References
 1. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001, 8: 37–52. doi:10.1089/106652701300099074
 2. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97: 77–87. doi:10.1198/016214502753479248
 3. Golub TR, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. doi:10.1126/science.286.5439.531
 4. Pepe MS, Longton G, Anderson GL, Schummer M: Selecting differentially expressed genes from microarray experiments. Biometrics 2003, 59: 133–142. doi:10.1111/1541-0420.00016
 5. Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. Journal of Computational Biology 2000, 7(6): 819–837. doi:10.1089/10665270050514954
 6. Storey JD, Tibshirani R: Statistical significance for genomewide experiments. Proceedings of the National Academy of Sciences USA (PNAS) 2003, 100: 9440–9445. doi:10.1073/pnas.1530509100
 7. Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 2001, 96: 1348–1360. doi:10.1198/016214501753382273
 8. Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK: Gene selection: a Bayesian variable selection approach. Bioinformatics 2003, 19: 90–97. doi:10.1093/bioinformatics/19.1.90
 9. Brown P, et al.: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences of the USA (PNAS) 2000, 97: 262–267. doi:10.1073/pnas.97.1.262
 10. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Machine Learning 2002, 46: 389–422. doi:10.1023/A:1012487302797
 11. Ding Y, Wilkins D: Improving the performance of SVM-RFE to select genes in microarray data. BMC Bioinformatics 2006, 7(Suppl 2): S12. doi:10.1186/1471-2105-7-S2-S12
 12. Furlanello C, Serafini M, Merler S, Jurman G: Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 2003, 4: 54.
 13. Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996, 58: 267–288.
 14. Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 2005, 67: 301–320. doi:10.1111/j.1467-9868.2005.00503.x
 15. Wang L, Zhu J, Zou H: Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 2008, 24: 412–419. doi:10.1093/bioinformatics/btm579
 16. Díaz-Uriarte R, de Andrés SA: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7: 3. doi:10.1186/1471-2105-7-3
 17. Mukherjee SN, Roberts SJ: A theoretical analysis of gene selection. Proceedings of IEEE Computational Systems Bioinformatics Conference (CSB) 2004, 131–141.
 18. Gentleman R, Irizarry RA, Carey VJ, Dudoit S, Huber W: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer; 2005.
 19. Guyon I, Elisseeff A: An introduction to variable and feature selection. The Journal of Machine Learning Research 2003, 3: 1157–1182. doi:10.1162/153244303322753616
 20. Lee MLT: Analysis of microarray gene expression data. Boston: Kluwer; 2004.
 21. Zadeh SFM, Morradi MH: An evaluation of genes ranking methods by ontology. Proceedings of 8th International Conference on Signal Processing 2006, 4: 16–20.
 22. Zhang B, Schmoyer D, Kirov S, Snoddy J: GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics 2004, 5: 16. doi:10.1186/1471-2105-5-16
 23. Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004, 5: R101. doi:10.1186/gb-2004-5-12-r101
 24. Alexa A, Rahnenführer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006, 22: 1600–1607. doi:10.1093/bioinformatics/btl140
 25. Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association. Bioinformatics 2007, 23(2): 257–258. doi:10.1093/bioinformatics/btl567
 26. Grossmann S, Bauer S, Robinson PN, Vingron M: An improved statistic for detecting overrepresented Gene Ontology annotations in gene sets. Proceedings of the Lecture Notes in Computer Science 2006, 85–98.
 27. Trajkovski I, Lavrač N, Tolar J: SEGS: Search for enriched gene sets in microarray data. Journal of Biomedical Informatics 2008, 41: 588–601. doi:10.1016/j.jbi.2007.12.001
 28. Morrison J, Breitling R, Desmond H, Gilbert D: GeneRank: Using search technology for the analysis of microarray experiments. BMC Bioinformatics 2005, 6: 233. doi:10.1186/1471-2105-6-233
 29. Ma X, Lee H, Wang L, Sun F: CGI: a new approach for prioritizing genes by combining gene expression and protein-protein interaction data. Bioinformatics 2007, 23(2): 215–221. doi:10.1093/bioinformatics/btl569
 30. Daigle BJ, Altman RB: M-BISON: Microarray-based integration of data sources using networks. BMC Bioinformatics 2008, 9: 214. doi:10.1186/1471-2105-9-214
 31. Srivastava S, Zhang L, Jin R, Chan C: A novel method incorporating gene ontology information for unsupervised clustering and feature selection. PLoS ONE 2008, 3(12). doi:10.1371/journal.pone.0003860
 32. Dhillon IS: Co-clustering documents and words using bipartite spectral graph partitioning. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD) 2001, 269–274.
 33. Zha HY, He XF, Ding C, Simon H, Gu M: Bipartite graph partitioning and data clustering. Proceedings of 10th International Conference on Information and Knowledge Management (CIKM) 2001, 25–32.
 34. Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002, 18(suppl 1): S136–S144.
 35. Serfling R: A depth function and a scale curve based on spatial quantiles. In Statistical Data Analysis Based on the L1-Norm and Related Methods. Edited by: Dodge D. 2002, 25–38.
 36. Dang X, Serfling R, Zhou W: Influence Functions of Some Depth Functions, with Application to L-Statistics. Journal of Nonparametric Statistics 2009, 21(01): 49–66. doi:10.1080/10485250802447981
 37. Ding Y, Dang X, Peng H, Wilkins D: Robust Clustering in High Dimensional Data Using Statistical Depths. BMC Bioinformatics 2007, 8(Suppl 7): S8. doi:10.1186/1471-2105-8-S7-S8
 38. Chen Y, Dang X, Peng H, Bart H: Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence 2009, 31(2): 288–305. doi:10.1109/TPAMI.2008.72
 39. Smola AJ, Kondor R: Kernels and Regularizations on Graphs. In Learning Theory and Kernel Machines. Berlin-Heidelberg: Springer Verlag; 2005.
 40. Ho N, Dooren P: On the pseudoinverse of the Laplacian of a bipartite graph. Applied Mathematics Letters 2005, 18(8): 917–922. doi:10.1016/j.aml.2004.07.034
 41. Agarwal A, Chakrabarti S: Learning random walks to rank nodes in graphs. 2007.
 42. Kondor R, Lafferty J: Diffusion kernels on graphs and other discrete input spaces. Proceedings of the 19th International Conference on Machine Learning (ICML) 2002, 315–322.
 43. Ando RK, Zhang T: Learning on graph with Laplacian regularization. Proceedings of Neural Information Processing Systems conference (NIPS) 2006, 25–32.
 44. Chung FRK: Spectral Graph Theory. In CBMS Regional Conference Series in Mathematics 92. American Mathematical Society; 1997.
 45. Khatri P, Drăghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18): 3587–3595. doi:10.1093/bioinformatics/bti565
 46. Caba E, Dickinson DA, Warnes GR, Aubrecht J: Differentiating mechanisms of toxicity using global gene expression analysis in Saccharomyces cerevisiae. Mutation Research 2005, 575: 34–46.
Copyright information
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.