Identifying stagespecific protein subnetworks for colorectal cancer
 2.3k Downloads
 2 Citations
Abstract
Background
In recent years, many algorithms have been developed for networkbased analysis of differential gene expression in complex diseases. These algorithms use proteinprotein interaction (PPI) networks as an integrative framework and identify subnetworks that are coordinately dysregulated in the phenotype of interest.
Motivation
While such dysregulated subnetworks have demonstrated significant improvement over individual gene markers for classifying phenotype, the current stateoftheart in dysregulated subnetwork discovery is almost exclusively limited to binary phenotype classes. However, many clinical applications require identification of molecular markers for multiple classes.
Approach
We consider the problem of discovering groups of genes whose expression signatures can discriminate multiple phenotype classes. We consider two alternate formulations of this problem (i) an allvsall approach that aims to discover subnetworks distinguishing all classes, (ii) a onevsall approach that aims to discover subnetworks distinguishing each class from the rest of the classes. For the onevsall formulation, we develop a setcover based algorithm, which aims to identify groups of genes such that at least one gene in the group exhibits differential expression in the target class.
Results
We test the proposed algorithms in the context of predicting stages of colorectal cancer. Our results show that the setcover based algorithm identifying "stagespecific" subnetworks outperforms the allvsall approaches in classification. We also investigate the merits of utilizing PPI networks in the search for multiple markers, and show that, with correct parameter settings, networkguided search improves performance. Furthermore, we show that assessing statistical significance when selecting features greatly improves classification performance.
Keywords
Mutual Information Classification Performance Discriminative Power Background Distribution Phenotype ClassIntroduction
Genomewide monitoring of mRNA expression, monitored using DNA microarrays and more recently deep sequencing, has proved quite useful in understanding the mechanistic bases of complex human diseases. Systematic analysis of differential gene expression in different phenotypic classes leads to identification of novel biomarkers, which serve as features for phenotype classification, as well as targets for therapeutic intervention. In previous studies, differential analysis of gene expression led to identification of biomarkers for a range of complex diseases, including Parkinson's disease [1], neuroblastoma [2], lung cancer [3] and breast cancer [4].
Traditional analyses generally take a univariate approach to study gene expression and identify genes with significant individual differential expression in the phenotype of interest. However, such univariate approaches are often limited in explaining the underlying mechanisms of complex diseases, which arise from the interplay among multiple genetic and environmental factors. For example, genes that cooperate or complement each other in pathogenesis may not necessarily be differentially expressed individually, but exhibit coordinated dysregulation when considered together.
In order to address the shortcomings of the univariate approaches, Chuang et al. develop an algorithm that integrates gene expression data with proteinprotein interaction (PPI) networks to identify reproducible breast cancer metastasis markers composed of multiple interacting proteins ("dysregulated subnetworks") [5]. They show that these subnetwork markers better predict breast cancer metastasis as compared to individually dysregulated genes. Motivated by the demonstrated promise of this approach, several other algorithms are developed for networkbased analysis of differential gene expression. In particular, Chowdhury et al. develop a setcover based heuristic for identification of genes that complement each other in discriminating phenotype and control samples [6]. Phuong et al. further improve on these algorithms by introducing a biclustering algorithm that also accounts for the noise in PPI networks by incorporating reliability scores for PPIs [7]. More recently, recognizing the shortcomings of greedy algorithms in identifying dysregulated subnetworks, Phuong et al. introduce a colorcoding based randomized algorithm to identify subnetworks that are highly discriminative of phenotype and control [8]. These methods are also extended to the identification of subnetwork expression signatures that can shed light into the regulatory logic of the relationship between the dysregulation of multiple genes and the disease phenoype. In particular, Chowdury et al. identify subnetworks whose combinatorial expression states are indicative of phenotype by using a branchandbound algorithm [9], Dutkowski et al. grow networkguided forests by training decision trees using interacting proteins [10].
All of the existing dysregulated subnetwork discovery algorithms are designed and validated for binary phenotype classes (e.g. cancerous vs. noncancerous, metastatic vs. nonmetastatic, drug responders vs. nonresponders) and prove to be promising in terms of accurate classification of samples. However, many progressive diseases such as glioblastoma, breast cancer and colorectal cancer require identification of molecular markers for multiple classes (such as the four stages in colorectal cancer according to Dukes' classification) for effective prognosis and treatment. This implies the necessity of a framework that can also work on datasets with more than two phenotype classes for networkbased discovery of disease markers. Although most of the existing algorithms can be applied to multiple phenotype classes in principle, no tool is readily available for this purpose. Furthermore, subnetwork discovery on multiclass datasets requires additional design choices and poses novel algorithmic challenges. These choices include designing criteria to evaluate the dysregulation of a subnetwork; i.e., are we interested in identifying subnetworks that can distinguish all classes from each other at once, or are we interested in identifying subnetworks that serve as indicators for specific classes. The algorithmic challenges, on the other hand, include unproportionately distributed samples across multiple classes. For these reasons, novel algorithms are needed that are robust and can work with datasets that are composed of different number of classes and sample distributions.
Contributions of this study
In this article, we introduce novel algorithms for networkbased analysis of differential gene expression on applications that involve multiple phenotype classes. As an important application, we focus particularly on identifying subnetworks that can discriminate different stages of human colorectal cancer (CRC) according to Dukes' classification. We first propose two formulations to generalize informationtheoretic measures of subnetwork dysregulation to multiple phenotype classes. These formulations differ in terms of how the target subnetworks discriminate phenotype classes from each other; namely we establish informationtheoretic criteria for allvsall and onevsall discriminative subnetworks. Then, we extend the setcover based algorithm by Chowdury et al. , NETCOVER, to identify onevsall discriminative subnetworks [6]. We also introduce a framework for assessing the statistical significance of the subnetworks identified by the setcover based algorithm. Using public CRC datasets composed of samples labeled with Dukes' four stages, we investigate the performance of the resulting algorithm, COBALT, in identifying subnetworks that are useful in predicting the stages of colon cancer samples. In particular we perform systematic computational experiments to investigate the following:

We compare the performance of allvsall and onevsall subnetworks in predicting phenotype and show that onevsall discriminative subnetworks are generally more reliable as features for classification.

We investigate the effect of using the PPI network to confine the space for searching groups of genes that are coordinately dysregulated subnetworks. We show that, while expansion of the search space through consideration of indirect interactions improve the classification performance of identified subnetworks, this improvement saturates after a point, demonstrating that PPI networks indeed provide a shortcut to the identification of dysregulated groups of genes. We also show that our efficient set cover based algorithm renders networkfree search feasible.

We investigate the effect of using statistically significant subnetworks (as opposed to highscoring subnetworks) as features for classification and show that assessment of statistical significance facilitates identification of more useful subnetwork features for classification.
In the next section, we start our discussion by proposing two alternate informationtheoretic formulations of subnetwork dysregulation. We also introduce our setcover based algorithm, COBALT, for the identification of onevsall discriminative subnetworks and propose methods for assessing the statistical significance of the identified subnetworks. Subsequently, in Results Section, we provide comprehensive experimental results on the classification performance of the subnetworks discovered by COBALT in predicting the stage of CRC on two gene expression datasets obtained from the Gene Expression Omnibus. We conclude the paper in Conclusion Section.
Methods
In this section, we start by introducing the mathematical background of the informationtheoretic formulation of coordinate dysregulation for a set of genes. Subsequently, we propose two alternate approaches for generalizing this notion to multiple phenotype classes. We then introduce COBALT, our setcover based algorithm that is specifically designed to identify stagespecific discriminative subnetworks. Finally, we introduce a framework for assessing the statistical significance of the identified subnetworks, and describe how these subnetworks can be utilized for classification of samples.
Dysregulation of subnetworks
For a given set $\mathcal{V}$ of genes and $\mathcal{U}$ of samples, let E_{ i } ∈ R^{u }represent the properly normalized gene expression vector for gene ${g}_{i}\in \mathcal{V}$, where E_{ i }(j) denotes the relative expression of g_{ i } in sample ${s}_{j}\in \mathcal{U}$. Assume that we have a set $\mathcal{T}$, composed of different classes for the phenotype of interest (such as the four stages in colorectal cancer according to Dukes classification) and the phenotype vector C annotates each sample with one of the labels in $\mathcal{T}$ , i.e., C(j) = t where t ∈ $\mathcal{T}$ .We also define the set of all samples for a specific phenotype class t as ${\mathcal{U}}^{\left(t\right)}=\left\{{s}_{j}\in \mathcal{U}:C\left(j\right)=t\right\}$.
Let $\mathcal{G}\left(\mathcal{V},\epsilon \right)$ denote a PPI network where the product of each gene g_{ i } ∈ $\mathcal{V}$ is represented by a node and each edge g_{ i }g_{ j } represents an interaction between the products of g_{ i } and g_{ j }. Given a PPI network and a gene expression dataset over multiple phenotype classes, we are interested in finding sets of genes that can together discriminate the phenotype classes with their gene expression signatures. In order to establish the functional relevance of these gene sets and search for these sets more efficiently, we confine the search space to PPI subnetworks, that is groups of proteins that are functionally interrelated through PPIs. Formally, a set $\mathcal{S}\subseteq \mathcal{V}$ of proteins is considered a subnetwork of interest if for all proteins g_{ i } ∈ $\mathcal{S}$, there is at least one other protein g_{ j } ∈ $\mathcal{S}$such that g_{ i } and g_{ j } are connected through at most ℓ hops in the PPI network. Here, ℓ is a parameter that adjusts the tradeoff between functional relevance and computational efficiency; a larger ℓ allows searching for functionally less related proteins at the cost of increasing the search space.
For a given subnetwork $\mathcal{S}\subseteq \mathcal{V}$, Chuang et al. define the subnetwork activity of $\mathcal{S}$as ${E}_{\mathcal{S}}={\sum}_{{g}_{i}\in \mathcal{S}}{E}_{i}/\sqrt{\left\mathcal{S}\right}$, that is the aggregate expression profile of the genes in $\mathcal{S}$[5]. Using subnetwork activity, they define an informationtheoretic measure to quantify the dysregulation of a subnetwork. This "additive" definition of dysregulation limits the framework to the identification of subnetworks with all genes in the subnetwork dysregulated in the same direction (i.e., all up or downregulated in the phenotype of interest), and alternate approaches that compute combinatorial expression signatures are shown to be more powerful [9, 10]. However, this additive formulation serves as a useful starting point to generalize subnetwork dysregulation to phenotypes that involve multiple classes. For this reason, we focus on additive subnetwork activity in this paper.
Allvsall discriminative power of a subnetwork
It is straightforward to generalize the informationtheoretic measure for the dysregulation of a subnetwork [5] to multiple phenotype classes. Namely, the mutual information between the subnetwork activity of S and the multiclass phenotype vector, i.e., Δ _{allvsall} (S) = I (E_{ S }, C) = H(C) − H(CE_{ S } ), provides a measure of the the reduction in the uncertainty about C given E_{ S }. Here, $H\left(X\right)={\sum}_{x\in \chi}p\left(x\right)\phantom{\rule{0.3em}{0ex}}\text{log}\left(p\left(x\right)\right)$ denotes the Shannon entropy of discrete random variable X that can take over values from the set $\mathcal{X}$ . In our case, the support set for the random variable C is $\mathcal{T}$, whereas the support set for the random variable E_{ S } is obtained by appropriately quantizing the expression levels.
Onevsall discriminative power of a subnetwork
Identifying onevsall discriminative subnetworks
The problems of identifying subnetworks with maximum ${\Delta}_{\mathsf{\text{allvsall}}}\left(S\right)$ or ${\Delta}_{\mathsf{\text{onevsall}}}\left(S\right)$ are intractable [5]. However, it is straightforward to generalize the greedy algorithm by Chuang et al. to solve both problems efficiently [5]. This greedy algorithm initializes a subnetwork with a single protein. It then grows the subnetwork by adding the protein in the neighborhood of the subnetwork (i.e., reachable from the subnetwork with ℓ hops) that improves the objective function (${\Delta}_{\mathsf{\text{allvsall}}}\left(S\right)$ or ${\Delta}_{\mathsf{\text{onevsall}}}\left(S\right))$ the most. The algorithm stops either when there is no more protein in the neighborhood to add, or the best improvement provided by a protein in the neighborhood is below a userdefined threshold.
While the explained greedy algorithm is quite effective in efficiently discovering highscoring subnetworks, it has several drawbacks [6]. First, this algorithm is biased toward identifying subnetworks with very few proteins that exhibit high dysregulation individually. This is because the algorithm lacks global awareness, i.e., it will stop expanding the subnetwork when the best candidate protein to add to the subnetwork has only marginal individual contribution, but may actually contribute a greater deal when additonal proteins are added. Second, this approach requires computation of mutual information for each and every candidate protein in the neighbourhood to be added to the growing subnetwork, which may prove to be costly when the algorithm needs to be run multiple times to assess statistical significance of identified subnetworks. Motivated by these observations, Chowdhury et al. develop a setcover based algorithm, NETCOVER, which is more effective in discovering proteins that complement each other in discriminating phenotype and control samples [6]. However, NETCOVER is designed for binary phenotype classes and it assumes that the samples are paired. Here, we argue that the algorithmic insights introduced by NETCOVER suit particularly well to the identification of onevsall discriminative subnetworks. Based on this observation, we develop COBALT, which generalizes NETCOVER to handle unpaired samples and multiple phenotype classes to identify onevsall discriminative subnetworks.
COBALT:Coverbased algorithm for identifying onevsall discriminative subnetworks
Recall that a onevsall discriminative subnetwork is defined as one with differential subnetwork activity in a specific phenotype class, as compared to all other classes. Since subnetwork activity is defined regularly, the genes in such a subnetwork have to be either all upregulated or all downregulated in the phenotype class of interest. Motivated by this observation, COBALT aims to identify subnetworks such that for each sample that belongs to the phenotype class of interest, there exists at least one gene in the subnetwork that is upregulated (or downregulated) in that sample. Such subnetworks are said to "cover" the entire patient population that represents the phenotype class of interest.
 1.
Initialize subnetwork: S_{ i } ← {g_{ i }}
 2.
Define the target phenotype class t for this subnetwork as the class that has the maximum fraction of samples positively covered by g_{ i }: t ← argmax_{ t }′ _{∈}$\mathcal{T}$$\left\{\left{\mathcal{P}}_{i}^{\left({t}^{\prime}\right)}\right/\left{\mathcal{U}}_{i}^{\left({t}^{\prime}\right)}\right\right\}$
 3.
Initialize the set of uncovered samples for class t: ${\mathcal{M}}^{\left(t\right)}\leftarrow {\mathcal{U}}^{\left(t\right)}\backslash {\mathcal{P}}_{i}^{\left(t\right)}$
 4.
Initialize the set of network neighbors: $\mathcal{Q}\leftarrow \left\{{g}_{j}\in \mathcal{V}:\delta \left({g}_{i},{g}_{j}\right)\le \ell \right\}$
 5.
For all genes g_{ j } ∈ $\mathcal{Q}$, compute ${\left({\mathcal{P}}_{j}^{\left(t\right)}\right)}^{\prime}\leftarrow {\mathcal{P}}_{j}^{\left(t\right)}\cap {\mathcal{M}}^{\left(t\right)}\mathsf{\text{and}}{\left({\mathcal{N}}_{j}^{\left(t\right)}\right)}^{\prime}\leftarrow {\mathcal{N}}_{j}^{\left(t\right)}\cap {\mathcal{M}}^{\left(t\right)}$
 6.
Find all the genes (can be multiple) in $\mathcal{Q}$ with maximum $\left{\left({\mathcal{P}}_{j}^{\left(t\right)}\right)}^{\prime}\left\right{\left({\mathcal{N}}_{j}^{\left(t\right)}\right)}^{\prime}\right$ and let g_{ k } be the gene among these genes with minimum ${\sum}_{{}_{{}_{{t}^{\prime}\in \mathcal{T}}\backslash \left\{t\right\}}}\left{\mathcal{P}}_{k}^{{\left(t\right)}^{\prime}}\right$(i.e., g_{ k } has minimum positive background coverage).
 7.
Expand the subnetwork with g_{ k }: S_{ i } ← S_{ i } ∪ {g_{ k }}
 8.
Update the set of uncovered positive samples for class t: ℳ ^{(t) }← ℳ ^{(t) }\ $\mathcal{P}$_{ k }^{(t)}
 9.
Update set of neighbouring genes: $\mathcal{Q}$ ← $\mathcal{Q}$ ∪ {g_{ j } ∈ $\mathcal{V}$ : δ(g_{ k }, g_{ j }) ≤ ℓ} \ {g_{ k }}
 10.
If $\mathcal{Q}$ = ∅ or M^{(t) }= ∅, return S_{ i }; otherwise, go to step (5).
This algorithm is also used for identifying the minimal negative covering subnetwork seeded at g_{ i } by simply replacing $\mathcal{Q}$ with $\mathcal{P}$ above.
Assessing statistical significance of subnetworks
In order to asses the significance of the identified subnetworks, we perform two distinct significance tests. Each significance test is performed by generating an empirical background distribution that carefully accounts for multiple hypothesis testing. The first background distribution is obtained by randomly permuting the class labels. The second background distribution, on the other hand, is obtained by permuting the gene expression profiles (the rows of the gene expression matrix, thereby randomly reestablishing the relationship between the expression profiles and the nodes in the PPI network ). After generating a large number of these randomized datasets, we use COBALT to identify classspecific subnetworks for the randomized datasets as well. We then use these subnetworks as the background distribution to test the statistical significance of the discriminative power of the stagespecific subnetworks identified on the actual dataset. This approach implicitly handles multiple hypothesis testing, since the background distribution is constructed using the most discriminative subnetworks that could be identified on each randomized dataset.
Note also that the cover provided by a subnetwork for a target phenotype class depends on the size of the subnetwork (i.e., the number of proteins in the subnetwork). In other words, if we construct subnetworks at random, we would expect larger subnetworks to have a higher coverage. Furthermore, in our experiments, we also observe that larger subnetworks tend to have higher discriminative power (Δ ). Motivated by these insights, we assess the statistical significance of a subnetwork as a function of its size. For this purpose, we stratify the subnetworks that compose each background distribution according to subnetwork size and compute the pvalue of a subnetwork S by comparing Δ(S) to the discriminative power of the background subnetworks that have similar size to the subnetwork of interest. More precisely, the pvalue of S is defined as the fraction of subnetworks with discriminative power greater than that of S among all subnetworks in the background set with size equal to that of S. A minimal covering subnetwork S discovered by COBALT is considered to be statistically significant if its pvalue is less than the significance threshold for both background populations.
Using identified subnetworks for classification
 1.
We first identify both the positive and negative covering subnetworks for each gene g_{ i } ∈ $\mathcal{V}$.
 2.In order to investigate the effect of statistical significance on the classification utility of subnetworks, we use two alternate strategies to extract a list of features from the sets of covering subnetworks found in (1):
 (a)
The first approach assumes that highscoring subnetworks are more useful for classification, as compared to significantly discriminative subnetworks. For each phenotype class t ∈ $\mathcal{T}$, we sort the subnetworks based on their allvsall $\left({\Delta}_{\mathsf{\text{all}}\mathsf{\text{vs}}\mathsf{\text{all}}}\left(S\right)\right)$ or onevsall $\left({\Delta}_{\mathsf{\text{all}}\mathsf{\text{vs}}\mathsf{\text{all}}}\left(S\right)\right)$ discriminative power. We then choose the top k positive and negative covering subnetworks for each phenotype class, giving us a total of 2*k*$\mathcal{T}$ features to be used in classification. Here, k is a user defined parameter to set the number of stagespecific subnetworks that are used for each class.
 (b)
The second approach assumes that assessment of statistical significance will facilitate selection of biologically more meaningful subnetworks, also providing more power in classification as compared to highscoring subnetworks. For this purpose, using the two proposed statistical significance tests discussed in the previous section, we identify subnetworks that are signficant according to both statistical tests and use all of these significantly discriminative subnetworks as features for classification.
 3.
Once we obtain a final list of features either using (a) or (b) at step (2), we compute the aggregate expression profiles (E_{ S } ) for each of these selected subnetworks and use these to construct feature vectors for each sample (where each feature represents the aggreate expression of one subnetwork).
 4.
Finally, we use these feature vectors to train and test classifiers for predicting the class of the phenotype of interest.
 (a)
Results and discussion
In this section, we first give brief information about the colorectal cancer in human (CRC) and introduce the two stagespecific CRC datasets and the PPI network we use in our experiments. Subsequently, we describe in detail the experimental framework used. After introducing the performance evaluation metrics used, we present our experimental results comparing onevsall and allvsall discriminative subnetworks, as well as the additive and setcover based algorithms that are used to discover these subnetworks. Next, we analyze the effect of the network distance parameter (ℓ) that adjusts the search space size when growing the subnetworks. Finally, we compare the performance of highscoring and statistically significant subnetworks in predicting the stages of samples.
Human colorectal cancer
Colorectal Cancer (CRC) is one of the most common causes of cancer related deaths in the western civilization [12]. Diagnosis of CRC is often difficult as the symptoms appear only at the advanced stages of the disease. Moreover, early diagnosis is very critical as the survival rate changes dramatically with the stage of the cancer. In fact, 5 year survival rates when diagnosis is made at the localized stage (cancer is confined in the primary site) and after cancer has metastasized are around 90% and 12% respectively [13]. These observations suggest that, for effective diagnosis, prognosis and treatment, accurate determination of disease stage is crucial.
There are different classification systems for the progression of colorectal cancer. Dukes' famous staging system classifies patients based on how far the cancer is spread [14]. TNM is another staging method providing a more comprehensive framework including information about the size and localization of the tumor, as well as the involvement of lymph nodes [15]. CRC Datasets we use in our experiments are classified by Dukes' staging system.
Datasets
We use two CRC microarray datasets obtained from the Gene Expression Omnibus [16] in our experiments. These datasets that contain labeled CRC samples with Dukes' 4stage classification are the following:

GSE14333 contains the expression profiles of 54675 genes in 290 samples.

GSE5206 contains the expression profiles of the same 54675 genes in 98 samples.
Number of samples labeled with each colorectal cancer stage based on Dukes' 4stage classification in the datasets used in our experiments.
stage  A  B  C  D  total 

GSE14333  44  94  91  61  290 
GSE5206  12  32  33  21  98 
The human proteinprotein interaction data used in our experiments is obtained from NCBI Entrez Gene Database [17]. This database integrates interaction data from several other databases available, such as HPRD, BioGrid, and BIND. We remove the nodes with no interactions to obtain a final PPI network that contains 8959 proteins and 33,528 interactions among these proteins.
Experimental design
COBALT is fully implemented in Matlab. We use this implementation to perform the following classification experiments:

Prediction of disease stage in GSE14333. Subnetworks discovered using GSE14333 are used to predict the stages of samples in the same dataset in a 10fold cross validation setting. Samples in each phenotype class are randomly separated into ten similarsized groups. In each iteration, one of the groups in each class is chosen to be the test data and the rest of the data is used to train the classifier.

Prediction of disease stage in GSE5206. Subnetworks discovered using GSE14333 are used to predict the stages of samples in GSE5206. In this crossclassification setting, the classifier is trained on GSE14333 and tested on the other dataset, GSE5206.
For both of these settings, we use a naive Bayesian classifier provided by Matlab's classify function. Using other classifier options provided by Matlab's classifier procedure only marginally effects the results (data not shown).
Classification performance
In order to obtain a comprehensive picture of the performance of different approaches, we list the precision and recall of the classification experiments for each phenotype class separately. Precision refers to the percentage of the correct predictions over all samples predicted belonging to the respective CRC stage, whereas recall refers to the percentage of the correctly predicted samples over all samples that are clinically diagnosed to belong the respective CRC stage. Please note that we set the network distance parameter ℓ = 3 in all experiments unless otherwise noted, since it provides the best performance as shown in the next section. When quantizing the expression values of a gene over all samples using Equation 2, we set α = 2 as the threshold parameter for the gene's dysregulation in a sample to be considered positively or negatively covering.
We compare COBALT with our implementation of the two additive greedy approaches explained in detail in the Methods section, namely additive_{ ova } and additive_{ ava }. additive_{ ova } and additive_{ ava } refer respectively to the algorithms that aim to identify onevsall and allvsall subnetworks by greedily maximizing the discriminative power of the subnetworks $\left({\Delta}_{\mathsf{\text{one}}\mathsf{\text{vs}}\mathsf{\text{all}}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}{\Delta}_{\mathsf{\text{allvsall}}}\right).$ In the first set of experiments, we use the 10fold cross validation framework for prediction of disease stage of samples in GSE14333, using the highscoring subnetworks extracted as features from the same dataset, i.e., we use the top scoring positive and negative subnetworks for each stage in COBALT (setting k = 1), top 2 subnetworks for each stage in additive_{ ova } and top 8 subnetworks for additive_{ ava } as features (a total of 8 features used in each method).
The effect of the PPI network on classification performance
In this section, we discuss the effect of the PPI network in the classification performance of subnetworks identified by COBALT. Since the use of the PPI network confines the search space to functionally related groups of proteins, these experiments provide insights into whether these functional constraints also improve the biological reproducibility and the utility of identified stagespecific subnetworks. For this purpose, we systematically evaluate the classification performance of the subnetworks for varying network distance parameter (ℓ) that adjusts the search space size when growing the subnetworks using COBALT. We also compare the subnetworks identified by the networkguided algorithm with groups of genes that are identified by using the same algorithm in a networkfree fashion. The setcover based algorithm implemented by COBALT is quite efficient, therefore a networkfree search for stagespecific groups of proteins is feasible.
The effect of using statistically significant subnetworks on classification performance
Contingency tables for prediction of CRC stages of samples in GSE14333.
Predicted Classes  

stage  A  B  C  D  recall  
Actual Classes  A  37  3  0  4  0.84 
B  3  74  11  6  0.78  
C  5  5  77  4  0.84  
D  4  7  3  47  0.82  
precision  0.75  0.83  0.84  0.77  
Predicted Classes  
stage  A  B  C  D  recall  
A  38  1  2  3  0.86  
Actual Classes  B  4  75  8  7  0.79 
C  3  2  83  3  0.91  
D  2  3  4  52  0.85  
precision  0.80  0.92  0.85  0.80 
Contingency table for prediction of CRC stages of samples in GSE5206 using the statistically significant features identified from GSE14333.
Predicted Classes  

stage  A  B  C  D  recall  
A  9  2  1  0  0.75  
Actual Classes  B  3  18  8  3  0.56 
C  5  3  20  5  0.60  
D  4  3  6  8  0.38  
precision  0.42  0.69  0.57  0.50 
Conclusions
In this article, we have proposed two alternate formulations of the discriminative power of subnetworks when working on multiclass phenotypes, namely, onevsall and allvsall. We then introduced our coverbased algorithm for networkguided disease marker discovery, for identifying subnetworks with onevsall discriminative power. Moreover, we have introduced a framework for assessing the statistical significance of the identified subnetworks. Systematic experiments on real multistaged CRC datasets show that the proposed algorithm outperforms the additive algorithms in terms of providing higher precision and recall in prediction of sample stages. The efficient implementation of the coverbased algorithm enabled us to show that using statistically significant subnetworks as features improves classification performance compared to using same number of highscoring subnetworks (in terms of mutual information with respect to the phenotype vector). We have also shown that guiding the subnetwork discovery search with the PPI network identifies subnetworks that are more informative (in terms of classification power) than the networks identified without the PPI network. We have also investigated the impact of different values of the network distance parameter, ℓ, and concluded that using ℓ = 3 is the most reasonable choice in terms of both classification performance and computational efficiency.
Notes
Acknowledgements
We would like to note the contribution of anonymous reviewers whose queries and suggestions have helped improve this article significantly. MK and SE are supported in part by the National Science Foundation grant CCF0953195. MK and MRC are also supported in part by the National Institutes of Health grant R01HL106798. SE is also supported in part by Choose Ohio First Scholarship. JSB and XG are supported by Case Comprehensive Cancer Center Core Grant (5P30 CA043703).
This article has been published as part of BMC Proceedings Volume 6 Supplement 7, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/6/S7.
Supplementary material
References
 1.Scherzer CR, Eklund AC, Morse LJ, Liao Z, Locascio JJ, Fefer D, Schwarzschild MA, Schlossmacher MG, Hauser MA, Vance JM, Sudarsky LR, Standaert DG, Growdon JH, Jensen RV, Gullans SR: Molecular markers of early Parkinson's disease based on gene expression in blood. Proceedings of the National Academy of Sciences. 2007, 104 (3): 955960. 10.1073/pnas.0610204104. [http://www.pnas.org/content/104/3/955.abstract]CrossRefGoogle Scholar
 2.Cheung IY, Feng Y, Gerald W, Cheung NKV: Exploiting gene expression profiling to identify novel minimal residual disease markers of neuroblastoma. Clinical Cancer Research. 2008, 14 (21): 70207027. 10.1158/10780432.CCR080541.PubMedCentralCrossRefPubMedGoogle Scholar
 3.Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, Lin L, Chen G, Gharib TG, Thomas DG, Lizyness ML, Kuick R, Hayasaka S, Taylor JM, Iannettoni MD, Orringer MB, Hanash S: Geneexpression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002, 8 (8): 816824. [http://dx.doi.org/10.1038/nm733]PubMedGoogle Scholar
 4.Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, BorresenDale AL, Brown PO, Botstein D: Molecular portraits of human breast tumours. Nature. 2000, 406 (6797): 747752. 10.1038/35021093. [http://dx.doi.org/10.1038/35021093]CrossRefPubMedGoogle Scholar
 5.Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Networkbased classification of breast cancer metastasis. Mol Syst Biol. 2007, 3 (140): [http://dx.doi.org/10.1038/msb4100180]Google Scholar
 6.Chowdhury A Salim, Koyutürk M: Identification of coordinately dysregulated subnetworks in complex phenotypes. Pacific Symposium on Biocomputing (PSB'10). 2010, 133144. [http://dx.doi.org/10.1142/9789814295291_0016]Google Scholar
 7.Dao P, Colak R, Salari R, Moser F, Davicioni E, Schanhuth A, Ester M: Inferring cancer subnetwork markers using densityconstrained biclustering. Bioinformatics. 2010, 26 (18): i625i631. 10.1093/bioinformatics/btq393. [http://bioinformatics.oxfordjournals.org/content/26/18/i625.abstract]PubMedCentralCrossRefPubMedGoogle Scholar
 8.Dao P, Wang K, Collins C, Ester M, Lapuk A, Sahinalp SC: Optimally discriminative subnetwork markers predict response to chemotherapy. Bioinformatics. 2011, 27 (13): i205i213. 10.1093/bioinformatics/btr245. [http://bioinformatics.oxfordjournals.org/content/27/13/i205.abstract]PubMedCentralCrossRefPubMedGoogle Scholar
 9.Chowdhury SA, Nibbe RK, Chance MR, Koyutürk M: Subnetwork state functions define dysregulated subnetworks in cancer. Journal of Computational Biology. 2011, 18 (3): 263281. 10.1089/cmb.2010.0269. [http://dx.doi.org/10.1089/cmb.2010.0269]PubMedCentralCrossRefPubMedGoogle Scholar
 10.Dutkowski J, Ideker T: Protein networks as logic functions in development and cancer. PLoS Comput Biol. 2011, 7 (9): e100218010.1371/journal.pcbi.1002180. [http://dx.doi.org/10.1371%2Fjournal.pcbi.1002180]PubMedCentralCrossRefPubMedGoogle Scholar
 11.Chvatal V: A greedy heuristic for the setcovering problem. Mathematics of Operations Research. 1979, 4 (3): 233235. 10.1287/moor.4.3.233. [http://dx.doi.org/10.2307/3689577]CrossRefGoogle Scholar
 12.Macdonald F, Ford C, Casson A: Molecular biology of cancer. 2004, Advanced text, BIOS Scientific Publishers, [http://books.google.com/books?id=SClAlGpFRIC]Google Scholar
 13.Howlader N, Noone A, Krapcho M, Neyman N, Aminou R, Waldron W, Altekruse S, Kosary C, Ruhl J, Tatalovich Z, Cho H, Mariotto A, Eisner M, Lewis D, Chen H, Feuer E, Cronin K, Edwards Be: SEER cancer statistics review, 19752008. 2010, [http://seer.cancer.gov/csr/1975_2008/]Google Scholar
 14.Dukes CE: The classification of cancer of the rectum. The Journal of Pathology and Bacteriology. 1932, 35 (3): 323332. 10.1002/path.1700350303. [http://dx.doi.org/10.1002/path.1700350303]CrossRefGoogle Scholar
 15.Sobin LH, Gospodarowicz MK, Wittekind C: TNM  classification of malignant tumours. 2010Google Scholar
 16.Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets10 years on. Nucleic Acids Research. 2010, 39 (Database): D1005D1010.PubMedCentralCrossRefPubMedGoogle Scholar
 17.Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: genecentered information at NCBI. Nucl Acids Res. 2007, 35 (suppl1): D2631. [http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/D26]PubMedCentralCrossRefPubMedGoogle Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.