Dissecting Pathway Disturbances Using Network Topology and Multi-platform Genomics Data

  • Yuping Zhang
  • M. Henry Linder
  • Ali Shojaie
  • Zhengqing Ouyang
  • Ronglai Shen
  • Keith A. Baggerly
  • Veerabhadran Baladandayuthapani
  • Hongyu Zhao

Abstract

Complex diseases such as cancers usually result from accumulated disturbance of pathways instead of the disruptions of one or a few major genes. As opposed to single-platform analyses, it is likely that integrating diverse molecular regulatory elements and their interactions can lead to more insights on pathway-level disturbances of biological systems and their potential consequences in disease development and progression. To explore the benefit of pathway-based analysis, we focus on multi-platform genomics, epigenomics, and transcriptomics (-omics, for short) from 11 cancer types collected by The Cancer Genome Atlas project. Specifically, we use a well-studied oncogenic pathway, the BRAF pathway, to investigate the relevant copy number variants (CNVs), methylations, and gene expressions, and quantify their effects on discovering tumor-specific aberrations across multiple tumor lineages. We also perform simulation studies to further investigate the effects of network topology and multiple omics on dissecting pathway disturbances. Our analysis shows that adding molecular regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve our power of discovering tumorous aberrances. Also, incorporating CNVs with the baseline mRNA molecules can be more beneficial than incorporating methylations. Moreover, employing regulatory topologies can improve the discoveries of tumorous aberrances. Finally, our analysis reveals similarities and differences among diverse cancer types based on disturbance of the BRAF pathway.

Keywords

Data integration; Multi-platform genomics; Network topology; Pathway analysis

1 Introduction

Human diseases such as cancers have very complex etiologies involving numerous molecular and environmental factors and their interactions. They are usually the consequence of accumulated disturbances of pathways. Recent advances in melanoma research indicate that alterations of the BRAF pathway are among the most prevalent genetic changes in the development and progression of melanoma [3]. Systematic analysis of the BRAF pathway and its activation in melanomas has stimulated drug development, including inhibitors of RAF/MEK signaling. For instance, vemurafenib, a drug known to be effective in melanoma patients, targets the BRAF mutation. A recent clinical trial shows that vemurafenib can be effective across multiple cancer types beyond melanoma [4]. Thus, systematically delineating BRAF pathway aberrations over multiple cancer types, using data from multiple molecular resolutions, will likely facilitate the treatment of cancers that have been less studied so far.

Methodologically, extensive research efforts have been devoted to assessing the accumulated effects of a set of genes that can be grouped together based on certain criteria, such as sharing a common biological function, chromosomal location, or regulation [5, 6, 8]. These existing pathway/gene-set-based methods are complementary to single-gene analyses. For example, Subramanian et al. [11] proposed gene set enrichment analysis (GSEA), which uses a signed Kolmogorov–Smirnov statistic as the set-level statistic based on the distributions of the gene-level test statistic with and without the gene set of interest. Efron and Tibshirani [2] introduced a “maxmean” statistic to detect significant gene sets. These methods and their variants [12] employ a permutation-based procedure to obtain p values of gene sets. Some network-based pathway enrichment analysis methods have been proposed to leverage the information in the network structure (topology). For instance, Shojaie and Michailidis [9] proposed a gene set analysis method based on the underlying regulatory network, and extended this framework to the scenario with incomplete network information [7], as well as to the analysis of complex experiments [10]. Specifically, this framework, named network-based gene set analysis (NetGSA), combines the ideas of gene-set analysis methods and network-based single-gene analysis. The existing pathway-based approaches can potentially be improved by simultaneously leveraging a priori biological network information and integrating different types of omic data.

In this paper, we investigate the disturbances of the BRAF pathway in multiple cancer types compared with normal tissues, leveraging different types of omic data, including transcriptomic (mRNA), genomic (copy number), and epigenomic (methylation) data, and incorporating the underlying BRAF pathway topology. Through such analyses, we hope to identify the similarities and differences among multiple cancer types based on disturbance of the BRAF pathway. To investigate the effects of integrating diverse omic data and incorporating network topology in the analysis, we compare approaches with and without these aspects and quantify the effects. Specifically, our approach, denoted by EMC-NetGSA, simultaneously considers a priori biological network information and integrates the three types of omic data. Analogously, the approach that integrates only transcriptomic and genomic data is named EC-NetGSA, while the approach integrating transcriptomic and epigenomic data is named EM-NetGSA. The standard NetGSA approach [9] and GSEA [14] were performed using transcriptomic data only, and are denoted by NetGSA and Gene-Set, respectively. We found that leveraging a priori biological network information and integrating different types of omic data is beneficial for discovering tumor-specific pathway disturbances.

The rest of the paper is organized as follows. In Sect. 2, we first describe the biological context and then introduce the analysis approaches. In Sect. 3, we present our results and illustrate the biological insights gained from our analysis. In Sect. 4, we further investigate the effects of network topology and multiple omic data on dissecting pathway disturbances through simulations. We conclude the paper with a discussion in Sect. 5.
Table 1

Sample size of each cancer type for tumorous and normal tissues, respectively

Cancer type    Normal sample size    Tumor sample size
BLCA           15                    406
BRCA           64                    759
COAD           19                    276
HNSC           20                    513
KIRC           24                    313
KIRP           21                    273
LIHC           39                    364
LUAD           20                    448
PRAD           35                    491
THCA           48                    502
UCEC           22                    172

2 Materials and Methods

2.1 BRAF Pathway and Multi-platform Genomics Data

We consider the cancer types where both tumorous and the corresponding normal tissues are available in The Cancer Genome Atlas (TCGA) [13] database. As for omic data, we consider methylation, copy number variant (CNV), and mRNA measurements, and only analyze cancers that have more than nine samples with methylation, CNV, and mRNA measurements for both tumorous and normal conditions. This resulted in 11 cancer types as summarized in Table 1, including bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), prostate adenocarcinoma (PRAD), thyroid carcinoma (THCA), and uterine corpus endometrial carcinoma (UCEC).

We focus on 10 genes, “NRAS,” “RAF1,” “BRAF,” “MAP2K1,” “MAP2K2,” “MAPK1,” “PIK3CA,” “PTEN,” “AKT1,” and “MTOR,” that have been identified earlier as being closely linked to the behavior of the oncogene BRAF. The mRNAs of these genes serve as the skeleton of the BRAF pathway in our analysis. We also incorporate regulations from the upstream CNVs and methylations to the corresponding downstream gene expression. The resulting network topology is shown in Fig. 1, using KIRP as an illustration.
Fig. 1

Weighted networks for KIRP in both tumorous and normal tissues. The width of an edge is proportional to the absolute value of the calculated partial correlation. The color and style of an edge indicate the sign of the partial correlation: a blue dashed line indicates negative correlation, which reflects inhibitive regulation, while a red solid line indicates positive correlation, which reflects active regulation. Large white nodes indicate mRNAs, small black rectangles indicate CNVs, and small gray rectangles indicate methylations

2.2 Methods

2.2.1 EMC-NetGSA: Integrated NetGSA for mRNAs, CNVs, and Methylations

In this approach, we adapt and extend the NetGSA framework [10] using all three types of omics data, i.e., mRNAs, CNVs, and methylations, based on the BRAF pathway interactions with the three types of regulations illustrated in Fig. 1 using KIRP. This figure shows the weighted networks for KIRP in both tumorous and normal tissues. The width of an edge is proportional to the absolute value of the calculated partial correlation, and its color indicates the sign of the partial correlation. A blue dashed line indicates negative correlation, which reflects inhibitive regulation, while a red solid line indicates positive correlation, which reflects active regulation. Large white nodes indicate mRNAs, small black rectangles indicate CNVs, and small gray rectangles indicate methylations. The regulation patterns differ between tumorous and normal tissues, as illustrated by KIRP in Fig. 1.

The basic idea behind the NetGSA method is to combine gene-set analysis methods with network-based single-gene analysis. It can be used to test for changes in the overall behavior of arbitrary subnetworks or pathways. Specifically, the EMC-NetGSA approach considers \(p=p_\mathrm{e} + p_\mathrm{c} + p_\mathrm{m}\) biomolecules (including \(p_\mathrm{e}\) mRNAs, \(p_\mathrm{c}\) CNVs and \(p_\mathrm{m}\) methylations in the current setting), whose measurements across n samples are organized in a \(p \times n\) data matrix \(\varvec{D}.\) Let \(\varvec{Y}_j^{(k)}\,(j=1,\ldots , n;\,k\in \{C,\, T\}\)) be the jth sample in the data matrix under condition k, with the first \(n_1\) columns of \(\varvec{D}\) corresponding to condition C (normal control) and the last \(n_2 = n-n_1\) columns corresponding to condition T (cancer type of interest). The adjacency matrix of the network, denoted by \(\varvec{A},\) includes interaction relationships across all three types of regulatory elements indicated in Fig. 1. The network topology is incorporated into the analysis through an influence matrix \(\varvec{\varLambda },\) which is calculated from the corresponding adjacency matrix \(\varvec{A}.\) For any directed acyclic graph, \(\varvec{\varLambda } = (\varvec{I}-\varvec{A}^\mathsf {T})^{-1}\) as shown in [9]. In our analysis, we employ a weighted adjacency matrix with partial correlations as the weights. We use KIRP as an example to show the weighted networks in both tumorous and normal tissues (Fig. 1).
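As an illustration of how partial-correlation weights might be obtained, the sketch below reads full-order partial correlations off the inverse of the sample covariance matrix. The text does not state which estimator was used, so the function name `partial_correlations` and the choice of estimator are assumptions made only for this example.

```python
import numpy as np

def partial_correlations(X):
    """Full-order partial correlations from an (n_samples, p) data matrix.

    Assumption: the inverse sample covariance is used, which requires
    n_samples > p. This is not necessarily the estimator used in the paper.
    """
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)   # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Hypothetical chain x -> y -> z: x and z are conditionally independent given y.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(size=5000)
z = 0.8 * y + rng.normal(size=5000)
pcor = partial_correlations(np.column_stack([x, y, z]))
```

In this toy chain the entry for (x, z) should be close to zero while the entries for (x, y) and (y, z) stay clearly positive, which is the behavior one would want from an edge weight that reflects direct regulation only.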
Fig. 2

Model illustration through a toy example. \(c_1,\, c_2\) indicate CNVs, \(m_1,\, m_2\) indicate methylations, and \(x_1\) and \(x_2\) indicate mRNAs. \(c_1 = r_{\mathrm{c}1};\, c_2 = r_{\mathrm{c}2};\, m_1 = r_{\mathrm{m}1};\, m_2= r_{\mathrm{m}2};\, x_1 = \rho _{\mathrm{c}1} r_{\mathrm{c}1} + \rho _{\mathrm{m}1} r_{\mathrm{m}1} + r_1;\, x_2 = \rho _{12} \rho _{\mathrm{c}1} r_{\mathrm{c}1} + \rho _{12} \rho _{\mathrm{m}1} r_{\mathrm{m}1} + \rho _{12} r_1 + \rho _{\mathrm{c}2} r_{\mathrm{c}2} + \rho _{\mathrm{m}2} r_{\mathrm{m}2} + r_2 \)

We consider a latent variable model, which is illustrated by a toy example in Fig. 2 with two genomic, two epigenomic, and two transcriptomic nodes. Ignoring the condition label k, let \(\varvec{r}\) denote the baseline latent variable vector of regulatory nodes [assuming \(\varvec{r}\sim N(\varvec{\mu },\,\sigma _r^2 \varvec{I}_p)\)], and let \(\rho \) denote a weight in the adjacency matrix. The underlying genomic signals for CNVs are \(c_1 = r_{\mathrm{c}1}\) and \(c_2 = r_{\mathrm{c}2}.\) Similarly, the underlying epigenetic signals for methylations are \(m_1 = r_{\mathrm{m}1}\) and \(m_2 = r_{\mathrm{m}2}.\) The transcriptomic signals for mRNAs are \(x_1 = \rho _{\mathrm{c}1} r_{\mathrm{c}1} + \rho _{\mathrm{m}1} r_{\mathrm{m}1} + r_1\) and \(x_2 = \rho _{12} \rho _{\mathrm{c}1} r_{\mathrm{c}1} + \rho _{12} \rho _{\mathrm{m}1} r_{\mathrm{m}1} + \rho _{12} r_1 + \rho _{\mathrm{c}2} r_{\mathrm{c}2} + \rho _{\mathrm{m}2} r_{\mathrm{m}2} + r_2.\) The adjacency matrix \(\varvec{A}\) characterizes the regulatory effects of each node on its immediate neighbors, while the influence matrix represents the propagated effect of each node on all other nodes in the network. In the order of \(c_1,\, m_1,\, x_1,\, c_2,\, m_2\), and \(x_2,\) we have the following \(\varvec{A}\) and \(\varvec{\varLambda }{\text {:}}\)
$$\begin{aligned} \varvec{A}= \begin{bmatrix} 0&\quad 0&\quad \rho _{\mathrm{c}1}&\quad 0&\quad 0&0 \\ 0&\quad 0&\quad \rho _{\mathrm{m}1}&\quad 0&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 0&\quad 0&\quad \rho _{12} \\ 0&\quad 0&\quad 0&\quad 0&\quad 0&\quad \rho _{\mathrm{c}2} \\ 0&\quad 0&\quad 0&\quad 0&\quad 0&\quad \rho _{\mathrm{m}2} \\ 0&\quad 0&\quad 0&\quad 0&\quad 0&\quad 0 \end{bmatrix},\quad \varvec{\varLambda } = \begin{bmatrix} 1&\quad 0&\quad 0&\quad 0&\quad 0&\quad 0 \\ 0&\quad 1&\quad 0&\quad 0&\quad 0&\quad 0 \\ \rho _{\mathrm{c}1}&\quad \rho _{\mathrm{m}1}&1&\quad 0&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 1&\quad 0&\quad 0 \\ 0&\quad 0&\quad 0&\quad 0&\quad 1&\quad 0 \\ \rho _{\mathrm{c}1}\rho _{12}&\rho _{\mathrm{m}1}\rho _{12}&\rho _{12}&\rho _{\mathrm{c}2}&\rho _{\mathrm{m}2}&1 \end{bmatrix}. \end{aligned}$$
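The relation \(\varvec{\varLambda } = (\varvec{I}-\varvec{A}^\mathsf {T})^{-1}\) can be checked numerically on the toy network. The following is a minimal sketch in which the numerical weights are arbitrary illustrative values, not estimates from the paper, and the function name `influence_matrix` is hypothetical.

```python
import numpy as np

def influence_matrix(A):
    """Lambda = (I - A^T)^{-1} for a weighted DAG adjacency matrix A,
    where A[i, j] is the weight of the edge from node i to node j."""
    p = A.shape[0]
    return np.linalg.inv(np.eye(p) - A.T)

# Node order: c1, m1, x1, c2, m2, x2. Weights are made-up values.
rho_c1, rho_m1, rho_12, rho_c2, rho_m2 = 0.5, -0.3, 0.8, 0.4, -0.2
A = np.zeros((6, 6))
A[0, 2] = rho_c1   # c1 -> x1
A[1, 2] = rho_m1   # m1 -> x1
A[2, 5] = rho_12   # x1 -> x2
A[3, 5] = rho_c2   # c2 -> x2
A[4, 5] = rho_m2   # m2 -> x2

Lam = influence_matrix(A)
print(np.round(Lam[5], 3))  # row of x2: propagated effect of every node on x2
```

The last row of `Lam` collects the direct effects on \(x_2\) together with the effects of \(c_1\) and \(m_1\) that reach \(x_2\) through \(x_1\), matching the last row of \(\varvec{\varLambda }\) displayed above.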
The latent variable model is of the form
$$\begin{aligned} \varvec{Y}_{j}^{(k)}=\varvec{\varLambda }^{(k)}\varvec{\mu }^{(k)} + \varvec{\varLambda }^{(k)}\varvec{\gamma }_j + \varvec{\epsilon }_j, \end{aligned}$$
where \(k \in \{C,\,T\},\, j \in \{1,\ldots , n\},\, \varvec{\gamma }_j\) is the vector of unknown random effects, \(\varvec{\gamma }_j \sim N(\mathbf {0},\, \sigma _r^2\varvec{I}_p),\) and \(\varvec{\epsilon }_j\) is the vector of random errors, \(\varvec{\epsilon }_j \sim N(\mathbf {0},\, \sigma _\epsilon ^2\varvec{I}_p).\) Obviously, \(E(\varvec{Y}_j^{(k)}) = \varvec{\varLambda }^{(k)} \varvec{\mu }^{(k)}\) and Var\((\varvec{Y}_j^{(k)})=\sigma _{r}^2 \varvec{\varLambda }^{(k)}\varvec{\varLambda }^{(k)\mathsf {T}} + \sigma _{\epsilon }^2\varvec{I}_{p}.\)
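The mean and variance stated above can be verified by simulation. The sketch below draws samples from the latent variable model for a single condition and compares empirical moments with the closed forms; the small influence matrix and all parameter values are hypothetical.

```python
import numpy as np

def model_moments(Lam, mu, sigma_r2, sigma_eps2):
    """Closed-form mean and covariance of Y = Lam mu + Lam gamma + eps."""
    p = Lam.shape[0]
    mean = Lam @ mu
    cov = sigma_r2 * Lam @ Lam.T + sigma_eps2 * np.eye(p)
    return mean, cov

def simulate(Lam, mu, sigma_r2, sigma_eps2, n, rng):
    """Draw n samples; returns a p x n matrix (one column per sample)."""
    p = Lam.shape[0]
    gamma = rng.normal(0.0, np.sqrt(sigma_r2), size=(p, n))
    eps = rng.normal(0.0, np.sqrt(sigma_eps2), size=(p, n))
    return Lam @ (mu[:, None] + gamma) + eps

# Hypothetical 3-node chain with made-up weights.
Lam = np.array([[1.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.25, 0.5, 1.0]])
mu = np.array([1.0, 0.0, -1.0])
mean, cov = model_moments(Lam, mu, sigma_r2=1.0, sigma_eps2=0.5)
Y = simulate(Lam, mu, 1.0, 0.5, n=200_000, rng=np.random.default_rng(1))
```

With this many samples, `Y.mean(axis=1)` and `np.cov(Y)` should agree with `mean` and `cov` up to Monte Carlo error.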
We rearrange data matrix \(\varvec{D},\) random effects \(\varvec{\gamma }_j,\) and random errors \(\varvec{\epsilon }_j\, (j \in \{1,\ldots , n \}\)) into \(np \times 1\) column vectors, denoted by \(\varvec{\mathcal {Y}},\, \varvec{\mathcal {G}}\), and \(\varvec{\mathcal {E}},\) respectively. Let \(\varvec{\beta }\) be the concatenated vector of means \(\varvec{\mu }^{(\mathrm{C})}\) and \(\varvec{\mu }^{(\mathrm{T})},\) which is the network contrast vector for one cancer type of interest and normal control. Then, we have the mixed linear regression model as follows:
$$\begin{aligned} \varvec{\mathcal {Y}} = \varvec{\varPsi } \varvec{\beta } + \varvec{\varPi } \varvec{\mathcal {G}} +\varvec{\mathcal {E}}, \end{aligned}$$
where
$$\begin{aligned} \varvec{\varPsi }= \left( \begin{matrix} \varLambda ^\mathrm{C} &{}\cdots &{} \varLambda ^\mathrm{C} &{} 0 &{} \cdots &{} 0 \\ 0 &{} \cdots &{} 0 &{} \varLambda ^\mathrm{T} &{} \cdots &{} \varLambda ^\mathrm{T} \end{matrix} \right) ^\mathsf {T}, \end{aligned}$$
$$\begin{aligned} \varvec{\varPi } = \mathrm{diag}\left( \varLambda ^\mathrm{C}, \ldots , \varLambda ^\mathrm{C},\, \varLambda ^\mathrm{T},\ldots , \varLambda ^\mathrm{T}\right) ^{\mathsf {T}}, \end{aligned}$$
\(\varvec{\mathcal {G}}\) is the vector of unknown random effects (\(\varvec{\mathcal {G}} \sim N(\mathbf {0},\, \sigma _r^2 \varvec{I})\)), and \(\varvec{\mathcal {E}}\) is the error term following a normal distribution \(\varvec{\mathcal {E}} \sim N(\mathbf {0},\, \sigma _{\epsilon }^2 \varvec{I}).\)
Then, hypothesis testing can be constructed as
$$\begin{aligned} H_0{\text {:}}\,\varvec{l}\varvec{\beta } =0\quad \text {versus}\quad H_1{\text {:}}\, \varvec{l}\varvec{\beta } \ne 0, \end{aligned}\tag{1}$$
where the contrast vector \(\varvec{l}\) is chosen by satisfying the constraint \(\mathbf {1}^{\mathsf {T}}\varvec{l}=0\) and consisting of the effects of influence matrix on the nodes of the (sub)network of interest. To test a specific subnetwork, it is desirable to account for all of the interactions within the subnetwork and to not include any effects from genes outside the subnetwork. That is, \(\varvec{l} = ({-}\varvec{b} \circ \varvec{b} \varLambda ^\mathrm{C},\, \varvec{b} \circ \varvec{b} \varLambda ^\mathrm{T}),\) where \(\varvec{b}\) is the indicator vector of the gene set and \(\circ \) is an element-wise product of vectors. In EMC-NetGSA, \(\varvec{b}=(1,\ldots , 1),\) which contains the whole network topology illustrated in Fig. 1 with p nodes. Thus, the hypothesis testing in  (1) is equivalent to the following:
$$\begin{aligned} H_0{\text {:}}\,\varvec{b}\left( \varvec{\varLambda }^\mathrm{T}\varvec{\mu }^\mathrm{T} - \varvec{\varLambda }^\mathrm{C}\varvec{\mu }^\mathrm{C}\right) =0\quad \text {versus}\quad H_1{\text {:}}\, \varvec{b}\left( \varvec{\varLambda }^\mathrm{T}\varvec{\mu }^\mathrm{T} - \varvec{\varLambda }^\mathrm{C}\varvec{\mu }^\mathrm{C}\right) \ne 0. \end{aligned}$$
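The equivalence between the two forms of the hypothesis when \(\varvec{b}=(1,\ldots , 1)\) can be checked numerically. This is a minimal sketch with made-up two-node influence matrices and means; the function name `contrast_vector` is hypothetical.

```python
import numpy as np

def contrast_vector(b, Lam_C, Lam_T):
    """l = (-b o (b Lam_C), b o (b Lam_T)), with o the element-wise product."""
    b = np.asarray(b, dtype=float)
    return np.concatenate([-b * (b @ Lam_C), b * (b @ Lam_T)])

# Made-up influence matrices and baseline means for two nodes.
Lam_C = np.array([[1.0, 0.0], [0.3, 1.0]])
Lam_T = np.array([[1.0, 0.0], [0.7, 1.0]])
mu_C = np.array([0.0, 1.0])
mu_T = np.array([0.5, 1.2])

b = np.ones(2)
l = contrast_vector(b, Lam_C, Lam_T)
beta = np.concatenate([mu_C, mu_T])          # beta stacks mu_C and mu_T
lhs = l @ beta                               # l beta
rhs = b @ (Lam_T @ mu_T - Lam_C @ mu_C)      # b (Lam_T mu_T - Lam_C mu_C)

# For a proper subset, entries outside the gene set are zeroed out.
l_sub = contrast_vector([1, 0], Lam_C, Lam_T)
```

Here `lhs` and `rhs` coincide because every node belongs to the set. For a proper subset, `l_sub` keeps only the influence terms between members of the set, which is how effects from genes outside the subnetwork are excluded.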
Based on the theory of mixed linear models, the corresponding Wald test statistic is of the form:
$$\begin{aligned} T = \frac{l \hat{\beta } }{\mathrm{SE}(l\hat{\beta })}, \end{aligned}$$
where \(\mathrm{SE}(l\hat{\beta })\) represents the standard error of \(l \hat{\beta }.\) After further calculation, we get
$$\begin{aligned} T = \frac{\varvec{b}(\bar{\varvec{\mathcal {Y}}}^\mathrm{T}-\bar{\varvec{\mathcal {Y}}}^\mathrm{C})}{\sqrt{\hat{\sigma }_r^2\left[ \varvec{b}\left( \frac{1}{n_2}\varvec{\varLambda }^\mathrm{T}\varvec{\varLambda }^{\mathrm{T}^{\prime }}+\frac{1}{n_1}\varvec{\varLambda }^\mathrm{C}\varvec{\varLambda }^{\mathrm{C}^{\prime }}\right) \varvec{b}^{\mathsf {T}}\right] + \hat{\sigma }_{\epsilon }^2\left( \frac{1}{n_1}+\frac{1}{n_2}\right) \varvec{b}\varvec{b}^{\mathsf {T}}}}. \end{aligned}$$
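The statistic above is straightforward to evaluate once the variance components are available. The sketch below takes \(\hat{\sigma }_r^2\) and \(\hat{\sigma }_{\epsilon }^2\) as given inputs (their estimation from the mixed model is not shown), and the function name `wald_t` is hypothetical.

```python
import numpy as np

def wald_t(ybar_T, ybar_C, b, Lam_T, Lam_C, n1, n2, sigma_r2, sigma_eps2):
    """T statistic for H0: b (Lam_T mu_T - Lam_C mu_C) = 0.

    n1 is the number of control samples and n2 the number of tumor samples.
    sigma_r2 and sigma_eps2 are assumed to be already-estimated variance components.
    """
    b = np.asarray(b, dtype=float)
    numerator = b @ (ybar_T - ybar_C)
    network_part = b @ (Lam_T @ Lam_T.T / n2 + Lam_C @ Lam_C.T / n1) @ b
    noise_part = (1.0 / n1 + 1.0 / n2) * (b @ b)
    return numerator / np.sqrt(sigma_r2 * network_part + sigma_eps2 * noise_part)

# Sanity check with no network edges (Lam = I) and unit variance components.
I2 = np.eye(2)
t_val = wald_t(np.array([1.0, 1.0]), np.zeros(2), np.ones(2),
               I2, I2, n1=10, n2=10, sigma_r2=1.0, sigma_eps2=1.0)
```

In this edge-free sanity check the denominator is \(\sqrt{0.8}\), so `t_val` equals \(2/\sqrt{0.8}=\sqrt{5}\); with a nontrivial influence matrix the network term in the denominator changes accordingly.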
Under the null hypothesis, T has an approximate t-distribution with the degrees of freedom df estimated using Satterthwaite approximation:
$$\begin{aligned} \mathrm{df} = \frac{\left( \frac{s_\mathrm{C}^2}{n_1}+\frac{s_\mathrm{T}^2}{n_2}\right) ^2}{\frac{1}{n_1-1}\left( \frac{s_\mathrm{C}^2}{n_1}\right) ^2 +\frac{1}{n_2-1}\left( \frac{s_\mathrm{T}^2}{n_2}\right) ^2 }, \end{aligned}$$
where \(s_\mathrm{C}^2 = \hat{\sigma }_r^2 \varvec{b}\varvec{\varLambda }^\mathrm{C}\varvec{\varLambda }^{\mathrm{C}^{\prime }}\varvec{b}^{\mathsf {T}}+\hat{\sigma }_{\epsilon }^2,\,s_\mathrm{T}^2 = \hat{\sigma }_r^2 \varvec{b}\varvec{\varLambda }^\mathrm{T}\varvec{\varLambda }^{\mathrm{T}^{\prime }}\varvec{b}^{\mathsf {T}}+\hat{\sigma }_{\epsilon }^2.\)
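A short sketch of the degrees-of-freedom calculation follows. It implements the standard Welch–Satterthwaite form, with group variances defined as in the expressions for \(s_\mathrm{C}^2\) and \(s_\mathrm{T}^2\) above; the function names and the example inputs are hypothetical.

```python
import numpy as np

def group_variance(b, Lam, sigma_r2, sigma_eps2):
    """s^2 = sigma_r^2 * b Lam Lam^T b^T + sigma_eps^2, as defined in the text."""
    b = np.asarray(b, dtype=float)
    return sigma_r2 * (b @ Lam @ Lam.T @ b) + sigma_eps2

def satterthwaite_df(s2_C, s2_T, n1, n2):
    """Standard Welch-Satterthwaite approximation to the degrees of freedom."""
    numerator = (s2_C / n1 + s2_T / n2) ** 2
    denominator = (s2_C / n1) ** 2 / (n1 - 1) + (s2_T / n2) ** 2 / (n2 - 1)
    return numerator / denominator

# Equal variances and equal sample sizes give df = n1 + n2 - 2.
df_balanced = satterthwaite_df(1.0, 1.0, 10, 10)

# Unbalanced case with made-up variances, using the BLCA sample sizes of Table 1.
df_unbalanced = satterthwaite_df(2.0, 1.0, 15, 406)
```

The approximation always lies between \(\min (n_1,\,n_2)-1\) and \(n_1+n_2-2\), so with a small normal group the reference t-distribution is governed largely by the number of normal samples.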

2.2.2 EC-NetGSA: Integrated NetGSA for mRNAs and CNVs

The only difference between EC-NetGSA and EMC-NetGSA is that we only integrate two types of measurements: mRNAs and CNVs. We eliminate the methylation nodes and their regulatory relationship to the corresponding mRNAs in Fig. 1. Thus, the network of interest consists of mRNA–mRNA interactions as well as CNV–mRNA interactions, which results in reduced weighted \((p_\mathrm{e}+p_\mathrm{c})\times (p_\mathrm{e}+p_\mathrm{c})\) adjacency matrices \(\varvec{A}_\mathrm{EC}^\mathrm{C}\) and \(\varvec{A}_\mathrm{EC}^\mathrm{T}\) as well as reduced \((p_\mathrm{e}+p_\mathrm{c})\times (p_\mathrm{e}+p_\mathrm{c})\) influence matrices \(\varvec{\varLambda }_\mathrm{EC}^\mathrm{C}\) and \(\varvec{\varLambda }_\mathrm{EC}^\mathrm{T}.\) Then, we employ the integrated NetGSA approach to this reduced BRAF network with CNVs and mRNAs by testing
$$\begin{aligned} H_0{\text {:}}\,\varvec{b}_\mathrm{EC}\left( \varvec{\varLambda }_\mathrm{EC}^\mathrm{T}\varvec{\mu }_\mathrm{EC}^\mathrm{T} - \varvec{\varLambda }_\mathrm{EC}^\mathrm{C}\varvec{\mu }_\mathrm{EC}^\mathrm{C}\right) =0\quad \text {versus}\quad H_1{\text {:}}\,\varvec{b}_\mathrm{EC}\left( \varvec{\varLambda }_\mathrm{EC}^\mathrm{T}\varvec{\mu }_\mathrm{EC}^\mathrm{T} - \varvec{\varLambda }_\mathrm{EC}^\mathrm{C}\varvec{\mu }_\mathrm{EC}^\mathrm{C}\right) \ne 0, \end{aligned}$$
where \(\varvec{b}_\mathrm{EC}\) is a (\(p_\mathrm{e}+p_\mathrm{c}\))-length vector with elements as 1s, \(\varvec{\mu }_\mathrm{EC}^\mathrm{C}\) is the baseline vector with \(p_\mathrm{e}+p_\mathrm{c}\) elements for the control condition, while \(\varvec{\mu }_\mathrm{EC}^\mathrm{T}\) is a baseline vector with \(p_\mathrm{e}+p_\mathrm{c}\) elements for a tumor condition.
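Restricting the network to a subset of node types amounts to selecting the matching rows and columns of the adjacency matrix and recomputing the influence matrix. The sketch below does this for the six-node toy network of Fig. 2; the weights, the node-type labels, and the function name `reduce_network` are assumptions made for illustration.

```python
import numpy as np

def reduce_network(A, node_types, keep):
    """Keep only nodes whose type is in `keep`; return the reduced adjacency
    matrix, its influence matrix (I - A^T)^{-1}, and the kept node indices."""
    node_types = np.asarray(node_types)
    idx = np.flatnonzero(np.isin(node_types, list(keep)))
    A_red = A[np.ix_(idx, idx)]
    Lam_red = np.linalg.inv(np.eye(len(idx)) - A_red.T)
    return A_red, Lam_red, idx

# Toy network in the order c1, m1, x1, c2, m2, x2 with made-up weights.
node_types = ["cnv", "meth", "mrna", "cnv", "meth", "mrna"]
A = np.zeros((6, 6))
A[0, 2], A[1, 2] = 0.5, -0.3              # c1 -> x1, m1 -> x1
A[2, 5], A[3, 5], A[4, 5] = 0.8, 0.4, -0.2  # x1 -> x2, c2 -> x2, m2 -> x2

# EC-type reduction: drop methylation nodes and their edges.
A_EC, Lam_EC, idx_EC = reduce_network(A, node_types, keep={"mrna", "cnv"})
```

Passing `keep={"mrna", "meth"}` or `keep={"mrna"}` to the same function gives the analogous reductions with methylations only or with mRNAs only.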

2.2.3 EM-NetGSA: Integrated NetGSA for mRNAs and Methylations

In this approach, we integrate mRNAs and methylations, instead of mRNAs and CNVs as in EC-NetGSA. For each gene of interest, we eliminate the CNV nodes and their corresponding regulatory relationship to the downstream mRNAs in Fig. 1. Then, the network of interest is constructed by combining interactions among mRNAs and methylation–mRNA interactions, which results in reduced weighted \((p_\mathrm{e}+p_\mathrm{m})\times (p_\mathrm{e}+p_\mathrm{m})\) adjacency matrices \(\varvec{A}_\mathrm{EM}^\mathrm{C}\) and \(\varvec{A}_\mathrm{EM}^\mathrm{T}\) as well as reduced \((p_\mathrm{e}+p_\mathrm{m})\times (p_\mathrm{e}+p_\mathrm{m})\) influence matrices \(\varvec{\varLambda }_\mathrm{EM}^\mathrm{C}\) and \(\varvec{\varLambda }_\mathrm{EM}^\mathrm{T}.\) Finally, we apply the integrated NetGSA method to this reduced BRAF network with methylations and mRNAs by testing
$$\begin{aligned} H_0{\text {:}}\,\varvec{b}_\mathrm{EM}\left( \varvec{\varLambda }_\mathrm{EM}^\mathrm{T}\varvec{\mu }_\mathrm{EM}^\mathrm{T} - \varvec{\varLambda }_\mathrm{EM}^\mathrm{C}\varvec{\mu }_\mathrm{EM}^\mathrm{C}\right) =0\quad \text {versus}\quad H_1{\text {:}}\,\varvec{b}_\mathrm{EM}\left( \varvec{\varLambda }_\mathrm{EM}^\mathrm{T}\varvec{\mu }_\mathrm{EM}^\mathrm{T} - \varvec{\varLambda }_\mathrm{EM}^\mathrm{C}\varvec{\mu }_\mathrm{EM}^\mathrm{C}\right) \ne 0, \end{aligned}$$
where \(\varvec{b}_\mathrm{EM}\) is a (\(p_\mathrm{e}+p_\mathrm{m}\))-length vector with elements as 1s, \(\varvec{\mu }_\mathrm{EM}^\mathrm{C}\) is the baseline vector with \(p_\mathrm{e}+p_\mathrm{m}\) elements for the control condition, while \(\varvec{\mu }_\mathrm{EM}^\mathrm{T}\) is a baseline vector with \(p_\mathrm{e}+p_\mathrm{m}\) elements for a tumor condition.

2.2.4 NetGSA: NetGSA for mRNAs

To show the effects of integrative analysis of multiple omic data, we also consider the original NetGSA approach in [10] for the BRAF pathway shown in Fig. 1 with gene expression only. We update the data matrix accordingly which only consists of mRNA features. Then, we test the following hypothesis:
$$\begin{aligned} H_0{\text {:}}\,\varvec{b}_\mathrm{E}\left( \varvec{\varLambda }_\mathrm{E}^\mathrm{T}\varvec{\mu }_\mathrm{E}^\mathrm{T} - \varvec{\varLambda }_\mathrm{E}^\mathrm{C}\varvec{\mu }_\mathrm{E}^\mathrm{C}\right) =0\quad \text {versus}\quad H_1{\text {:}}\,\varvec{b}_\mathrm{E}\left( \varvec{\varLambda }_\mathrm{E}^\mathrm{T}\varvec{\mu }_\mathrm{E}^\mathrm{T} -\varvec{\varLambda }_\mathrm{E}^\mathrm{C}\varvec{\mu }_\mathrm{E}^\mathrm{C}\right) \ne 0, \end{aligned}$$
where \(\varvec{b}_\mathrm{E}\) is a (\(p_\mathrm{e}\))-length vector with elements as 1s, \(\varvec{\varLambda }_\mathrm{E}^\mathrm{C}\) is the \(p_\mathrm{e}\times p_\mathrm{e}\) influence matrix with mRNA–mRNA interactions only for the control condition, \(\varvec{\varLambda }_\mathrm{E}^\mathrm{T}\) is a \(p_\mathrm{e}\times p_\mathrm{e}\) influence matrix for a tumor condition, \(\varvec{\mu }_\mathrm{E}^\mathrm{C}\) is the baseline vector with \(p_\mathrm{e}\) elements for the control condition, while \(\varvec{\mu }_\mathrm{E}^\mathrm{T}\) is a baseline vector with \(p_\mathrm{e}\) elements for a tumor condition.

2.2.5 Gene-Set: Gene Set Tests for mRNAs

Finally, to demonstrate the effects of data integration of diverse omics and incorporating network topology in the analysis, we consider the following setting. In this scenario, we ignore network topology as shown in Fig. 1 and only focus on mRNAs. In other words, we treat the BRAF pathway as gene sets by ignoring the underlying network topology. The self-contained gene set test [14] is performed for the gene sets involved in the BRAF pathway shown in Fig. 1.

3 Results

We performed the analyses on the 11 cancer types described in Sect. 2.1 using the five analysis approaches discussed above. With 0.05 as the significance cutoff for p values, we found that the BRAF pathway is differentially expressed in five tumor types, BRCA, COAD, KIRC, KIRP, and LIHC, compared with the normal control (Fig. 3). The relative performance of the five approaches in these significant cancer types suggests the advantage of integrating network topologies and additional types of regulatory elements in the analysis. Comparing EMC-NetGSA, EC-NetGSA, and EM-NetGSA with NetGSA shows that adding molecular regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve our power of discovering tumorous aberrances. Also, the comparison between EC-NetGSA and EM-NetGSA suggests that incorporating CNVs with the baseline mRNA molecules can be more beneficial than incorporating methylations. Moreover, employing regulatory topologies can improve the discoveries of tumorous aberrances, as shown by the comparison of the NetGSA-type approaches (EMC-NetGSA, EC-NetGSA, EM-NetGSA, and NetGSA) with the Gene-Set approach. The deviation of each tumorous condition from the corresponding normal tissue is shown in Fig. 4, represented by the approximate T statistic in EMC-NetGSA. Bar height is the calculated statistic for each cancer type. As shown in Fig. 4, our analysis suggests inhibitive BRAF pathway aberrances in KIRP, COAD, KIRC, LIHC, and BRCA with p values smaller than 0.05. Such analysis shows the similarities and differences among these cancers in terms of BRAF pathway disturbance, which suggests directions for transferring knowledge from well-studied cancer types to less explored ones.
Fig. 3

Comparison of results for all approaches across the 11 cancer types. In each subplot, bar heights show the −log(p value), indicated on the Y-axis, for the five approaches described in Sect. 2.2. The horizontal line in each subplot indicates the p value cutoff (0.05) used in the paper

Fig. 4

T statistic comparison for EMC-NetGSA across the 11 cancer types. Bar height is the calculated T statistic for each cancer type, respectively

4 Simulations

We performed simulations to investigate the effects of diverse types of regulatory elements and topologies in differential analysis. We adapt the simulation design in [9] and build network structures based on a binary tree. Specifically, the treatment group adjacency matrix is a four-level binary tree. The control group has the same network topology as the treatment group, except that all edges in the left branch of the tree (including the root) are removed. Expression correlations are set for both adjacency matrices such that (1) the association for the first three levels of the genes in the network (top 7 genes in the tree) is assumed to be 0.8, (2) genes in the next three levels (56 genes) have association equal to 0.5, and (3) the remainder of the genes are weakly associated with \(\rho =0.2.\) The data matrix is then augmented to include the corresponding observations for CNV and methylation data. A directed edge is added from each CNV and methylation node, to the corresponding expression node. We set the correlation between expression and CNV nodes as 0.5, while the correlation between expression and methylation nodes is −0.25. The simulated networks are shown in Fig. 5. We assume that the model for the observed, integrated data y is
$$\begin{aligned} y \sim N_{3p}\left( \varLambda \mu ,\, \sigma ^{2}_{\gamma }\varLambda \varLambda ^{\prime } + \sigma ^{2}_{\epsilon }\varvec{I}_{3p}\right) , \end{aligned}\tag{2}$$
where
$$\begin{aligned} \varLambda = \left( \varvec{I}_{3p} - \varvec{A}\right) ^{-1}, \end{aligned}\tag{3}$$
and \(\varvec{A},\,\varLambda \) vary according to whether the observed subject is in the control or the treatment group, and \(\sigma ^{2}_{\gamma } = 5,\, \sigma ^{2}_{\epsilon } = 0.5.\)
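The network construction described above can be sketched as follows. Several details are assumptions made only for this example: the tree depth is a free parameter, nodes use heap indexing with the root at index 0, an edge takes the weight of the level of its child node, `A[i, j]` stores the effect of node j on node i (so that \(\varLambda = (\varvec{I}_{3p} - \varvec{A})^{-1}\) applies directly), and the control network drops every edge pointing into a node of the left branch.

```python
import numpy as np

def level_of(node):
    """Level of a node under heap indexing (root = node 0 = level 0)."""
    return (node + 1).bit_length() - 1

def edge_weight(level):
    """Assumed rule: the weight is set by the level of the child node."""
    if level < 3:
        return 0.8
    if level < 6:
        return 0.5
    return 0.2

def in_left_branch(node):
    """True if the node lies in the subtree rooted at the root's left child."""
    while node > 2:
        node = (node - 1) // 2
    return node == 1

def build_network(depth, drop_left_branch=False):
    """Return (A, Lam) for p expression, p CNV and p methylation nodes."""
    p = 2 ** depth - 1
    A = np.zeros((3 * p, 3 * p))
    for child in range(1, p):
        if drop_left_branch and in_left_branch(child):
            continue  # control group: edges into left-branch nodes are removed
        A[child, (child - 1) // 2] = edge_weight(level_of(child))
    for g in range(p):
        A[g, p + g] = 0.5         # CNV -> expression
        A[g, 2 * p + g] = -0.25   # methylation -> expression
    Lam = np.linalg.inv(np.eye(3 * p) - A)
    return A, Lam

A_T, Lam_T = build_network(depth=4)                         # treatment group
A_C, Lam_C = build_network(depth=4, drop_left_branch=True)  # control group
```

Samples for either group could then be drawn from a multivariate normal distribution with mean \(\varLambda \mu \) and the covariance given in (2), using the corresponding `Lam`.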
Fig. 5

Networks used in the simulation

We first take the sample size \(n_\mathrm{c}=n_\mathrm{t}=n=150,\) and generate n observations from the control group, and n observations from the treatment group, and consider these seven gene sets:
  (1) all genes in the network;
  (2) top one-third levels of the tree;
  (3) first two-third levels of the tree;
  (4) the last level of the tree;
  (5) left branch of the tree (including the root);
  (6) right branch of the tree (excluding the root);
  (7) 20% of the genes in the network selected randomly.
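The seven gene sets can be expressed as index sets over the expression nodes of the tree. In the sketch below the depth is a free parameter, nodes use heap indexing with the root at index 0, and "one-third" and "two-third" of the levels are taken as `depth // 3` and `2 * depth // 3`; these conventions and the function names are assumptions made for illustration.

```python
import numpy as np

def in_left_branch(node):
    """True if the node lies in the subtree rooted at the root's left child."""
    while node > 2:
        node = (node - 1) // 2
    return node == 1

def make_gene_sets(depth, rng, frac=0.2):
    """Return a dict mapping gene-set number (1-7) to an array of node indices."""
    p = 2 ** depth - 1
    nodes = np.arange(p)
    level = np.array([(i + 1).bit_length() - 1 for i in range(p)])
    left = np.array([in_left_branch(i) for i in range(p)])
    top_third = max(1, depth // 3)
    two_thirds = max(1, (2 * depth) // 3)
    return {
        1: nodes,                              # all genes
        2: nodes[level < top_third],           # top one-third levels
        3: nodes[level < two_thirds],          # first two-third levels
        4: nodes[level == depth - 1],          # last level
        5: nodes[left | (nodes == 0)],         # left branch, including the root
        6: nodes[~left & (nodes != 0)],        # right branch, excluding the root
        7: np.sort(rng.choice(p, size=int(round(frac * p)), replace=False)),
    }

sets = make_gene_sets(depth=6, rng=np.random.default_rng(0))
sizes = [len(sets[k]) for k in range(1, 8)]
```

Each index set can be converted to an indicator vector \(\varvec{b}\) and passed to the test, with gene sets 5 and 6 partitioning the tree between them.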

     
We consider four mean scenarios. Denoting the means for gene expression, CNV, and methylation by \(\mu _{1},\, \mu _{2},\,\) and \(\mu _{3},\) respectively, the scenarios are
  (1) \(\mu _{1}^\mathrm{T}=\mu _{1}^\mathrm{C}=0,\, \mu _{2}^\mathrm{T}=\mu _{2}^\mathrm{C}=0,\, \mu _{3}^\mathrm{T}=\mu _{3}^\mathrm{C}=0;\)
  (2) \(\mu _{1}^\mathrm{T}=0.25,\, \mu _{2}^\mathrm{T}=1,\, \mu _{3}^\mathrm{T}=0.5\) for top one-third levels, same as scenario 1 for the rest of the tree;
  (3) \(\mu _{1}^\mathrm{T}=0.25,\, \mu _{2}^\mathrm{T}=1,\, \mu _{3}^\mathrm{T}=0.5\) for top two-third levels, same as scenario 1 for the rest of the tree;
  (4) \(\mu _{1}^\mathrm{T}=0.25,\, \mu _{2}^\mathrm{T}=1, \,\mu _{3}^\mathrm{T}=0.5\) for left branch of tree (including the root), same as scenario 1 for the rest of the tree.

     
We replicate the simulation 1000 times. For each replicate of the simulation, NetGSA is performed using all seven gene sets, in all four scenarios. The power is then calculated over all replicates, for each test in each scenario, by checking whether the B–H adjusted p value [1] is less than 0.05. Thus, the power represents the proportion of simulations in which the null hypothesis is rejected. We consider gene-set scenarios where all members in a gene set are differential, some but not all members are differential, and none of the members are differential. For the first scenario, we expect the power to be close to 1. For the second scenario, we expect nonzero powers, ideally approaching 1. For the third scenario, we expect the power to equal 0. We summarize our results in Fig. 6 and Table 2 in the Appendix. The first mean scenario corresponds to \(\varLambda ^\mathrm{C} \mu ^\mathrm{C} =\varLambda ^\mathrm{T} \mu ^\mathrm{T}.\) All the methods have nominal power. In the second mean scenario, the top one-third levels of the tree are differential. Therefore, it is natural to expect significance for the overall expression levels of gene sets 1–3, 5, and 6, which contain genes in the top one-third levels of the tree, whereas gene set 4 is expected not to be significant. Gene set 7 has a chance of containing differentially expressed genes, so a certain level of significance is expected for it. The power patterns of EMC-NetGSA, EC-NetGSA, EM-NetGSA, and NetGSA are consistent with the expectations in all cases. Moreover, incorporating more true regulations is beneficial to the analysis, whereas the Gene-Set approach does not show the expected powers for gene sets 1 and 4–7. Similar analyses of the results can be conducted for mean scenarios 3 and 4.
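The power calculation can be sketched as follows, assuming that the B–H adjustment is applied across the gene sets within each replicate (the text does not specify the grouping). The function names and the small p value matrix are hypothetical.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: running minimum taken from the largest p value down.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

def empirical_power(pval_matrix, alpha=0.05):
    """Proportion of replicates (rows) in which each gene set (column) is rejected.

    Assumption: the adjustment is applied across gene sets within each replicate.
    """
    adjusted = np.apply_along_axis(bh_adjust, 1, np.asarray(pval_matrix, dtype=float))
    return (adjusted < alpha).mean(axis=0)

adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
power = empirical_power([[0.001, 0.9],
                         [0.2, 0.8]])
```

In a full simulation, `pval_matrix` would have one row per replicate (1000 rows here) and one column per gene set, computed separately for each method and mean scenario.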
Overall, in the simulation studies, the Gene-Set approach has low accuracy, with the method sometimes missing differentially expressed sets (gene set 2, scenario 3), or falsely identifying equal means as differentially expressed (gene set 4, method 3). The NetGSA methods have high power for all gene sets that are fully or partially differentially expressed. Gene set 4 in mean scenarios 1–3 is not differentially expressed, but the four NetGSA methods successfully indicate this with low power (low false-positive rates). Gene set 6 in mean scenario 4 is also not differentially expressed, and these four NetGSA methods reflect this with low power, and the powers are largely comparable. Three integrative methods (EMC-NetGSA, EC-NetGSA, EM-NetGSA) have slightly higher power values (higher rate of false positives), presumably because of the increased noise introduced by the network expansion.
Fig. 6 Simulation power by method, mean scenario, and gene set. The powers are calculated based on the B–H FDR controlling procedure [1] with a q value of 0.05. Balanced sample sizes with \(n_\mathrm{c}=n_\mathrm{t}=500\)

Furthermore, sample sizes in real datasets are often unbalanced between normal and diseased groups, as with the TCGA sample sizes shown in Table 1. Thus, we also performed simulations with unbalanced sample sizes: one case with \(n_\mathrm{c}=50\) and \(n_\mathrm{t}=500,\) and another with \(n_\mathrm{c}=10\) and \(n_\mathrm{t}=500.\) We summarize the results in the Appendix; see Fig. 7 and Table 3 for the case of \(n_\mathrm{c}=50,\) as well as Fig. 8 and Table 4 for the case of \(n_\mathrm{c}=10.\) As expected, unbalanced sample sizes can affect the power of the test. Moreover, as the sample sizes become more severely unbalanced, the power for detecting gene sets containing differentially expressed genes decreases. Nevertheless, we still observe that incorporating network topologies and additional omic data into the gene-set significance analyses improves performance.
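The effect of imbalance on power can be illustrated with a much simpler test than NetGSA. The sketch below uses the closed-form power of a two-sided two-sample z-test with known, equal variance, which is a stand-in chosen for illustration only; the effect size `delta=0.3` and unit variance are assumptions, not values from the simulation study.

```python
from statistics import NormalDist


def z_test_power(n_c, n_t, delta, sigma=1.0, alpha=0.05):
    """Power of a two-sided two-sample z-test for a mean difference delta.

    n_c, n_t: control and tumor sample sizes.
    sigma: common, known standard deviation (an assumption of this sketch).
    """
    std_norm = NormalDist()
    # Standard error of the difference in sample means.
    se = sigma * (1.0 / n_c + 1.0 / n_t) ** 0.5
    z_crit = std_norm.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / se
    # Probability that the test statistic falls in either rejection tail.
    return std_norm.cdf(shift - z_crit) + std_norm.cdf(-shift - z_crit)


# Same control sample sizes as in the simulations, with n_t fixed at 500.
for n_c in (500, 50, 10):
    print(n_c, round(z_test_power(n_c, 500, delta=0.3), 3))
```

Because the standard error is driven by the smaller group through the \(1/n_\mathrm{c}\) term, shrinking the control group lowers the power even though the tumor group stays at 500 samples.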

5 Discussion

In this paper, we have investigated the effects of incorporating multiple omics and network topology on pathway differential analysis through simulations and the dissection of BRAF pathway aberrancy among 11 cancer types. We found that adding regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve the statistical power of discovering tumorous aberrances. Incorporating CNVs with the baseline mRNA molecules is likely to be more beneficial than incorporating methylations. Moreover, employing regulatory topologies can dramatically improve the discovery of tumorous aberrances compared with the gene-set analysis approach.

Furthermore, by integrating CNVs, methylations, gene expression, and pathway topologies, our analysis reveals similarities and differences among 11 cancer types based on disturbance of the BRAF pathway. Such analysis can be easily adapted to other important oncogenic pathways. Our analysis of the BRAF pathway serves as a pilot study, which suggests a promising systems biology approach to pan-cancer analysis through the exploration of large-scale biological networks. We used CNVs, methylations, and mRNAs as an illustration of integrative analysis. For future work, it is important to consider other types of regulatory elements, such as proteins. We expect that a comprehensive integration of data on diverse regulatory elements will provide interesting and important insights into cancer biology.

Notes

Acknowledgements

We thank all the members of the Statistical and Applied Mathematical Sciences Institute (SAMSI) Data Integration: TCGA Working Group as part of the SAMSI Beyond Bioinformatics Program. We are grateful for the support of Dr. Sujit Ghosh at SAMSI. This research was partially supported by the InCHIP Faculty Affiliate Seed Grant at UConn (to YZ), the Faculty Research Excellence Program Award at UConn (to YZ), the CICATS PreK Career Development Award at UConn (to YZ), and the Research Starter Grant in Informatics from the PhRMA Foundation (to ZO). VB was partially supported by NIH Grants R01 CA160736, R01 CA194391, and P30 CA016672, and by NSF Grant DMS 1463233. This research was also partially supported by National Institutes of Health (NIH) Grants R01 GM59507 (to HZ), P01 CA154295 (to HZ), and P30 CA016359 (to HZ).

References

  1. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
  2. Efron B, Tibshirani R (2007) On testing the significance of sets of genes. Ann Appl Stat 1:107–129
  3. Fallahi-Sichani M, Moerke NJ, Niepel M, Zhang T, Gray NS, Sorger PK (2015) Systematic analysis of BRAFV600E melanomas reveals a role for JNK/c-Jun pathway in adaptive resistance to drug-induced apoptosis. Mol Syst Biol 11(3):797
  4. Hyman DM, Puzanov I, Subbiah V, Faris JE, Chau I, Blay JY, Wolf J, Raje NS, Diamond EL, Hollebecque A et al (2015) Vemurafenib in multiple nonmelanoma cancers with BRAF V600 mutations. N Engl J Med 373(8):726–736
  5. Liu L, Ruan J (2013) Network-based pathway enrichment analysis. In: 2013 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, p 218–221
  6. Liu Q, Dinu I, Adewale AJ, Potter JD, Yasui Y (2007) Comparative evaluation of gene-set analysis methods. BMC Bioinform 8(1):431
  7. Ma J, Shojaie A, Michailidis G (2016) Network-based pathway enrichment analysis with incomplete network information. Bioinformatics 32(20):3165–3174
  8. Maciejewski H (2013) Gene set analysis methods: statistical models and methodological differences. Brief Bioinform. doi: 10.1093/bib/bbt002
  9. Shojaie A, Michailidis G (2009) Analysis of gene sets based on the underlying regulatory network. J Comput Biol 16(3):407–426
  10. Shojaie A, Michailidis G (2010) Network enrichment analysis in complex experiments. Stat Appl Genet Mol Biol. doi: 10.2202/1544-6115.1483
  11. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102(43):15545–15550
  12. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 102(38):13544–13549
  13. Tomczak K, Czerwińska P, Wiznerowicz M (2015) The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol 19(1A):A68
  14. Wu D, Lim E, Vaillant F, Asselin-Labat ML, Visvader JE, Smyth GK (2010) ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics 26(17):2176–2182

Copyright information

© International Chinese Statistical Association 2017

Authors and Affiliations

  • Yuping Zhang (1)
  • M. Henry Linder (2)
  • Ali Shojaie (3)
  • Zhengqing Ouyang (4)
  • Ronglai Shen (5)
  • Keith A. Baggerly (6)
  • Veerabhadran Baladandayuthapani (7)
  • Hongyu Zhao (8)
  1. Department of Statistics, Institute for Systems Genomics, Center for Quantitative Medicine, Institute for Collaboration on Health, Intervention, and Policy, The Connecticut Institute for the Brain and Cognitive Sciences, University of Connecticut, Storrs, USA
  2. Department of Statistics, University of Connecticut, Storrs, USA
  3. Department of Biostatistics, University of Washington, Seattle, USA
  4. The Jackson Laboratory for Genomic Medicine, Farmington, USA
  5. Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA
  6. Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, USA
  7. Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, USA
  8. Department of Biostatistics, Yale School of Public Health, and Department of Genetics, Yale School of Medicine, New Haven, USA
