# Dissecting Pathway Disturbances Using Network Topology and Multi-platform Genomics Data


## Abstract

Complex diseases such as cancers usually result from accumulated disturbance of pathways instead of the disruptions of one or a few major genes. As opposed to single-platform analyses, it is likely that integrating diverse molecular regulatory elements and their interactions can lead to more insights on pathway-level disturbances of biological systems and their potential consequences in disease development and progression. To explore the benefit of pathway-based analysis, we focus on multi-platform genomics, epigenomics, and transcriptomics (-omics, for short) from 11 cancer types collected by The Cancer Genome Atlas project. Specifically, we use a well-studied oncogenic pathway, the BRAF pathway, to investigate the relevant copy number variants (CNVs), methylations, and gene expressions, and quantify their effects on discovering tumor-specific aberrations across multiple tumor lineages. We also perform simulation studies to further investigate the effects of network topology and multiple omics on dissecting pathway disturbances. Our analysis shows that adding molecular regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve our power of discovering tumorous aberrances. Also, incorporating CNVs with the baseline mRNA molecules can be more beneficial than incorporating methylations. Moreover, employing regulatory topologies can improve the discoveries of tumorous aberrances. Finally, our analysis reveals similarities and differences among diverse cancer types based on disturbance of the BRAF pathway.

## Keywords

Data integration, Multi-platform genomics, Network topology, Pathway analysis

## 1 Introduction

Human diseases such as cancers have very complex etiologies involving numerous molecular and environmental factors and their interactions. They usually result from accumulated disturbances of pathways. Recent melanoma research indicates that the BRAF pathway carries one of the most prevalent genetic changes in the development and progression of melanoma [3]. Systematic analysis of the BRAF pathway and its activation in melanomas has stimulated drug development, including inhibitors of RAF/MEK signaling. For instance, vemurafenib, a drug known to be effective in melanoma patients, targets the BRAF mutation. A recent clinical trial shows that vemurafenib can be effective across multiple cancer types beyond melanoma [4]. Thus, systematically delineating BRAF pathway aberrations across multiple cancer types, using data from multiple molecular resolutions, will likely facilitate the treatment of cancers that have been less studied so far.

Methodologically, extensive research has been devoted to assessing the accumulated effects of a set of genes that can be grouped together based on certain criteria, such as sharing a common biological function, chromosomal location, or regulation [5, 6, 8]. These existing pathway/gene-set-based methods are complementary to single-gene analyses. For example, Subramanian et al. [11] proposed gene set enrichment analysis (GSEA), which uses a signed Kolmogorov–Smirnov statistic as the set-level statistic based on the distributions of the gene-level test statistic with and without the gene set of interest. Efron and Tibshirani [2] introduced a “maxmean” statistic to detect significant gene sets. These methods and their variants [12] employ a permutation-based procedure to obtain *p* values of gene sets. Some network-based pathway enrichment analysis methods have been proposed to leverage the information in the network structure (topology). For instance, Shojaie and Michailidis [9] proposed a gene set analysis method based on the underlying regulatory network; this framework was later extended to the scenario with incomplete network information [7] and to the analysis of complex experiments [10]. Specifically, this framework, named network-based gene set analysis (NetGSA), combines the ideas of gene-set analysis methods and network-based single-gene analysis. The existing pathway-based approaches can potentially be improved by simultaneously leveraging a priori biological network information and integrating different types of omic data.

In this paper, we investigate the disturbances of the BRAF pathway relative to normal tissues for multiple cancer types, leveraging different types of omic data, including transcriptomic (mRNA), genomic (copy number), and epigenomic (methylation) data, and incorporating the underlying BRAF pathway topology. Through such analyses, we hope to identify the similarities and differences among multiple cancer types based on disturbance of the BRAF pathway. To investigate the effects of integrating diverse omic data and incorporating network topology in the analysis, we compare approaches with and without these aspects and quantify the effects. Specifically, our approach, denoted by EMC-NetGSA, simultaneously considers a priori biological network information and integrates the three types of omic data. Analogously, the approach that integrates only transcriptomic and genomic data is named EC-NetGSA, while the approach that integrates transcriptomic and epigenomic data is named EM-NetGSA. The standard NetGSA approach [9] and a self-contained gene set test [14] were performed using transcriptomic data only, denoted by NetGSA and Gene-Set, respectively. We found that leveraging a priori biological network information and integrating different types of omic data is beneficial for discovering tumor-specific pathway disturbances.

Table 1 Sample sizes of normal and tumorous tissues for each cancer type

Cancer type | Normal sample size | Tumor sample size |
---|---|---|
BLCA | 15 | 406 |
BRCA | 64 | 759 |
COAD | 19 | 276 |
HNSC | 20 | 513 |
KIRC | 24 | 313 |
KIRP | 21 | 273 |
LIHC | 39 | 364 |
LUAD | 20 | 448 |
PRAD | 35 | 491 |
THCA | 48 | 502 |
UCEC | 22 | 172 |

## 2 Materials and Methods

### 2.1 BRAF Pathway and Multi-platform Genomics Data

We consider the cancer types where both tumorous and the corresponding normal tissues are available in The Cancer Genome Atlas (TCGA) [13] database. As for omic data, we consider methylation, copy number variant (CNV), and mRNA measurements, and only analyze cancers that have more than nine samples with methylation, CNV, and mRNA measurements for both tumorous and normal conditions. This resulted in 11 cancer types as summarized in Table 1, including bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), prostate adenocarcinoma (PRAD), thyroid carcinoma (THCA), and uterine corpus endometrial carcinoma (UCEC).

### 2.2 Methods

#### 2.2.1 EMC-NetGSA: Integrated NetGSA for mRNAs, CNVs, and Methylations

In this approach, we adapt and extend the NetGSA framework [10] using all three types of omics data, i.e., mRNAs, CNVs, and methylations, based on the BRAF pathway interactions with the three types of regulations. Figure 1 illustrates these with the weighted networks for KIRP in both tumorous and normal tissues. The width of each edge is proportional to the absolute value of the calculated partial correlation, and the color of each edge indicates its sign: a blue dashed line indicates a negative correlation, which reflects inhibitive regulation, while a red solid line indicates a positive correlation, which reflects active regulation. Large white nodes indicate mRNAs, small black rectangles indicate CNVs, and small gray rectangles indicate methylations. The regulation patterns differ between tumorous and normal tissues, as KIRP illustrates in Fig. 1.

The *n* samples are organized in a \(p \times n\) data matrix \(\varvec{D}.\) Let \(\varvec{Y}_j^{(k)}\,(j=1,\ldots , n;\,k\in \{C,\, T\})\) be the *j*th sample in the data matrix under condition *k*, with the first \(n_1\) columns of \(\varvec{D}\) corresponding to condition C (normal control) and the last \(n_2 = n-n_1\) columns corresponding to condition T (cancer type of interest). The adjacency matrix of the network, denoted by \(\varvec{A},\) includes interaction relationships across all three types of regulatory elements indicated in Fig. 1. The network topology is incorporated into the analysis through an influence matrix \(\varvec{\varLambda },\) which is calculated based on the corresponding adjacency matrix \(\varvec{A}.\) For any directed acyclic graph, \(\varvec{\varLambda } = (\varvec{I}-\varvec{A}^\mathsf {T})^{-1}\) as shown in [9]. In our analysis, we employ a weighted adjacency matrix with partial correlations as the weights. We use KIRP as an example to show the weighted networks in both tumorous and normal tissues (Fig. 1).

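As an illustration with two genes under condition *k*, let \(\varvec{r}\) denote the baseline latent variable vector of regulatory nodes [assuming \(\varvec{r}\sim N(\varvec{\mu },\,\sigma _r^2 \varvec{I}_p)\)], and let \(\rho \) denote a weight in the adjacency matrix. The underlying genomic signals for CNVs are \(c_1 = r_{\mathrm{c}1}\) and \(c_2 = r_{\mathrm{c}2}.\) Similarly, the underlying epigenetic signals for methylations are \(m_1 = r_{\mathrm{m}1}\) and \(m_2 = r_{\mathrm{m}2}.\) The transcriptomic signals for mRNAs are \(x_1 = \rho _{\mathrm{c}1} r_{\mathrm{c}1} + \rho _{\mathrm{m}1} r_{\mathrm{m}1} + r_1\) and \(x_2 = \rho _{12} \rho _{\mathrm{c}1} r_{\mathrm{c}1} + \rho _{12} \rho _{\mathrm{m}1} r_{\mathrm{m}1} + \rho _{12} r_1 + \rho _{\mathrm{c}2} r_{\mathrm{c}2} + \rho _{\mathrm{m}2} r_{\mathrm{m}2} + r_2.\) The adjacency matrix \(\varvec{A}\) characterizes the regulatory effects of each node on its immediate neighbors, while the influence matrix represents the propagated effect of each node on all other nodes in the network. In the order of \(c_1,\, m_1,\, x_1,\, c_2,\, m_2\), and \(x_2,\) we have the following \(\varvec{A}\) and \(\varvec{\varLambda }{\text {:}}\)

\[
\varvec{A} = \begin{pmatrix}
0 & 0 & \rho _{\mathrm{c}1} & 0 & 0 & 0 \\
0 & 0 & \rho _{\mathrm{m}1} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \rho _{12} \\
0 & 0 & 0 & 0 & 0 & \rho _{\mathrm{c}2} \\
0 & 0 & 0 & 0 & 0 & \rho _{\mathrm{m}2} \\
0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},\qquad
\varvec{\varLambda } = (\varvec{I}-\varvec{A}^\mathsf {T})^{-1} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
\rho _{\mathrm{c}1} & \rho _{\mathrm{m}1} & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
\rho _{12}\rho _{\mathrm{c}1} & \rho _{12}\rho _{\mathrm{m}1} & \rho _{12} & \rho _{\mathrm{c}2} & \rho _{\mathrm{m}2} & 1
\end{pmatrix}.
\]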
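The relation between the adjacency and influence matrices for the two-gene network described above (two mRNAs, each with one CNV and one methylation regulator) can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes the convention that entry (i, j) of \(\varvec{A}\) is the weight of the edge from node i to node j, uses hypothetical numeric weights, and exploits the fact that for a directed acyclic graph \((\varvec{I}-\varvec{A}^\mathsf {T})^{-1}\) equals the finite sum \(\varvec{I} + \varvec{A}^\mathsf {T} + (\varvec{A}^\mathsf {T})^2 + \cdots \).

```python
# Node order assumed for this sketch: c1, m1, x1, c2, m2, x2.
NODES = ["c1", "m1", "x1", "c2", "m2", "x2"]

# Hypothetical edge weights (source, target) -> rho; the values are made up.
WEIGHTS = {
    ("c1", "x1"): 0.5,   # rho_c1
    ("m1", "x1"): -0.3,  # rho_m1
    ("x1", "x2"): 0.8,   # rho_12
    ("c2", "x2"): 0.4,   # rho_c2
    ("m2", "x2"): -0.2,  # rho_m2
}


def build_adjacency(weights):
    """Weighted adjacency matrix with A[i][j] = weight of edge i -> j."""
    idx = {name: i for i, name in enumerate(NODES)}
    A = [[0.0] * len(NODES) for _ in NODES]
    for (src, dst), w in weights.items():
        A[idx[src]][idx[dst]] = w
    return A


def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def influence_matrix(A):
    """(I - A^T)^{-1} for a DAG, computed as I + B + B^2 + ... with B = A^T.

    B is nilpotent for a DAG on p nodes, so the series stops after p - 1 terms.
    """
    p = len(A)
    B = [[A[j][i] for j in range(p)] for i in range(p)]
    total = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    term = [row[:] for row in total]
    for _ in range(p - 1):
        term = matmul(term, B)
        total = [[total[i][j] + term[i][j] for j in range(p)] for i in range(p)]
    return total


A = build_adjacency(WEIGHTS)
LAMBDA = influence_matrix(A)

# Row x2 holds the coefficients of (r_c1, r_m1, r_1, r_c2, r_m2, r_2) in x2.
for name, row in zip(NODES, LAMBDA):
    print(name, [round(v, 3) for v in row])
```

Under these assumptions, the row of the influence matrix for \(x_2\) contains the products \(\rho _{12}\rho _{\mathrm{c}1}\) and \(\rho _{12}\rho _{\mathrm{m}1}\), which is the propagated effect of the regulators of \(x_1\) on \(x_2\) that appears in the expression for \(x_2\) in the text.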

*p* nodes. Thus, the hypothesis testing in (1) is equivalent to the following:

*T* has an approximate *t*-distribution with the degrees of freedom df estimated using Satterthwaite approximation:
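As a simpler analogue of this idea (not the mixed-model estimator used by NetGSA), the classic two-sample Welch–Satterthwaite approximation shows how degrees of freedom are estimated when the two groups have different variances and sizes. The function name and the toy data below are hypothetical.

```python
import math
import statistics


def welch_t_and_df(control, tumor):
    """Two-sample Welch t statistic and Satterthwaite degrees of freedom.

    This is the textbook two-sample case, used only to illustrate the
    approximation; it is not the NetGSA test statistic.
    """
    n_c, n_t = len(control), len(tumor)
    # Estimated variance of each group mean.
    v_c = statistics.variance(control) / n_c
    v_t = statistics.variance(tumor) / n_t
    t_stat = (statistics.mean(tumor) - statistics.mean(control)) / math.sqrt(v_c + v_t)
    # Satterthwaite: match the first two moments of a scaled chi-square.
    df = (v_c + v_t) ** 2 / (v_c ** 2 / (n_c - 1) + v_t ** 2 / (n_t - 1))
    return t_stat, df


t_stat, df = welch_t_and_df([1, 2, 3, 4], [2, 3, 4, 5])
print(round(t_stat, 4), round(df, 4))
```

With equal variances and equal group sizes the approximation reduces to the pooled value \(2n-2\); with strongly unbalanced groups the estimated df moves toward the degrees of freedom of the group that dominates the variance of the difference.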

#### 2.2.2 EC-NetGSA: Integrated NetGSA for mRNAs and CNVs

#### 2.2.3 EM-NetGSA: Integrated NetGSA for mRNAs and Methylations

#### 2.2.4 NetGSA: NetGSA for mRNAs

#### 2.2.5 Gene-Set: Gene Set Tests for mRNAs

Finally, to demonstrate the effects of data integration of diverse omics and incorporating network topology in the analysis, we consider the following setting. In this scenario, we ignore network topology as shown in Fig. 1 and only focus on mRNAs. In other words, we treat the BRAF pathway as gene sets by ignoring the underlying network topology. The self-contained gene set test [14] is performed for the gene sets involved in the BRAF pathway shown in Fig. 1.

## 3 Results

Based on the *p* values, we found that the BRAF pathway was differentially expressed in five tumors, BRCA, COAD, KIRC, KIRP, and LIHC, compared with the normal control (Fig. 3). The relative performance of the five approaches in these significant cancer types suggests the advantage of integrating the network topologies and additional types of regulatory elements in the analysis. Comparing EMC-NetGSA, EC-NetGSA, and EM-NetGSA with NetGSA shows that adding molecular regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve the power to discover tumorous aberrances. Also, incorporating CNVs with the baseline mRNA molecules can be more beneficial than incorporating methylations, as suggested by the comparison between EC-NetGSA and EM-NetGSA. Moreover, employing regulatory topologies can improve the discovery of tumorous aberrances, as shown by the comparison of the NetGSA-type approaches (EMC-NetGSA, EC-NetGSA, EM-NetGSA, and NetGSA) with the Gene-Set approach. The deviation of each tumorous condition from the corresponding normal tissue is shown in Fig. 4, represented by the approximate *T* statistic in EMC-NetGSA. Bar height is the calculated statistic for each cancer type. As shown in Fig. 4, our analysis suggests inhibitive BRAF pathway aberrances in KIRP, COAD, KIRC, LIHC, and BRCA with *p* values smaller than 0.05. Such analysis shows the similarities and differences among these cancers in terms of BRAF pathway disturbance, which suggests directions for transferring knowledge from well-studied cancer types to less explored ones.

## 4 Simulations

*y* is

We simulate *n* observations from the control group and *n* observations from the treatment group, and consider these seven gene sets:

- (1) all genes in the network;
- (2) top one-third levels of the tree;
- (3) top two-third levels of the tree;
- (4) the last level of the tree;
- (5) left branch of the tree (including the root);
- (6) right branch of the tree (excluding the root);
- (7) 20% of the genes in the network selected randomly.

The four mean scenarios are:

- (1) \(\mu _{1}^\mathrm{T}=\mu _{1}^\mathrm{C}=0,\, \mu _{2}^\mathrm{T}=\mu _{2}^\mathrm{C}=0,\, \mu _{3}^\mathrm{T}=\mu _{3}^\mathrm{C}=0;\)
- (2) \(\mu _{1}^\mathrm{T}=0.25,\, \mu _{2}^\mathrm{T}=1,\, \mu _{3}^\mathrm{T}=0.5\) for top one-third levels, same as scenario 1 for the rest of the tree;
- (3) \(\mu _{1}^\mathrm{T}=0.25,\, \mu _{2}^\mathrm{T}=1,\, \mu _{3}^\mathrm{T}=0.5\) for top two-third levels, same as scenario 1 for the rest of the tree;
- (4) \(\mu _{1}^\mathrm{T}=0.25,\, \mu _{2}^\mathrm{T}=1, \,\mu _{3}^\mathrm{T}=0.5\) for left branch of tree (including the root), same as scenario 1 for the rest of the tree.
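The seven gene sets listed above can be made concrete with a small sketch. The text does not give the size or exact shape of the simulated tree, so everything here is an assumption for illustration: a complete binary tree with six levels, nodes stored in heap order (node 0 is the root, the children of node i are 2i+1 and 2i+2), and a fixed random seed for gene set 7. The function name `build_gene_sets` is hypothetical.

```python
import random


def build_gene_sets(levels=6, seed=0):
    """Seven gene sets on a complete binary tree (assumed layout, heap order)."""
    n_nodes = 2 ** levels - 1
    nodes = range(n_nodes)

    def level(i):
        # Level 0 is the root; node i sits at floor(log2(i + 1)).
        return (i + 1).bit_length() - 1

    def subtree(root):
        # All nodes at or below `root`, collected with an explicit stack.
        out, stack = [], [root]
        while stack:
            v = stack.pop()
            if v < n_nodes:
                out.append(v)
                stack.extend((2 * v + 1, 2 * v + 2))
        return sorted(out)

    rng = random.Random(seed)
    return {
        1: list(nodes),                                        # all genes
        2: [i for i in nodes if level(i) < levels // 3],       # top one-third levels
        3: [i for i in nodes if level(i) < 2 * levels // 3],   # top two-third levels
        4: [i for i in nodes if level(i) == levels - 1],       # last level
        5: [0] + subtree(1),                                   # left branch plus root
        6: subtree(2),                                         # right branch, no root
        7: sorted(rng.sample(range(n_nodes), round(0.2 * n_nodes))),  # random 20%
    }


GENE_SETS = build_gene_sets()
for key, members in GENE_SETS.items():
    print(key, len(members))
```

In this layout gene sets 5 and 6 partition the tree, so mean scenario 4, which shifts only the left branch, leaves gene set 6 without any shifted member.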

A null hypothesis is rejected when its *p* value [1] is less than 0.05. The power then represents the proportion of simulations in which the null hypothesis is rejected. We consider gene-set scenarios where all members in a gene set are differential, some but not all members are differential, and none of the members are differential. For the first scenario, we expect the power to be close to 1. For the second scenario, we expect nonzero powers, ideally approaching 1. For the third scenario, we expect the power to equal 0. We summarize our results in Fig. 6 and Table 2 in the Appendix.

The first mean scenario corresponds to \(\varLambda ^\mathrm{C} \mu ^\mathrm{C} =\varLambda ^\mathrm{T} \mu ^\mathrm{T},\) i.e., the null hypothesis holds, and all the methods have rejection rates at the nominal level. In the second mean scenario, the top one-third levels of the tree are differential. Therefore, it is natural to expect significance for the overall expression levels of gene sets 1–3, 5, and 6, which contain nodes in the top one-third levels of the tree, and to expect that gene set 4 is not significant. Gene set 7 has a chance to contain differentially expressed genes, so a certain level of significance is expected for it. The power patterns of EMC-NetGSA, EC-NetGSA, EM-NetGSA, and NetGSA are consistent with these expectations in all cases. Moreover, incorporating more true regulations is beneficial to the analysis, whereas the Gene-Set approach does not show the expected powers for gene sets 1 and 4–7. Mean scenarios 3 and 4 can be analyzed similarly.

Overall, in the simulation studies, the Gene-Set approach has low accuracy: it sometimes misses differentially expressed sets (gene set 2, scenario 3) or falsely identifies equal means as differentially expressed (gene set 4, method 3). The NetGSA methods have high power for all gene sets that are fully or partially differentially expressed. Gene set 4 in mean scenarios 1–3 is not differentially expressed, and the four NetGSA methods correctly indicate this with low power (low false-positive rates). Gene set 6 in mean scenario 4 is also not differentially expressed, and the four NetGSA methods reflect this with low, largely comparable power. The three integrative methods (EMC-NetGSA, EC-NetGSA, EM-NetGSA) have slightly higher power values here (a higher rate of false positives), presumably because of the increased noise introduced by the network expansion.
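The rejection rule and the empirical power described above can be sketched as follows. This assumes that the cited procedure [1] is applied as a Benjamini–Hochberg adjustment across the gene sets within each simulated dataset, which the text does not state explicitly; the function names and the toy *p* values are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def empirical_power(adjusted_by_simulation, set_index, alpha=0.05):
    """Proportion of simulations in which the given gene set is rejected."""
    hits = sum(1 for adj in adjusted_by_simulation if adj[set_index] < alpha)
    return hits / len(adjusted_by_simulation)


# Toy input: raw p values for two gene sets in four simulated datasets.
raw = [[0.005, 0.50], [0.20, 0.60], [0.01, 0.004], [0.02, 0.90]]
adjusted = [bh_adjust(p) for p in raw]
print([empirical_power(adjusted, k) for k in range(2)])
```

For a gene set with no differential members, the same proportion estimates the false-positive rate rather than the power, which is how the low values for gene set 4 are read in the text.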

Furthermore, real datasets commonly have unbalanced sample sizes between the normal and diseased groups, as the TCGA sample sizes in Table 1 show. Thus, we also performed simulations with unbalanced sample sizes: one case with \(n_\mathrm{c}=50\) and \(n_\mathrm{t}=500,\) and another with \(n_\mathrm{c}=10\) and \(n_\mathrm{t}=500.\) We summarize the results in the Appendix; see Fig. 7 and Table 3 for the case of \(n_\mathrm{c}=50,\) and Fig. 8 and Table 4 for the case of \(n_\mathrm{c}=10.\) As expected, unbalanced sample sizes can affect the power of the test. Moreover, as the sample sizes become more severely unbalanced, the power to detect gene sets containing differentially expressed genes decreases. Nevertheless, we still observe an advantage from incorporating the network topologies and additional omic data into the gene-set significance analyses.

## 5 Discussion

In this paper, we have investigated the effects of incorporating multiple omics and network topology on pathway differential analysis through simulations and the dissection of BRAF pathway aberrancy among 11 cancer types. We found that adding regulatory elements such as CNVs and/or methylations to the baseline mRNA molecules can improve the statistical power of discovering tumorous aberrances. Incorporating CNVs with the baseline mRNA molecules is likely to be more beneficial than incorporating methylations. Moreover, employing regulatory topologies can dramatically improve the discovery of tumorous aberrances compared with the gene set analysis approach.

Moreover, by integrating CNVs, methylations, gene expression, and pathway topologies, our analysis reveals similarities and differences among 11 cancer types based on disturbance of the BRAF pathway. Such analysis can be easily adapted to other important oncogenic pathways. Our analysis of the BRAF pathway serves as a pilot study, which suggests a promising systems biology approach to pan-cancer analysis by exploring large-scale biological networks. We used CNVs, methylations, and mRNAs as an illustration of integrative analysis. For future work, it is important to consider other types of regulatory elements, such as proteins. We expect that a comprehensive integration of diverse regulatory elements will provide interesting and important insights into cancer biology.

## Notes

### Acknowledgements

We thank all the members of the Statistical and Applied Mathematical Sciences Institute (SAMSI) Data Integration: TCGA Working Group as part of the SAMSI Beyond Bioinformatics Program. We are grateful for the support of Dr. Sujit Ghosh at SAMSI. This research was partially supported by the InCHIP Faculty Affiliate Seed Grant at UConn (to YZ), Faculty Research Excellence Program Award at UConn (to YZ), the CICATS PreK Career Development Award at UConn (to YZ), and the Research Starter Grant in Informatics from PhRMA Foundation (to ZO). VB was partially supported by the following grants: NIH Grants R01 CA160736, R01CA194391, P30 CA016672, and NSF DMS 1463233 and the National Institutes of Health (NIH) Grants R01 GM59507 (to HZ), P01 CA154295 (to HZ), and P30 CA016359 (to HZ).

## References

- 1. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
- 2. Efron B, Tibshirani R (2007) On testing the significance of sets of genes. Ann Appl Stat 1:107–129
- 3. Fallahi-Sichani M, Moerke NJ, Niepel M, Zhang T, Gray NS, Sorger PK (2015) Systematic analysis of BRAFV600E melanomas reveals a role for JNK/c-Jun pathway in adaptive resistance to drug-induced apoptosis. Mol Syst Biol 11(3):797
- 4. Hyman DM, Puzanov I, Subbiah V, Faris JE, Chau I, Blay JY, Wolf J, Raje NS, Diamond EL, Hollebecque A et al (2015) Vemurafenib in multiple nonmelanoma cancers with BRAF V600 mutations. N Engl J Med 373(8):726–736
- 5. Liu L, Ruan J (2013) Network-based pathway enrichment analysis. In: 2013 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, p 218–221
- 6. Liu Q, Dinu I, Adewale AJ, Potter JD, Yasui Y (2007) Comparative evaluation of gene-set analysis methods. BMC Bioinform 8(1):431
- 7. Ma J, Shojaie A, Michailidis G (2016) Network-based pathway enrichment analysis with incomplete network information. Bioinformatics 32(20):3165–3174
- 8. Maciejewski H (2013) Gene set analysis methods: statistical models and methodological differences. Brief Bioinform. doi: 10.1093/bib/bbt002
- 9. Shojaie A, Michailidis G (2009) Analysis of gene sets based on the underlying regulatory network. J Comput Biol 16(3):407–426
- 10. Shojaie A, Michailidis G (2010) Network enrichment analysis in complex experiments. Stat Appl Genet Mol Biol. doi: 10.2202/1544-6115.1483
- 11. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102(43):15545–15550
- 12. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 102(38):13544–13549
- 13. Tomczak K, Czerwińska P, Wiznerowicz M (2015) The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol 19(1A):A68
- 14. Wu D, Lim E, Vaillant F, Asselin-Labat ML, Visvader JE, Smyth GK (2010) ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics 26(17):2176–2182