Copy number variation is highly correlated with differential gene expression: a pan-cancer study
Cancer is a heterogeneous disease with many genetic variations. Several lines of evidence have shown that copy number variations (CNVs) of certain genes are involved in the development and progression of many cancers by altering gene expression levels in individual or several cancer types. However, it is not clear whether this correlation is a general phenomenon across multiple cancer types.
In this study we applied a bioinformatics approach integrating CNV and differential gene expression mathematically across 1025 cell lines and 9159 patient samples to detect their potential relationship.
Our results showed a close correlation between CNV and differential gene expression. For the majority of genes, copy number displayed a positive linear influence on gene expression, indicating that genetic variation has a direct effect on the transcriptional level. An independent dataset was used to revalidate the relationship between copy number and expression level. Further analysis showed that genes with a generally positive linear influence of copy number on expression are clustered in certain disease-related pathways, which suggests the involvement of CNV in the pathophysiology of diseases.
This study shows a close correlation between CNV and differential gene expression, revealing the qualitative relationship between genetic variation and its downstream effect, especially for oncogenes and tumor suppressor genes. Elucidating the relationship between copy number variation and gene expression is of critical importance for the prevention, diagnosis and treatment of cancer.
Keywords: Copy number variation; Differential gene expression; Concordance; Pan-cancer
AUGs: Copy number amplified and expression level upregulated genes
BRCA: Breast invasive carcinoma
CCLE: Cancer cell line encyclopedia
CCLP: COSMIC cell lines project
CNV: Copy number variation
DAVID: Database for annotation, visualization and integrated discovery
DDGs: Copy number deleted and expression level downregulated genes
HNSC: Head and neck squamous cell carcinoma
LIHC: Liver hepatocellular carcinoma
NSCLC: Non-small cell lung cancer
OV: Ovarian serous cystadenocarcinoma
SNP: Single nucleotide polymorphism
STRING: Search tool for the retrieval of interacting genes
TCGA: The cancer genome atlas
Genetic structural variation in the human genome can be present in many forms, ranging from single nucleotide polymorphisms (SNPs) to large chromosomal aberrations. In the past, SNPs were regarded as the predominant form of structural variation and were thought to account for much phenotypic variation [2, 3]. However, recent studies have shown the widespread existence of copy number variation (CNV) in individuals, and these observations have since been widely appreciated and extended [4, 5, 6]. In general, CNV is defined as a gain or loss in the number of copies of a DNA segment that is 1 kb or larger in the human genome [1, 4, 5], and it accounts for an important part of genetic structural variation. Great efforts in the scientific community are currently directed at cataloguing and characterizing somatic CNV in a comprehensive manner [7, 8], which provides key knowledge on how CNVs impact biological function, evolution and human diseases at the genomic level.
It is generally accepted that somatic CNV is highly associated with the development and progression of numerous cancers through its impact on gene expression levels [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. Samulin Erdem et al. found that the Neurofascin (NFASC) gene is significantly amplified and overexpressed in non-small cell lung cancer (NSCLC) patients, and identified a novel role of NFASC in the regulation of cell motility and NSCLC migration. Dong et al. analyzed copy number alterations and differentially transcribed genes in esophageal cancer and observed a noteworthy association between CNV and differential gene expression for FAM60A, TFDP1, CDC25B and MCM2. Subsequently, FAM60A was identified as a potential prognostic factor with a striking correlation to overall survival and clinical-pathological parameters. These lines of evidence suggest that differential gene expression might be a vital intermediate mechanism through which CNV exerts its effect on the downstream phenotype.
Although a number of studies have explored CNV and differential gene expression of several classical oncogenes or tumor suppressor genes in different cancers [12, 14, 17, 18, 20], there has been no systematic study of the relationship between CNV and differential gene expression across a broader spectrum of cancer types and cell lines. It is unclear to what extent expression level is affected by CNV across the whole genome. Previous observations from a single gene or a single cancer type may not be representative of other genes or other types of cancer. Here we aimed to systematically investigate the relationship between somatic CNV and differential gene expression across cell lines and different cancer types for known genes. This study may help us better understand the correlation between CNV and differential gene expression and provide new insights into the mechanisms of cancer development and progression.
The copy number and mRNA expression data of the Broad-Novartis Cancer Cell Line Encyclopedia (CCLE), NCI-60 and The Cancer Genome Atlas (TCGA) were collected from the cBioPortal for Cancer Genomics (http://www.cbioportal.org/). The CCLE and NCI-60 cell line datasets contained 966 and 59 cell lines, respectively, and the TCGA datasets contained 9159 samples from 31 types of cancer (see Additional file 2: Table S1). Putative copy number calls were determined using GISTIC 2.0, while expression levels for TCGA were quantified by RSEM from RNA-Seq data. Another independent dataset containing 1020 cell lines was curated from the COSMIC Cell Lines Project (CCLP, v81, http://cancer.sanger.ac.uk/cell_lines). For this dataset, copy number was obtained from Affymetrix SNP6.0 array data with PICNIC.
Definition of four variation tendencies
For the CCLE, NCI-60 and TCGA datasets, a gene with a high level amplification or a homozygous deletion was regarded as a copy number amplified or copy number deleted gene, respectively (copy number values: −2 = homozygous deletion; −1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification). For CCLP, copy number amplification was called by the following criteria: (average genome ploidy ≤ 2.7 AND total DNA segment copy number ≥ 5) OR (average genome ploidy > 2.7 AND total DNA segment copy number ≥ 9). The criteria for copy number deletion were: (average genome ploidy ≤ 2.7 AND total DNA segment copy number = 0) OR (average genome ploidy > 2.7 AND total DNA segment copy number < (average genome ploidy − 2.7)). Gene expression levels were quantified by RSEM from RNA-Seq data, and mRNA Z scores were computed using the tumor samples that are diploid for the corresponding gene. For each gene, Z score = (x − μ)/σ, where μ and σ represent the average expression and the standard deviation of this gene across samples, respectively, and x represents the expression of this gene in a specific sample. Differentially expressed genes (DEGs) were defined as genes with Z scores greater than 2 (upregulated genes) or less than −2 (downregulated genes). Thus, the four variation tendencies were defined as follows: amplified AND upregulated; amplified AND downregulated; deleted AND upregulated; deleted AND downregulated.
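The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch and not the code used in the study: the function names are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator was used), and taking the diploid samples as the reference set for the mean and standard deviation is one reading of the Z score description.

```python
from statistics import mean, stdev

def gistic_copy_call(value):
    """Map a GISTIC-style discrete value (-2..2) to a copy number call."""
    if value == 2:
        return "amplified"   # high level amplification
    if value == -2:
        return "deleted"     # homozygous deletion
    return "neutral"         # -1, 0 and 1 are not counted as aberrant here

def cclp_copy_call(ploidy, total_cn):
    """Apply the CCLP thresholds quoted in the text."""
    if (ploidy <= 2.7 and total_cn >= 5) or (ploidy > 2.7 and total_cn >= 9):
        return "amplified"
    if (ploidy <= 2.7 and total_cn == 0) or (ploidy > 2.7 and total_cn < ploidy - 2.7):
        return "deleted"
    return "neutral"

def z_scores(expression, reference):
    """Z = (x - mu) / sigma, with mu and sigma taken from the reference samples.

    Assumption: `reference` holds the expression values of the samples that
    are diploid for this gene, and sigma is the sample standard deviation.
    """
    mu = mean(reference)
    sigma = stdev(reference)
    return [(x - mu) / sigma for x in expression]

def tendency(copy_call, z, cutoff=2.0):
    """Return one of the four variation tendencies, or None if not applicable."""
    if copy_call == "neutral" or abs(z) <= cutoff:
        return None
    direction = "upregulated" if z > cutoff else "downregulated"
    return f"{copy_call} AND {direction}"
```

For example, a gene with a GISTIC value of 2 and a Z score of 6.0 in a given sample would be labeled "amplified AND upregulated", whereas the same copy number call with a Z score of 1.0 would not be counted in any of the four tendencies.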
Identification of amplified and upregulated genes (AUGs) and deleted and downregulated (DDGs)
AUGs and DDGs were identified for each gene across the 9159 tumor samples. A gene was regarded as an AUG if its ratio of copy number amplification and expression level upregulation against copy number deletion and expression level downregulation was more than 50%, and if it had a higher ρ (> 0.4) and a higher number (> 146.5) of copy number amplification and expression level upregulation events than the median level of the 30 most popular oncogenes (Additional file 2: Table S4). A gene was regarded as a DDG if its ratio of copy number deletion and expression level downregulation against copy number amplification and expression level upregulation was more than 50%, and if it had a higher ρ (> 0.41) and a higher number (> 18.5) of copy number deletion and expression level downregulation events than the median level of 10 tumor suppressor genes (Additional file 2: Table S5). AUGs and DDGs were sorted by the number of copy number amplification and expression level upregulation events and of copy number deletion and expression level downregulation events, respectively, and 31 representative AUGs and 29 representative DDGs matched with KEGG genes were identified from the top 100 highly concordant genes.
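The selection rule can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: the function name and the gene names in the usage example are hypothetical, and the "ratio" is interpreted here as the share of concordant events of one kind among all concordant events, n_AU / (n_AU + n_DD), which is one plausible reading of the text.

```python
def classify_gene(n_au, n_dd, rho):
    """Label a gene as 'AUG', 'DDG' or None from its event counts and Spearman's rho.

    n_au: samples with amplification AND upregulation for this gene
    n_dd: samples with deletion AND downregulation for this gene
    Thresholds (0.4 / 146.5 and 0.41 / 18.5) are the medians quoted in the text.
    """
    total = n_au + n_dd
    if total == 0:
        return None
    # Assumption: "ratio ... more than 50%" means the share among concordant events.
    if n_au / total > 0.5 and rho > 0.4 and n_au > 146.5:
        return "AUG"
    if n_dd / total > 0.5 and rho > 0.41 and n_dd > 18.5:
        return "DDG"
    return None

def rank_genes(genes):
    """Sort (name, n_au, n_dd, rho) tuples into ranked AUG and DDG lists."""
    augs = [g for g in genes if classify_gene(g[1], g[2], g[3]) == "AUG"]
    ddgs = [g for g in genes if classify_gene(g[1], g[2], g[3]) == "DDG"]
    augs.sort(key=lambda g: g[1], reverse=True)  # most A&U events first
    ddgs.sort(key=lambda g: g[2], reverse=True)  # most D&D events first
    return [g[0] for g in augs], [g[0] for g in ddgs]

# Hypothetical counts, for illustration only.
example = [
    ("GENE_A", 200, 10, 0.60),
    ("GENE_B", 5, 40, 0.50),
    ("GENE_C", 300, 2, 0.70),
    ("GENE_D", 200, 10, 0.30),  # fails the rho threshold
]
aug_names, ddg_names = rank_genes(example)
```

With these made-up counts, GENE_C and GENE_A are returned as AUGs in that order, GENE_B as a DDG, and GENE_D is excluded because its ρ is below the threshold.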
PPI network construction and functional enrichments
Search Tool for the Retrieval of Interacting Genes (STRING, http://www.string-db.org/) was used to construct the PPI network of FYTTD1. The line thickness indicates the strength of data support from the sources of text mining and experiments with a cutoff value of medium confidence (0.4). Then the functional enrichment results of KEGG pathways and Gene Ontology (GO) biological process were applied to the PPI network with false discovery rate less than 0.05.
R (version 3.4.2), with the R packages data.table for data cleaning and management, survival for survival analysis and ggplot2 for data visualization, and GraphPad Prism 7.0 were used for the statistical analysis. Spearman's ρ equals the Pearson correlation between the rank values of two variables and assesses how well the relationship between the two variables can be described by a monotonic function; it was calculated with the cor() function in R. Differences between two groups were assessed using Welch's t-test (significant at p < 0.05).
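To make the two statistics concrete, the following Python sketch computes Spearman's ρ as the Pearson correlation of average ranks, and the Welch t statistic with its Welch–Satterthwaite degrees of freedom. The study itself used R; this standard-library version is only an illustration, all function names are hypothetical, and the p value (which requires the t distribution) is deliberately left out.

```python
from math import sqrt
from statistics import mean, variance

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation between the rank values."""
    return pearson(average_ranks(x), average_ranks(y))

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Because ρ depends only on ranks, a gene whose expression rises monotonically with copy number (for instance from −2 to 2) gets ρ = 1 even when the increase is far from linear, which is why the text reports ρ alongside Pearson's r of the linear fit.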
Differential gene expression is highly associated with CNV across multiple cancer types and cancer cell lines
Most genes’ expression changes significantly correlated with their CNVs
Correlation validation of genes with a significant correlation between CNV and differential gene expression in literature
Interestingly, our analysis showed that a fraction of genes had a low Pearson's r of fitting and a low ρ (Additional file 1: Figure S3D; Figure S3E), suggesting stable expression of these genes despite CNV. To gain further insight into those genes (the top 1000 genes ranked by Pearson's r of fitting from the lowest to the highest), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was carried out using the Database for Annotation, Visualization and Integrated Discovery (DAVID). The results demonstrated a strong enrichment of retinol metabolism, olfactory transduction, calcium signaling pathway, neuroactive ligand-receptor interaction, etc. (Additional file 2: Table S2; Table S3). Moreover, for a total of 16,639 genes shared between the cell line and TCGA datasets, a high level of agreement on ρ was observed (Fig. 2c), and cell lines and tumors appeared to follow a similar trend. Of the shared genes, 86.42% showed a positive ρ in both the cell line and TCGA datasets, while genes with a negative ρ in both accounted for only 1.63%.
Identification of genes with high concordance between CNV and differential gene expression
Gene names in the literature were converted into official symbols. ρ: Spearman's correlation coefficient.
Variation tendency revalidation of genes with a significant correlation between CNV and differential gene expression in literature
Gene names in literature were transformed into official symbols. A&U represents the frequency of a gene with copy number amplified and expression level upregulated across 31 cancer types in TCGA, while D&D represents the frequency of a gene with copy number deleted and expression level downregulated. ↑ means the gene with copy number amplification and expression level upregulation in tumor, AUG; ↓ means the gene with copy number deletion and expression level downregulation in tumor, DDG.
Additionally, we analyzed the distribution of AUGs and DDGs across the 22 autosomes (Additional file 2: Table S7). The largest proportion of highly concordant AUGs and DDGs (21.62%) was located on chromosome 8, followed by chromosome 1 (16.54%), which is consistent with the previous finding published in Nature that chromosome 1 possesses the largest number of genes with a strong relevance to numerous diseases such as cancer, Alzheimer's disease and Parkinson's disease. As expected, the smallest chromosome, chromosome 21, had the lowest percentage of highly concordant genes (0.11%).
AUGs and DDGs involved in proteolysis dysfunction in tumor
Nine highly concordant genes of 5 AUGs and 1 DDG show the strong association with clinical overall survival (OS) in corresponding cancer patients
Validation studies of correlation between CNV and differential gene expression on another independent dataset, the COSMIC cell lines project (CCLP)
In this study, we provided strong evidence to support the high correlation between CNV and differential gene expression. This finding reveals the qualitative relationship between genetic variation and its downstream effect, especially for oncogenes and tumor suppressor genes, which is of critical importance for the prevention, diagnosis and treatment of cancer. First, integrated analysis of CNV and differential gene expression in CCLE, NCI-60 and TCGA revealed a positive association between copy number and expression level, with a high Pearson's r of fitting, positive ρ and significant p values. Moreover, in both cell lines and patients, genes with copy number amplification showed strikingly higher expression levels than genes with copy number deletion. Second, we investigated the relationship between CNV and differential gene expression for every gene across 9159 tumor samples and 1025 cell lines, respectively. Our results showed that for the majority of genes copy number displayed a positive linear influence on gene expression, indicating that genetic variation has a direct effect on the transcriptional level. In addition, we validated 10 genes reported in the literature to have a significant correlation between CNV and differential gene expression (Table 1). A strong correlation was confirmed by ρ or Pearson's r of fitting for 9 genes; the evidence was weak only for NFASC, possibly due to differences in analytical method.
A recent study by the GTEx consortium associated genetic variants with gene expression levels across 44 healthy human tissues and found, based on eQTL analysis, that gene expression levels are affected by local genetic variation for most genes. Meanwhile, it was reported that copy number and expression levels had a strong positive correlation for 99% of abundantly expressed human genes, based on the integration of predicted copy number and corrected expression levels from 77,840 expression profiles. Moreover, it has been widely reported in the literature that copy number is remarkably correlated with protein expression, for example for FGFR1, HER2 [37, 38], MET, FADD and EGFR. Messenger RNA, as the intermediate between genes and functional proteins, plays a vital role in protein production. Thus, given the high correlation and concordance between CNV and differential gene expression across the cell line and TCGA datasets, and between CNV and differential protein expression of these five genes in the literature (Additional file 2: Table S9, Table S10), we speculated that gene expression might be correlated with protein expression as well. Notably, copy number amplification of FGFR1 (fibroblast growth factor receptor 1) has been found to be strikingly correlated with both FGFR1 gene upregulation and FGFR1 protein upregulation in tumor samples [18, 36]. This indicates that protein dysregulation might be attributable to the original copy number aberration through concordant differential gene expression.
However, for a fraction of genes the expression level was unrelated to copy number and remained stable across various copy numbers. Based on the significant KEGG pathway enrichment of these genes, including retinol metabolism, olfactory transduction, calcium signaling pathway, neuroactive ligand-receptor interaction, etc. (Additional file 2: Table S2 and Table S3), we think they might be involved in the maintenance of basal cellular functions such as metabolism and signal transduction. In addition, it has been well documented that 24% of the 575 housekeeping (HK) genes encode metabolic proteins and 19% encode RNA-interacting proteins. We therefore examined all small nucleolar RNA genes and found that most were indeed expressed very stably regardless of CNV (Additional file 2: Table S11 and S12). Third, our results revealed that highly discordant genes (copy number amplification with expression level downregulation, or copy number deletion with expression level upregulation) were rare (Additional file 1: Figure S4), which indicates that copy number amplification rarely causes downregulation of gene expression and copy number deletion rarely promotes upregulation. Furthermore, among the highly concordant genes, the frequency of copy number amplification with expression level upregulation clearly exceeded that of copy number deletion with expression level downregulation (Fig. 3b), possibly as a result of selection against deletions, whereas the selective pressures on amplifications are unknown. We also revalidated the ten highly concordant genes from the literature (9 AUGs, 1 DDG; Table 2), and the results were highly consistent with the variation trends reported in the literature.
Note that although the sample sizes of CCLE, NCI-60 and the 31 cancers in TCGA differed (Additional file 2: Table S1), they still showed a similar tendency in the association between CNV and differential gene expression (Fig. 1a; Additional file 1: Figure S1 and Figure S2). Moreover, we observed a high level of agreement between the cell line and TCGA datasets, which showed a consistent distribution of genes in Fig. 2a and Additional file 1: Fig. 3a, including the ρ for the 16,639 shared genes (Fig. 2c) and a comparable Pearson's r of fitting (Fig. 2b; Additional file 1: Figure S3C and Figure S3D). Our results suggest that this phenomenon is well conserved in cell lines and tissues.
In total, we identified 925 highly concordant genes, including 560 AUGs and 365 DDGs. For example, numerous studies have reported that DERL1 overexpression is significantly related to cancer cell proliferation, invasion [42, 43] and poor prognosis, which might be driven by copy number amplification, since DERL1 showed predominantly copy number amplification with expression level upregulation in many cancers (Fig. 4a). Given the increasing number of studies on CNV-driven differentially expressed genes (DEGs) [16, 44, 45], such genes might broaden our insights into the mechanisms of tumorigenesis, migration, resistance, poor prognosis, etc. In our study, a large proportion of AUGs were associated with metabolic pathways, especially oxidative phosphorylation and GPI-anchor biosynthesis (Fig. 5a), which suggests a gain of function of metabolism-related proteins in tumors to provide more energy for cancer cells [46, 47, 48]. In contrast, DDGs were significantly related to ubiquitin mediated proteolysis and the Wnt signaling pathway (Fig. 5b), whose dysfunction tends to lead to tumorigenesis, metastasis, resistance, etc. With respect to Wnt signaling, loss of function of DDGs such as the inhibitory SMAD4 and APC would be expected to enhance Wnt signaling and lead to tumorigenesis [50, 51, 52, 53, 54, 55, 56, 57], while attenuated ubiquitin mediated proteolysis facilitates proliferation. Among the highly concordant genes, we found 10 with a strong relation to patient overall survival, including 5 AUGs and 1 DDG (Table 3), of which FYTTD1 has rarely been reported to be associated with cancer. By further integrated analysis of CNV and differential gene expression of FYTTD1 in ESCA patients, we observed that 24.73% of patients showed a high level of copy number amplification, with a median Z score of 4.15, which means that FYTTD1 was strikingly overexpressed (Additional file 1: Figure S6). Patients with both amplification and upregulation (AU) made up the overwhelming majority of these 48 copy number amplified patients (84.44%), indicating a global effect of CNV on FYTTD1 gene expression; FYTTD1 may therefore be a potential driver gene or prognostic marker in ESCA. Thus, highly concordant AUGs and DDGs may provide new insights into the development and progression of cancer.
Additionally, we used another independent dataset (CCLP) to revalidate the relationship between CNV and differential gene expression. Although CCLP applied a different algorithm to calculate copy number variation, it also showed a positive correlation between copy number and expression level (Fig. 6a). Our results demonstrated that the expression levels of copy number amplified genes substantially surpassed those of copy number deleted genes (Fig. 6b). In addition, discordant events (copy number amplification with expression level downregulation, or copy number deletion with expression level upregulation) made up the smallest share of copy number aberrant counts (1%). Concordantly, most genes showed an overwhelming level of either copy number amplification with expression level upregulation or copy number deletion with expression level downregulation (93%, ratio > 0.9), and hardly any genes showed high levels of both (Fig. 6c).
In conclusion, this study demonstrated a close correlation between CNV and differential gene expression. Moreover, this trend is consistent across cell lines and patient samples. For the majority of genes, copy number shows a positive linear influence on gene expression (copy number amplification with expression level upregulation, copy number deletion with expression level downregulation), while copy number amplification rarely causes downregulation of gene expression and copy number deletion rarely promotes upregulation. Furthermore, both AUGs and DDGs are remarkably enriched in the ubiquitin-proteasome system. In addition, we found that amplification and overexpression of FYTTD1 are highly related to poor prognosis in ESCA, so FYTTD1 may be a potential prognostic marker in ESCA. However, more in-depth studies are needed to further reveal the molecular mechanisms linking CNV and differential gene expression. Overall, elucidating the relationship between copy number variation and gene expression is of critical importance for the prevention, diagnosis and treatment of cancer.
Conceived and designed the experiments: XF DX XS. Collected and analyzed the data: XS NL. Wrote the paper: XS NL JL1 JL2 RX NA XF. All authors read and approved the final manuscript.
This work was supported by the National Natural Science Foundation of China (No. 81774153) and the National Youth Top-notch Talent Support Program (W02070098). The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 9. Yang L, Wang YZ, Zhu HH, Chang Y, Li LD, Chen WM, et al. PRAME gene copy number variation is related to its expression in multiple myeloma. DNA Cell Biol. 2017.
- 14. Kuzyk A, Booth S, Righolt C, Mathur S, Gartner J, Mai S. MYCN overexpression is associated with unbalanced copy number gain, altered nuclear location, and overexpression of chromosome arm 17q genes in neuroblastoma tumors and cell lines. Genes Chromosomes Cancer. 2015;54:616–28.
- 17. Budczies J, Bockmayr M, Denkert C, Klauschen F, Groschel S, Darb-Esfahani S, et al. Pan-cancer analysis of copy number changes in programmed death-ligand 1 (PD-L1, CD274) - associations with gene expression, mutational load, and survival. Genes Chromosomes Cancer. 2016;55:626–39.
- 19. Zhao N, Wilkerson MD, Shah U, Yin X, Wang A, Hayward MC, et al. Alterations of LKB1 and KRAS and risk of brain metastasis: comprehensive characterization by mutation analysis, copy number, and gene expression in non-small-cell lung carcinoma. Lung Cancer. 2014;86:255–61.
- 33. Roeten MSF, Cloos J, Jansen G. Positioning of proteasome inhibitors in therapy of solid malignancies. Cancer Chemother Pharmacol. 2017.
- 38. Lee MJ, Kim N, Choung HK, Choe JY, Khwarg SI, Kim JE. Increased gene copy number of HER2 and concordant protein overexpression found in a subset of eyelid sebaceous gland carcinoma indicate HER2 as a potential therapeutic target. J Cancer Res Clin Oncol. 2016;142:125–33.
- 47. Iommarini L, Ghelli A, Gasparre G, Porcelli AM. Mitochondrial metabolism and energy sensing in tumor progression. Biochim Biophys Acta. 2017;1858:582–90.
- 49. Puram SV, Tirosh I, Parikh AS, Patel AP, Yizhak K, Gillespie S, et al. Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer. Cell. 2017.
- 53. Ahmed S, Bradshaw AD, Gera S, Dewan MZ, Xu R. The TGF-beta/Smad4 signaling pathway in pancreatic carcinogenesis and its clinical significance. J Clin Med. 2017;6:5.
- 58. Mofers A, Pellegrini P, Linder S, D'Arcy P. Proteasome-associated deubiquitinases and cancer. Cancer Metastasis Rev. 2017.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.