Background

Glioma is a common type of primary central nervous system (CNS) tumor which arises from glial cells [1]. Following the World Health Organization (WHO) classification in 2007, gliomas can be subdivided into grade II, grade III, and grade IV (glioblastoma multiforme, GBM), depending on the degree of aggressiveness [2, 3]. In “The Cancer Genome Atlas” (TCGA) database, grade II and III are classified as lower-grade glioma (LGG), and grade IV as GBM. Despite developments in therapies that include surgical resection, chemotherapy, and radiotherapy, the median survival and prognosis remain poor, particularly for glioblastoma patients [4, 5]. The median overall survival time (mOS) of GBM is approximately 1.25 years [5, 6], and that of LGG is 6.5–8 years [7, 8]. Thus, it is important to elucidate the survival events of glioma, which could potentially aid in the diagnosis and prognosis of glioma patients.

Patient survival time with regard to tumor progression is associated with the subtype and grade of the tumor [2]. The histological classification of tumor subtypes is important to guide treatment decisions and is often combined with several clinical prognostic features. In neuro-oncological practice, however, no clear national consensus for adult glioma diagnosis has been reached and the diagnosis is subject to interobserver variation [9, 10]; studies of the various types of gliomas that rely on histological information alone are therefore limited. On the other hand, previous studies have shown that gene expression profiling provides an objective method to classify tumors [11, 12] and that gene expression profiles correlate better with prognosis than tumor histology does [13]. Moreover, gene expression profiling may be utilized to predict patients’ prognosis from various points of view [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]. A comparison of the gene lists published from 2004 to 2016 shows that the genes identified by different research groups differ considerably. This observation suggests that the overall survival (OS) of glioma patients is correlated with many kinds of events caused by the expression profiles of multiple genes. Therefore, a comprehensive set of survival-related genes associated with gliomas needs to be extracted so that researchers can carry out further relevant studies. In addition, most previous studies [14, 18,19,20,21,22,23, 25,26,27, 29] utilized only microarray datasets, rather than other kinds of datasets such as next-generation sequencing (NGS) data, to screen the expression profiles of genes, which might introduce unexpected data bias; in general, detecting gene signatures with NGS might be more precise than with array data.

In this study, we aimed to identify common genes correlated with the overall survival of gliomas following the association of their expression profiles and patients’ survival time. Candidate genes were extracted from GBM and LGG study cohorts after analysis of NGS datasets from TCGA and validation by microarray datasets from Gene Expression Omnibus (GEO). Of these survival-related genes, the critical ones, which were potential biomarkers, were further analyzed and filtered, and then used to construct the survival-relevant risk models for clinical application against gliomas.

Methods

Patients and gene expression datasets

Publicly available gene expression datasets of patients with glioma were obtained from TCGA (https://cancergenome.nih.gov/) and the GEO (https://www.ncbi.nlm.nih.gov/geo/). From TCGA projects (TCGA-LGG and TCGA-GBM), level 3 RNA-Seq datasets and their clinical information were used to investigate the relationship between gene expression and patient survival. The microarray datasets (GSE16011, GSE4412, and GSE4271) from the GEO were utilized to confirm and validate the results obtained based on the TCGA datasets. Notably, grades II and III of gliomas were included in the TCGA-LGG project, whereas grade IV was studied in a separate project, namely TCGA-GBM. Following the project definition, patients from GEO datasets could be divided into LGG and GBM categories. The sample sizes of various datasets are summarized in Table 1, and the detailed clinical and histological characteristics of patients are listed in Table 2.

Table 1 Statistics of datasets from TCGA and GEO databases
Table 2 Clinical and histological characteristics of patients with glioma

In this study, NGS datasets were used for the main analysis because of their advantages of low data bias and large sample size. The median OS of patients with gliomas based on their various histological subtypes was estimated using the Kaplan-Meier curve (Fig. 1). The median OS of patients with LGG and those with GBM were determined as approximately 2700 and 450 days, respectively.

Fig. 1 OS curve of various histological subtypes of gliomas (TCGA samples). LGG was divided into three subtypes: astrocytoma, oligoastrocytoma, and oligodendroglioma. X-axis: patients’ OS duration (days); Y-axis: patients’ survival rate
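As an illustration of how a median OS can be read off a Kaplan-Meier estimate, the following is a minimal Python sketch. It is not the code used in this study; the function name `km_median` and the toy follow-up times are hypothetical, and the median is taken as the earliest time at which the estimated survival falls to 0.5 or below.

```python
def km_median(times, events):
    """Kaplan-Meier median survival time, or None if S(t) never reaches 0.5.

    times  : follow-up in days
    events : 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)  # deaths plus censored at t
        if deaths:
            survival *= 1.0 - deaths / at_risk
            if survival <= 0.5:
                return t
        at_risk -= leaving
        i += leaving
    return None

# Hypothetical toy cohort: the third patient is censored at day 300.
toy_median = km_median([100, 200, 300, 400, 500], [1, 1, 0, 1, 1])
```

Censored patients leave the risk set without lowering the survival estimate, which is why the toy median falls on day 400 rather than on day 300.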

The level-3 data (RNA-Seq) obtained from TCGA utilized the fragments per kilobase of transcript per million mapped reads (FPKM) [30] to determine the expression level of genes. The formula for FPKM is as follows:

$$ \mathrm{FPKM}=\frac{\mathrm{total\ fragments}}{\mathrm{mapped\ reads\ (millions)}\times \mathrm{exon\ length\ (kilobase\ pairs)}} $$
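A short worked example may make the units in this formula concrete. The Python sketch below is illustrative only; the function name `fpkm`, its parameter names, and the numbers are hypothetical and are not taken from the TCGA pipeline.

```python
def fpkm(fragments: int, total_mapped_fragments: int, exon_length_bp: int) -> float:
    """FPKM of one gene in one sample, following the formula above."""
    if total_mapped_fragments <= 0 or exon_length_bp <= 0:
        raise ValueError("library size and exon length must be positive")
    millions_mapped = total_mapped_fragments / 1_000_000  # mapped reads (millions)
    kilobases = exon_length_bp / 1_000                    # exon length (kilobase pairs)
    return fragments / (millions_mapped * kilobases)

# Hypothetical gene: 500 fragments, 2,000 bp of exons, 20 million mapped fragments.
example = fpkm(fragments=500, total_mapped_fragments=20_000_000, exon_length_bp=2_000)
print(example)  # 12.5
```

Dividing by both library size and exon length is what makes values comparable between samples and between genes of different lengths.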

After exclusion of genes that were not expressed in all patients, 19,924 genes were eligible for further analysis of the LGG and GBM cohorts from TCGA projects. The microarray gene expression data of the GBM and LGG populations from the GEO database were first quantile-normalized using the R function normalize.quantiles [31].
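The idea behind quantile normalization can be sketched in a few lines of Python. This is a simplified stand-in for the R function named above, not a port of it: the name `quantile_normalize` and the toy values are hypothetical, and tied values are not averaged here.

```python
from statistics import mean

def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists (one list per sample)."""
    n_genes = len(columns[0])
    if any(len(col) != n_genes for col in columns):
        raise ValueError("all samples must cover the same genes")
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_genes)]
    normalized = []
    for col in columns:
        order = sorted(range(n_genes), key=lambda i: col[i])  # gene indices by rank
        out = [0.0] * n_genes
        for rank, gene_idx in enumerate(order):
            out[gene_idx] = reference[rank]  # replace each value by the reference at its rank
        normalized.append(out)
    return normalized

# Hypothetical intensities for four genes in three arrays.
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(samples)
```

After this step every array shares the same set of values, so differences between arrays reflect gene ranks rather than overall intensity scale.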

Analysis workflow of this study

This study was divided into two major parts. In the first part, survival-related genes were identified and their effectiveness in relation to the survival of patients with GBM and LGG was evaluated. In the second part, a representative subset of these genes that could help in differentiating between high- and low-risk patients was identified (Fig. 2). The first part focused on profiling gene signatures that corresponded with the patients’ OS; these genes were termed survival-related genes. The performance evaluation of survival predictors for GBM and LGG indicated that these genes were closely correlated with patient outcomes (survival time). Subsequently, gene set enrichment analysis using various tools was performed on these genes (not shown in Fig. 2). In the second part, we used additional microarray datasets to retain genes whose expression trends between the LGG and GBM groups were consistent with the results of the analysis of the NGS datasets and which might be applicable for classifying patients into risk groups (high-risk patients have shorter OS; low-risk patients have longer OS). The gene set was gradually scaled down through several filtering steps with various criteria; its size after each step is annotated in Fig. 2.

Fig. 2 System workflow. The left-hand panel shows the identification of survival-related genes shared by LGG and GBM. The gene set was scaled down through a series of analysis steps, and the importance of the candidate genes was then assessed through the performance estimation of survival predictors. The right-hand panel shows the extraction of representative survival-relevant biomarkers from these genes, which could be used in clinical practice

Identification of significant survival-related genes

The Cox proportional hazards regression model (“Cox model” hereafter; survival analysis) was used to identify possible factors that might be associated with patients’ OS duration. In this study, univariate Cox regression analysis [32] was performed to assess the expression profiles of genes that might be significantly correlated with the survival time of patients with GBM or LGG. Subsequently, these putative survival-related genes were ranked and filtered by applying stringent criteria (hazard ratio [HR] > 1; Wald test, p < 0.01). Each extracted gene was then analyzed to evaluate whether its expression level differed between patients with shorter and longer survival. Here, the median OS (in days) of each group (GBM or LGG) was set as the time point that separated its patients into shorter and longer survival durations (number of days to death less or more than the median OS), and the expression levels of each gene were compared between the two durations. Student’s t-test (p < 0.05) was used to select statistically significant candidate genes.
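The univariate Cox screen described above can be illustrated with a minimal Python sketch. It is a didactic implementation under stated assumptions, not the software used in this study: one covariate, Breslow handling of tied times, plain Newton-Raphson without step control, and a Wald p-value from the normal approximation. The function name `cox_univariate` and the toy data are hypothetical.

```python
import math

def cox_univariate(times, events, x, n_iter=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x) and return (hazard_ratio, wald_p_value)."""
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored patients only contribute to risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
            w = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(w)
            s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
            s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
            score += x_i - s1 / s0                 # first derivative of the log partial likelihood
            info += s2 / s0 - (s1 / s0) ** 2       # negative second derivative
        if info <= 0.0:
            raise ValueError("no information: constant covariate or no events")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    z = beta * math.sqrt(info)             # Wald statistic: beta / SE, with SE = 1 / sqrt(info)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return math.exp(beta), p

# Hypothetical expression values: higher expression tends to go with earlier death.
toy_times = [5, 8, 12, 15, 20, 25, 30, 40]
toy_events = [1, 1, 1, 1, 1, 0, 1, 0]
toy_expr = [2.1, 1.8, 2.5, 1.2, 1.5, 0.9, 1.9, 0.4]
hr, p_value = cox_univariate(toy_times, toy_events, toy_expr)
keep_gene = hr > 1 and p_value < 0.01  # the filter stated in the text
```

Running such a fit once per gene and keeping genes that pass the filter mirrors the screening step; the subsequent t-test on the median-OS split is a separate comparison.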

Building survival predictive models for patients with GBM and LGG

The predictive model for survival analysis in this study was built using randomForestSRC [33,34,35], a nonparametric machine learning method. Because it combines the results of many survival trees, this model is arguably more objective than other methods. Accordingly, the expression profiles of candidate genes related to survival durations were used to construct survival predictors for GBM and LGG. To assess the performance of the predictors, five-fold cross-validation was repeated 1000 times; in each fold, 80% of the samples were used as the training dataset and the remaining 20% served as the validation dataset. The area under the curve (AUC) values of the receiver operating characteristic (ROC) curves obtained from the 1000 iterations were summarized using a boxplot. This performance was used to gauge the importance of the candidate genes to GBM and LGG.
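The evaluation loop (repeated five-fold cross-validation scored by AUC) can be sketched independently of the predictor. The Python below is illustrative only: it does not train a random survival forest, the helper names `auc` and `repeated_kfold` are hypothetical, and a fixed toy score stands in for model output.

```python
import random
from statistics import mean, stdev

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_kfold(n_samples, k=5, repeats=1000, seed=0):
    """Yield (train_idx, test_idx) for `repeats` reshuffled k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
        for held_out in folds:
            held = set(held_out)
            yield [i for i in idx if i not in held], held_out

# Toy data: one hypothetical feature used directly as the prediction score.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] * 3
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3, 0.95, 0.6] * 3
aucs = []
for train_idx, test_idx in repeated_kfold(len(labels), k=5, repeats=20):
    y = [labels[i] for i in test_idx]
    if len(set(y)) < 2:
        continue  # a fold containing a single class has no defined AUC
    # A real predictor would be fitted on train_idx at this point.
    aucs.append(auc(y, [scores[i] for i in test_idx]))
summary = (mean(aucs), stdev(aucs))
```

Collecting one AUC per held-out fold gives the distribution that a boxplot would summarize.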

Gene set enrichment analysis

Ingenuity pathway analysis (IPA) software (Qiagen), GeneAnalytics [36], and DAVID [37, 38] were used to analyze the biological roles and molecular functions of candidate genes identified from patients with glioma. Survival-related genes common to both LGG and GBM could be useful in revealing shared functions and pathways related to patient survival in both study cohorts.

In this study, multiple gene set enrichment analysis tools were applied to increase the consistency and accuracy of the results. The functions and pathways in which the gene set was involved were retained when they were identified by at least two of the tools.

Gene expression level analysis between GBM and LGG

Survival-related genes whose expression levels differed between the relatively high-risk (GBM) and relatively low-risk (LGG) gliomas were further analyzed as putative biomarkers. A previous study demonstrated that the following five endogenous control genes were not differentially expressed between glioma and normal brain: TBP, IPO8, GAPDH, RPL13A, and SDHA [39]. Therefore, the log2-fold changes in the expression of the survival-related genes relative to those of the control genes were calculated; with p survival-related genes and q control genes, each patient had p × q unique features, each indicating whether the signature of a gene was higher or lower than that of a control gene. For each feature, the percentages of patients with high and low expression were calculated and screened. If a gene’s expression relative to a control gene was high (or low) in over 50% of patients with GBM and low (or high) in over 50% of those with LGG, the log2-fold change value was used as a feature in this study. Such features, with different expression levels between GBM and LGG, were retained as candidate risk descriptors. For instance, the survival-related gene TIMP1, which had high expression in 98% of patients with GBM but low expression in 60% of patients with LGG compared with the reference gene TBP, was retained. In addition to the RNA-Seq datasets (TCGA), three distinct microarray datasets (GEO) were utilized to validate the consistency of the gene signatures in both classes of patients and thereby strengthen the evidence.
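The feature construction and the 50% screening rule can be sketched as follows. This Python example is a hedged illustration: the function names, the pseudocount of 1 added before taking the ratio, the rule that a log2 ratio of exactly 0 counts as low, and all expression values are assumptions not stated in the text.

```python
import math

def log2_ratio_features(expr, target_genes, control_genes, pseudo=1.0):
    """Return {"TARGET/CONTROL": log2 ratio} for one patient (p x q features)."""
    return {
        f"{g}/{c}": math.log2((expr[g] + pseudo) / (expr[c] + pseudo))
        for g in target_genes for c in control_genes
    }

def fraction_high(patients, feature):
    """Share of patients whose target gene exceeds the control (log2 ratio > 0)."""
    return sum(p[feature] > 0 for p in patients) / len(patients)

def keep_feature(gbm, lgg, feature, cutoff=0.5):
    """True when most GBM patients are high and most LGG patients low, or the reverse."""
    g, l = fraction_high(gbm, feature), fraction_high(lgg, feature)
    return (g > cutoff and (1 - l) > cutoff) or ((1 - g) > cutoff and l > cutoff)

# Hypothetical expression values for one target gene and one control gene.
gbm_expr = [{"TIMP1": 300.0, "TBP": 10.0}, {"TIMP1": 150.0, "TBP": 12.0}, {"TIMP1": 8.0, "TBP": 11.0}]
lgg_expr = [{"TIMP1": 5.0, "TBP": 10.0}, {"TIMP1": 20.0, "TBP": 9.0}, {"TIMP1": 4.0, "TBP": 12.0}]
gbm = [log2_ratio_features(e, ["TIMP1"], ["TBP"]) for e in gbm_expr]
lgg = [log2_ratio_features(e, ["TIMP1"], ["TBP"]) for e in lgg_expr]
retained = keep_feature(gbm, lgg, "TIMP1/TBP")
```

With p target genes and q control genes, the dictionary returned for each patient has p × q entries, matching the feature count described above.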

Survival risk relevant genes identification

The median OS of GBM and LGG differed notably, implying that patients with GBM have a shorter survival time (relatively high risk) and those with LGG have a longer survival time (relatively low risk). Under this assumption, survival-related genes were first filtered (using the method described in the previous section) as possible descriptors to classify patients into risk groups. However, the effectiveness of these features still needed to be assessed using a statistical model. A logistic regression model (Y = X1 × β1 + X2 × β2 +  …  + Xn × βn + k) was applied to evaluate the importance of these features, namely the survival-related genes versus the control genes. Here, Y is the estimated value of glioma prognosis risk (GBM defined as 1, LGG defined as 0), X represents the value of the log2-fold change of each feature, β is the unknown coefficient, and k is the unknown constant. The Akaike information criterion (AIC) was utilized to evaluate the relative quality of all models, which were constructed with various combinations of features. The logistic regression model was rebuilt repeatedly (backward elimination), excluding features with low predictive value for glioma prognosis each time, until the set of features that gave the smallest AIC value was reached. The resulting features should therefore be able to recognize the survival risk of patients with GBM or LGG; thus, the expression levels of those genes relative to those of the control genes could be correlated with patients’ survival.
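Backward elimination guided by AIC can be sketched without committing to a particular fitting routine. In the Python below, `fit_log_likelihood` is a hypothetical callback that would fit a logistic regression on the given features and return its maximized log-likelihood; the greedy one-feature-at-a-time search and the toy log-likelihood values are assumptions for illustration, not the exact procedure or results of this study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def backward_eliminate(features, fit_log_likelihood):
    """Drop one feature at a time for as long as doing so lowers the AIC."""
    current = list(features)
    # The +1 counts the constant term k of the model.
    best = aic(fit_log_likelihood(current), len(current) + 1)
    while len(current) > 1:
        trials = []
        for f in current:
            reduced = [g for g in current if g != f]
            trials.append((aic(fit_log_likelihood(reduced), len(reduced) + 1), reduced))
        trial_aic, trial_set = min(trials, key=lambda t: t[0])
        if trial_aic >= best:
            break  # no single removal improves the model
        best, current = trial_aic, trial_set
    return current, best

# Toy stand-in for model fitting: each feature adds a fixed amount of log-likelihood.
gain = {"gene_a/ctrl": 5.0, "gene_b/ctrl": 5.0, "noise_1/ctrl": 0.2, "noise_2/ctrl": 0.1}

def toy_log_likelihood(features):
    return -100.0 + sum(gain[f] for f in features)

selected, best_aic = backward_eliminate(list(gain), toy_log_likelihood)
```

In the toy run the two weakly informative features are removed because the penalty of 2 per parameter outweighs their small likelihood gain, whereas removing an informative feature would raise the AIC.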

Differentiation of patients into different risk groups

After the candidate features (from the previous section, Survival risk relevant genes identification) had been identified, they were used directly to create risk models for GBM and LGG. Logistic regression was applied to construct both risk models. Here, the outcome variable Y was the estimated GBM or LGG prognosis risk (patients who did not survive beyond the mOS were considered relatively high risk, and those who lived longer than the mOS were considered low risk); for the GBM risk model, survival durations shorter than 450 days were defined as 1 and those longer than 450 days were defined as 0. Similarly, for the LGG risk model, survival durations shorter than 2700 days were defined as 1 and those longer than 2700 days were defined as 0. The variable X was the log2-fold change value of each candidate feature. The remaining parameters, β and k, were then estimated separately for the GBM and LGG risk models with the R function for generalized linear models (glm). In addition, 1000 repetitions of five-fold cross-validation were run to evaluate the GBM and LGG models, which were used to classify patients into different risk groups.
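The outcome coding and the form of the fitted model can be illustrated in Python. This is a sketch under assumptions: the function names are hypothetical, a survival time exactly equal to the median is coded 0 (the text does not specify this case), censored patients are not handled, and the coefficients in the example are made-up numbers rather than fitted values.

```python
import math

def risk_label(os_days, median_os_days):
    """1 = relatively high risk (OS shorter than the cohort median), 0 = low risk."""
    return 1 if os_days < median_os_days else 0

def predicted_risk(log2_features, betas, intercept):
    """P(Y = 1) from the linear predictor X1*b1 + ... + Xn*bn + k."""
    linear = intercept + sum(b * x for b, x in zip(betas, log2_features))
    return 1.0 / (1.0 + math.exp(-linear))

GBM_MEDIAN_OS = 450    # days, from the text
LGG_MEDIAN_OS = 2700   # days, from the text

gbm_labels = [risk_label(d, GBM_MEDIAN_OS) for d in (120, 449, 800)]
lgg_labels = [risk_label(d, LGG_MEDIAN_OS) for d in (900, 3000)]

# Made-up coefficients for two hypothetical features.
probability = predicted_risk([1.5, -0.3], betas=[0.8, 0.4], intercept=-0.2)
```

The logistic function maps the linear predictor onto the interval (0, 1), so the output can be read as the estimated probability of belonging to the high-risk group.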

Results

GBM and LGG shared key survival-related genes

In this study, using gene expression profiling, we identified 104 genes that were significantly correlated with OS in both patients with GBM and those with LGG. After the stringent criteria had been applied to filter the putative survival-related genes using the Cox model, the expression signatures of 582 and 5461 genes were found to be correlated with OS in GBM (n = 152) and LGG (n = 511), respectively. Of these, 266 genes were shared by the gene lists from both study cohorts. However, only 104 of these genes were also significantly differentially expressed (t-test, p < 0.05) between patients who survived less than and more than the median OS time in the GBM and LGG study cohorts; these 104 survival-relevant genes are listed in Additional file 1: Table S1.

Effectiveness estimation of 104 genes for GBM and LGG survival

To estimate the effectiveness of the 104 shared survival-related genes, two survival prediction models were constructed and evaluated with 1000 iterations of five-fold cross-validation for the GBM and LGG cohorts. The distribution of the area under the curve (AUC) values over the 1000 iterations is illustrated in a boxplot and summarized (Fig. 3; Table 3). The mean AUC values of the GBM and LGG models ranged from approximately 0.7 to 0.8, and the standard deviations ranged from 0.05 to 0.09. Therefore, the 104 genes appeared to affect the survival durations of patients with GBM and LGG to a certain extent.

Fig. 3 Performance estimation of survival prediction models using the 104-gene group (TCGA samples). The X-axis represents two survival prediction models that were constructed with the 104 survival-related genes; one model was constructed for GBM and the other for LGG. The Y-axis represents the distribution of AUC values after 1000 repetitions of 5-fold cross-validation

Table 3 Capability estimation of 104 key genes for GBM and LGG survival prediction

Pathway involvement and function category of survival-related genes

The 104 genes identified were common regulators related to survival in GBM and LGG and were further analyzed for their involvement in pathways and possible biological roles. Based on results that overlapped in at least two of the three tools, eight pathways were identified as core pathways (Table 4). Half of these pathways were signal transduction pathways correlated with cell survival, death, and growth. The molecular and cellular functions of the 104-gene group could be characterized using 23 biological functions (Table 5). In addition, IPA analysis revealed that these genes could be correlated with several disorders and mechanisms, such as those related to immunity, inflammation, tissue connectivity, cellular movement, immune cell trafficking, cell death and survival, and cell-to-cell signaling and interaction.

Table 4 Pathways summarized from the enrichment analysis of the 104 survival-related genes
Table 5 Molecular and cellular functions summarized from the enrichment analysis of the 104 survival-related genes

The candidate patients’ severity-relevant features

Unsupervised clustering with the 104 survival-related genes in all glioma patients (GBM and LGG) revealed that the expression levels of these genes were higher in most GBM cases and lower in LGG cases. This property could be applied as an indicator to distinguish patients’ risk (Fig. 4). Expression levels of the 104 shared genes relative to those of the 5 control genes were analyzed among all patients with LGG (n = 516) and GBM (n = 154) from TCGA. Only 86 of these genes, however, could be screened for their signatures across the different study cohorts; the other genes were skipped because they could not be validated with the GEO datasets. For each feature, the selection criterion was that more than 50% of patients with GBM and more than 50% of patients with LGG must show opposite expression tendencies relative to the control gene. Consequently, 19 features (involving 16 genes) that met this criterion were retained and then validated using the microarray datasets (Table 6). Notably, two control genes, GAPDH and RPL13A, dropped out of the analysis because the expression levels of the survival-related genes relative to these two control genes did not differ clearly between GBM and LGG. Additionally, the 16 genes involved in these features had higher expression in GBM than in LGG.

Fig. 4 Heatmap view of the unsupervised clustering of 670 patients with glioma with expression profiles of the 104-gene group (TCGA samples). In the heatmap, the Y-axis represents the 104 genes and the X-axis represents patients with glioma. The expression levels from low to high are represented as a color gradient from green to red. Three color bars on the heatmap use different colors to represent IDH status (wild type and mutation), risk group (high/low), and glioma type (LGG and GBM)

Table 6 Candidate features with different expression levels between GBM and LGG

Effectiveness features for evaluating risks of patients with glioma

Based on the assumption that patients with GBM (n = 154) have a higher risk (shorter median OS) than those with LGG (n = 516) (longer median OS), the construction of a logistic regression model with various combinations of features was repeated. The model with the lowest AIC value (239.51) was constructed from ten features, namely, CTSZ/IPO8, EFEMP2/IPO8, ITGA5/IPO8, KDELR2/SDHA, MDK/IPO8, MICALL2/TBP, MAP2K3/TBP, PLAUR/TBP, SERPINE1/TBP, and SOCS3/IPO8. Therefore, the signatures of ten genes relative to three control genes (IPO8, SDHA, and TBP) could be used to evaluate patient risks.

Patients’ risk distinguishable with ten gene signatures

After the importance of the features had been screened with the logistic regression model, these ten features were used to construct the risk models for GBM and LGG. In the GBM cohort, patients whose predicted probability of belonging to the high-risk group was less than 0.35 were assigned to the relatively low-risk group, whereas in the LGG cohort, patients were assigned to the relatively high-risk group when this probability was larger than 0.9. Accordingly, GBM patients (n = 154) were divided into a high-risk group (n = 135) and a low-risk group (n = 19), and LGG patients (n = 516) were divided into a low-risk group (n = 364) and a high-risk group (n = 152). The survival of the risk groups shown in the Kaplan-Meier curves (Fig. 5 and Fig. 6) differed significantly (log-rank test, p < 0.01). Moreover, the models were evaluated by repeating 5-fold cross-validation 1000 times; the average AUC value was 0.986 for GBM and 0.982 for LGG. In addition to the TCGA RNA-Seq datasets, the microarray dataset GSE16011, which includes a large number of cases (GBM, n = 159; LGG, n = 109) of various grades (G2, G3, and G4), was used to validate the effectiveness of the candidate genes used to construct the risk models. However, the distribution of overall gene expression levels varies between datasets from different platforms. Thus, separate GBM and LGG risk models had to be built for each platform. The resulting Kaplan-Meier curves showed that patients with GBM and LGG from GSE16011 could be clearly separated into different risk groups (log-rank test, p < 0.01) using the candidate features (Fig. 7).
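The cohort-specific cutoffs described above can be written as a small decision rule. The Python sketch below uses the two probability thresholds from the text; the function name `assign_risk_group` and the treatment of probabilities exactly equal to a cutoff are assumptions.

```python
GBM_CUTOFF = 0.35  # from the text: below this, a GBM patient is relatively low risk
LGG_CUTOFF = 0.90  # from the text: above this, an LGG patient is relatively high risk

def assign_risk_group(probability, cohort):
    """Map a predicted high-risk probability to "high" or "low" for one cohort."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if cohort == "GBM":
        return "low" if probability < GBM_CUTOFF else "high"
    if cohort == "LGG":
        return "high" if probability > LGG_CUTOFF else "low"
    raise ValueError("cohort must be 'GBM' or 'LGG'")

# Hypothetical predicted probabilities.
groups = [assign_risk_group(0.20, "GBM"), assign_risk_group(0.60, "GBM"),
          assign_risk_group(0.60, "LGG"), assign_risk_group(0.95, "LGG")]
```

The asymmetric cutoffs reflect the different baseline risk of the two cohorts: the same probability of 0.60 places a GBM patient in the high-risk group but an LGG patient in the low-risk group.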

Fig. 5 Patients with GBM were divided into high- and low-risk groups identified based on ten genes (TCGA samples). X-axis: patients’ OS duration (days); Y-axis: patients’ survival rate. The difference between the high-risk (n = 135) and low-risk (n = 19) groups was significant (log-rank test, p < 0.01)

Fig. 6 Patients with LGG were divided into high- and low-risk groups identified based on ten genes (TCGA samples). X-axis: patients’ OS duration (days); Y-axis: patients’ survival rate. The difference between the high-risk (n = 121) and low-risk (n = 395) groups was significant (log-rank test, p < 0.01)

Fig. 7 Patients with GBM and LGG were divided into high- and low-risk groups identified based on ten genes (GEO samples). X-axis: patients’ OS duration (days); Y-axis: patients’ survival rate. a GBM: the difference between the high-risk (n = 142) and low-risk (n = 17) groups was significant (log-rank test, p < 0.01). b LGG: the difference between the high-risk (n = 25) and low-risk (n = 84) groups was significant (log-rank test, p < 0.01)
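The log-rank comparisons reported for Figs. 5 to 7 follow a standard two-group calculation, which can be sketched in Python as below. This is a textbook formulation with hypothetical toy data, not the code used in this study; the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value), 1 degree of freedom."""
    data = [(t, e, "a") for t, e in zip(times_a, events_a)]
    data += [(t, e, "b") for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in data if e}):            # distinct death times
        n = sum(1 for tt, _, _ in data if tt >= t)             # at risk, both groups
        n_a = sum(1 for tt, _, g in data if tt >= t and g == "a")
        d = sum(1 for tt, e, _ in data if tt == t and e)       # deaths at t
        d_a = sum(1 for tt, e, g in data if tt == t and e and g == "a")
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0.0:
        raise ValueError("groups cannot be compared (no informative death times)")
    chi_square = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square(1) upper tail
    return chi_square, p_value

# Hypothetical groups: every "high-risk" patient dies before any "low-risk" patient.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
```

The statistic compares the observed number of deaths in one group with the number expected if both groups shared the same hazard at every death time.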

Discussion

In this study, we identified 104 common survival-related genes from patients with gliomas. The effectiveness of these genes was evaluated by constructing prediction models, and the AUC values were estimated to be approximately 0.7 and 0.8 for the GBM and LGG models, respectively, after 1000 iterations of 5-fold cross-validation. The heatmap (Fig. 4) shows that the expression profiles of these genes are associated with IDH1 and risk status among patients with gliomas; most patients with GBM have wild-type IDH1 and a short survival time (high risk), whereas most patients with LGG have mutant IDH1 and a long survival time (low risk). Most of these genes were involved in cell-related signaling pathways that affect cellular proliferation, apoptosis, and angiogenesis. Moreover, of the 104 survival-related genes, 10 could potentially separate patients with GBM or LGG into high- and low-risk groups. The expression levels of these ten genes were higher and the survival duration was shorter in patients with high-grade glioma than in those with lower-grade glioma.

Identification of survival-related genes in gliomas has been ongoing over the past decade. However, the gene lists identified by our study and by the other research groups [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] differ considerably; only 13 common genes (BMP2, CLIC1, EST, IGFBP2, LDHA, LGALS1, MET, MSN, TAGLN2, TIMP1, TNC, UPP1, ZYX) could be identified in at least three studies. These differences may be attributed to two major factors. First, researchers have analyzed glioma datasets from various perspectives; for instance, some studies considered only patients with high-grade gliomas [14, 20, 23,24,25,26,27, 29] or LGG [21, 28], or focused on specific gene groups such as the mitotic spindle checkpoint [19] or ion channels [27]. Second, studies have analyzed different types of datasets obtained from various high-throughput platforms, such as microarray or next-generation sequencing data. Because of technical limitations, the expression profiles of the same genes detected from different datasets may be inconsistent. For instance, sequencing data have higher stochastic variability than array data, which can result in a lack of reads for short or low-abundance genes [40]. On the other hand, microarray gene expression data can be affected by cross-hybridization, nonspecific hybridization, redundancy, and annotation of the probes [41]. In most relevant studies, microarray analysis rather than NGS was selected as the initial screening method. Recently, various research groups have started using NGS data (e.g., RNA-Seq) as the main analysis platform and microarray data as an adjuvant platform to verify results.

Accurate survival prediction through comprehensive indicators is vital for patients with glioma. The 104-gene group identified in this study covers only part of those indicators, although it was important for survival prediction in both types of gliomas; its performance in the GBM and LGG cohorts suggested that other, type-specific survival-related genes might exist. Nevertheless, these 104 genes were basic factors for patients’ survival in both GBM and LGG, because the average AUC value over repeated simulations reached 0.7–0.8. These genes could be used together with the genes specific to each type of glioma to elaborate the regulatory networks of various mechanisms. In addition, recent studies have identified crucial glioma imaging features from magnetic resonance imaging (MRI) and have correlated them with patient survival [42, 43]. The association between imaging and genomic features could be explored and applied in the field of radiogenomics.

The expression level analysis of survival-related genes could have implications: high expression of these genes in patients was indicative of shorter survival durations, whereas low expression was indicative of longer survival. Moreover, most of these genes were highly expressed in GBM, and the converse was true for LGG. However, because of individual differences, it is difficult to set cutoff values that indicate whether gene expression is high or low. Therefore, reference genes (called control genes in this study) were used as the targets of comparison for the survival-relevant genes. Furthermore, to obtain objective indicators, different datasets were used to validate the results from NGS, and features that the logistic regression model found to contribute little were removed.

Survival-relevant risk models were constructed for GBM and LGG; their average AUC values exceeded 0.95, which indicates that they could classify patients into different risk groups. In GBM, the low-risk group survived longer than the high-risk group, and its median OS was longer than 450 days (1.2 years). On the other hand, in LGG, the high-risk group had shorter survival than the low-risk group, and its median OS was shorter than 2700 days (7.4 years). Recent studies [44,45,46,47] have demonstrated that gliomas could be divided into multiple subtypes based on various molecular features such as IDH1 mutation/wild type and chromosome 1p/19q noncodeletion/codeletion. Generally, IDH1 mutation and 1p/19q codeletion are favorable prognostic factors for patients with gliomas. Among patients with GBM (n = 154) from TCGA, all those whose chromosome 1p/19q status was available showed noncodeletion, and only 10 patients showed IDH1 mutation. In the low-risk group (n = 19) of GBM, as identified using the logistic regression model, 8 patients had mutant IDH1. This result indicated that 80% of the GBM patients with IDH1 mutation were assigned to the relatively low-risk group by the risk prediction model. Unlike in GBM, in the LGG datasets (n = 516) from TCGA, IDH1 was mutated in most patients (n = 419) and fewer patients (n = 94) had wild-type IDH1. In addition, with regard to chromosome 1p/19q status in patients with LGG, 347 patients showed noncodeletion and the other 169 showed codeletion. In the relatively high-risk group (n = 121) of LGG identified by our model, most patients had wild-type IDH1 (n = 73) and 1p/19q noncodeletion (n = 108), accounting for 60.3% and 89.3% of the cases, respectively (Table 7). Therefore, the molecular features of the high- and low-risk groups identified using the signatures of these ten genes were in accordance with the results of previous studies. However, the risk model could not classify patients with wild-type IDH1 into different risk groups well. For instance, 22.3% (n = 21) and 77.7% (n = 73) of patients with wild-type IDH1 in the LGG cohort (n = 94) were identified as low and high risk, respectively (Table 7); the survival curves of the two risk groups could not be separated significantly, especially for survival durations of less than 750 days (Fig. 8). Therefore, other effective predictors need to be identified to distinguish patients with better survival among patients with glioma having wild-type IDH1. By contrast, the risk model could classify patients with mutant IDH1 in the LGG cohort (n = 419) into different risk groups well: 47 (11.2%) and 372 (88.8%) patients belonged to the high- and low-risk groups, respectively (Fig. 9). The high-risk group of patients with IDH1 mutation might be associated with DNA methylation, as indicated in a study based on molecular profiling analysis [48].

Table 7 IDH1 and 1p/19q status of high- and low-risk groups of patients with LGG and those with GBM
Fig. 8 High- and low-risk groups of patients with LGG having wild-type IDH1 identified based on ten genes (TCGA samples). X-axis: patients’ OS duration (days); Y-axis: patients’ survival rate. The two risk groups could not be separated well (log-rank test, p = 0.248, not statistically significant)

Fig. 9 High- and low-risk groups of patients with LGG having mutant-type IDH1 identified based on ten genes (TCGA samples). X-axis: patients’ OS duration (days); Y-axis: patients’ survival rate. The difference between the high-risk (n = 47) and low-risk (n = 372) groups was significant (log-rank test, p < 0.05)

The expression levels of the ten aforementioned genes tended to decrease gradually from GBM to LGG (Fig. 10). Patients’ OS duration was shorter with high gene expression and longer with low gene expression. Several of these genes have been reported in previous studies; for instance, EFEMP2 was indicated as a potent oncogene in glioma and a target for glioma treatment [49], the overexpression of SERPINE1 (PAI-1) was significantly correlated with shorter survival durations in patients with GBM [50], the expression of ITGA5 might be correlated with the regulation of cell proliferation and invasiveness in GBM, because targeting ITGA5 using miR-330-5p could affect these cell events [51], and the biological function of PLAUR could be related to glioma cell invasion and angiogenesis [52]. In addition, the ten genes could be mapped to various hallmarks of cancer through a literature survey (Table 8). Therefore, this 10-gene group might have potential prognostic value for patients with glioma.

Fig. 10 Expression levels of ten genes decreased from GBM to LGG (TCGA samples). All patients included (n = 247) had died (121 patients with GBM and 126 patients with LGG). The X-axis represents the patients’ OS duration and the Y-axis represents their gene expression levels (FPKM). The X-axis labels from left to right are: “Before median overall survival (OS) of GBM,” “After median OS of GBM,” “Before median OS of LGG,” and “After median OS of LGG”

Table 8 The cancer hallmarks mapping of the ten genes

Conclusions

In summary, the 104 genes identified, which are common between patients with GBM and those with LGG, can be used as core genes related to patient survival. Of these, 10 genes (CTSZ, EFEMP2, ITGA5, KDELR2, MDK, MICALL2, MAP2K3, PLAUR, SERPINE1, and SOCS3) can potentially serve as indicators to classify patients with gliomas into different risk groups and could be used to estimate the prognosis of patients with gliomas. Moreover, the expression profiles of these potential biomarkers could be correlated to the molecular subtypes of patients, such as IDH1/2 mutation/wild type and chromosome 1p/19q codeletion/noncodeletion.