Expression patterns of small numbers of transcripts from functionally-related pathways predict survival in multiple cancers
Genetic profiling of cancers for variations in copy number, structure or expression of certain genes has improved diagnosis, risk-stratification and therapeutic decision-making. However, the tumor-restricted nature of these changes limits their application to certain cancer types or sub-types. Tests with broader prognostic capabilities are lacking.
Using RNAseq data from 10,227 tumors in The Cancer Genome Atlas (TCGA), we evaluated 212 protein-coding transcripts from 12 cancer-related pathways. We employed t-distributed stochastic neighbor embedding (t-SNE) to identify expression pattern differences among each pathway’s transcripts. We have previously used t-SNE to show that survival in some cancers correlates with expression patterns of transcripts encoding ribosomal proteins and enzymes for cholesterol biosynthesis and fatty acid oxidation.
Using the above 212 transcripts, t-SNE-assisted transcript pattern profiling identified patient cohorts with significant survival differences in 30 of 34 different cancer types comprising 9350 tumors (91.4% of all TCGA cases). Small subsets of each pathway’s transcripts, comprising no more than 50–60 from the original group, played particularly prominent roles in determining overall t-SNE patterns. In several cases, further refinements in long-term survival could be achieved by sequential t-SNE profiling with two pathways’ transcripts, by a combination of t-SNE plus whole transcriptome profiling or by employing t-SNE on immuno-histochemically defined breast cancer subtypes. In two cancer types, individuals with Stage IV disease at presentation could be readily subdivided into groups with highly significant survival differences based on t-SNE-based tumor sub-classification.
t-SNE-assisted profiling of a small number of transcripts allows the prediction of long-term survival across multiple cancer types.
Keywords: Cancer metabolism, Myc, Notch, PI3 kinase, TP53, Transcriptional profiling, t-SNE, Wnt
FAO: Fatty acid oxidation
FPKM-UQ: Fragments per kilobase-million upper quartile
RPT: Ribosomal protein transcript
TCGA: The Cancer Genome Atlas
TNBC: Triple negative breast cancer
t-SNE: t-distributed stochastic neighbor embedding
Molecular genetic advances, particularly next-generation DNA and RNA sequencing, have identified gene copy number variations, recurrent mutations, rearrangements and transcript expression differences in many cancers. These may be associated with specific tumor subtypes, biological behaviors, therapeutic responses and outcomes not otherwise revealed by more traditional histologic or immuno-histochemical assessments [1, 2, 3, 4, 5]. However, such tests tend to focus upon and be of value for only specific cancer types or subtypes and are generally not more broadly applicable. There is thus a clear need to assess these parameters more globally and across multiple cancers with a common and preferably small set of genes. The availability of such a test could greatly simplify and expand the molecular evaluation of cancers, further improve prognostication and therapeutic stratification and aid in decisions regarding the frequency and intensity of post-therapy follow-up.
Using the machine learning algorithm “t-distributed stochastic neighbor embedding” (t-SNE), we have previously demonstrated that the expression patterns of ribosomal protein transcripts (RPTs) differ among normal tissues and cancers in distinct and reproducible ways that are largely independent of absolute expression levels [7, 8]. Most cancers contained two to five distinct RPT t-SNE expression patterns (“clusters”), and in seven cancer types these correlated with long-term survival. More recently, we made similar observations with transcripts encoding enzymes involved in cholesterol biosynthesis and fatty acid oxidation (FAO).
Ribosomal biogenesis, cholesterol biosynthesis and FAO are only three of numerous growth- and metabolism-related pathways that are de-regulated in cancer [8, 9, 10, 11]. To investigate whether other transcripts are similarly informative of long-term survival, we used t-SNE-assisted clustering to ascertain the expression patterns of 212 protein-coding transcripts from twelve additional cancer-related pathways. We found these patterns to be predictive of survival for 30 of the 34 distinct cancers, comprising 91.4% of the 10,227 tumors from The Cancer Genome Atlas’ (TCGA) PanCancer Atlas collection.
Selection of pathways and RNAseq data
Transcripts for 8 of the 12 cancer-related pathways listed in Additional file 1: Table S1 were obtained from Sanchez et al. Additional transcripts encoding enzymes of the Purine and Pyrimidine Biosynthetic Pathways and Pentose Phosphate Pathway were selected because of their established roles in providing critical anabolic precursors for nucleic acid synthesis [12, 13, 14]. TCA Cycle transcripts were selected because oxidative phosphorylation is often altered in cancer cells as they reprogram glucose, fatty acid and glutamine metabolism. RNAseq expression data (FPKM-UQ) were taken from the TCGA GDC PANCAN dataset and accessed through the UCSC Xenabrowser. These represent RNAseq results from 10,227 untreated primary tumors. The only exception was uveal melanomas, where all tumors were metastatic (SI Appendix Table S2). Expression values were stored in the dataset as the base-two logarithm of the FPKM-UQ value incremented by one, i.e. log2(FPKM-UQ + 1). The inverse of this transformation was applied to recover the original FPKM-UQ values.
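The inverse transformation described above can be illustrated with a minimal sketch. The function names and the toy values are hypothetical and are not taken from the study's software; the only assumption is that the stored value is y = log2(x + 1), so that x = 2^y − 1.

```python
import math


def log2p1_to_fpkm_uq(stored_value: float) -> float:
    """Invert y = log2(x + 1) to recover x, the FPKM-UQ value."""
    return 2.0 ** stored_value - 1.0


def fpkm_uq_to_log2p1(fpkm_uq: float) -> float:
    """Forward transform, included only to check the round trip."""
    return math.log2(fpkm_uq + 1.0)


# Toy stored values (illustrative, not real TCGA data)
stored = [0.0, 1.0, 10.0]
recovered = [log2p1_to_fpkm_uq(v) for v in stored]
```

A stored value of 0 therefore corresponds to an FPKM-UQ of 0, which is the reason for the increment-by-one convention: unexpressed transcripts remain representable on the logarithmic scale.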
Depiction of cancer pathway transcript patterns
Prior to t-SNE visualization, RNA expression data were centered and normalized for each pathway. Briefly, every primary tumor sample was assigned an “expression vector” in n-dimensional space for each pathway, where n was equal to the number of genes in the pathway and each element of each vector was equal to the FPKM-UQ expression value of a particular gene. For each pathway-cancer type combination, the associated expression vectors were centered by subtracting from each one the mean of all the vectors. The centered vectors were then normalized by their magnitudes. The result was that all centered expression vectors were projected onto a hypersphere in n-dimensional space. For each cancer type and pathway, the vectors on this hypersphere were the input to t-SNE. t-SNE analyses of each pathway’s transcript patterns were performed using Tensorboard Release 1.12.0 in three dimensions to maximize the appreciation of the compactness and separateness of the resulting clusters. Multiple t-SNE runs were initially attempted with perplexities ranging between 5 and 30, and learning rates of 0.1, 1, 10, or 100. For each cancer-pathway combination, parameters that produced obviously distinguishable clusters were selected for further validation by multiple runs. Cancer-pathway combinations for which no set of parameters could be found that produced obviously distinguishable clusters were excluded from further analysis. We heuristically defined an obvious cluster as a densely distributed collection of points in the embedding space separated from other such dense regions by clearly discernible regions containing no embedded points. For the final selected parameters, t-SNE was run for at least 2500 iterations and until the embedding stabilized. After embedding, the number of clusters was recorded. Members of the clusters were specified using Gaussian mixture models implemented through MATLAB’s “fitgmdist” and “cluster” functions.
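The centering and normalization step that precedes t-SNE can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name, the toy matrix and the handling of a zero-length vector are all assumptions.

```python
import math


def center_and_normalize(vectors):
    """Subtract the mean vector from every sample, then scale each to unit length.

    `vectors` holds one expression vector per tumor (rows) with one
    FPKM-UQ value per pathway gene (columns).
    """
    n_samples = len(vectors)
    n_genes = len(vectors[0])
    # Mean expression vector across all tumors of this cancer type
    mean = [sum(v[j] for v in vectors) / n_samples for j in range(n_genes)]
    projected = []
    for v in vectors:
        centered = [v[j] - mean[j] for j in range(n_genes)]
        norm = math.sqrt(sum(c * c for c in centered))
        # Assumption: a sample equal to the mean has zero length and is left as zeros
        projected.append([c / norm for c in centered] if norm > 0 else centered)
    return projected


# Toy data: 3 tumors x 3 genes (illustrative values only)
toy = [[10.0, 0.0, 5.0], [0.0, 10.0, 5.0], [5.0, 5.0, 20.0]]
unit_vectors = center_and_normalize(toy)
```

After this step every non-degenerate sample has length one, so only the direction of its expression vector, i.e. the relative pattern of the pathway's transcripts rather than their absolute magnitude, is passed on to the embedding.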
Though Gaussian mixture models were used to assign samples to clusters, we refer to the clusters as “t-SNE clusters”. The default “K-means++” algorithm was used to set initial conditions in all cases. In some cases, the output t-SNE data were randomly perturbed by 5% of the radius of the smallest sphere that contained all the output points before clustering. The number of Gaussian components used was equal to the number of clusters previously identified. For each t-SNE profile, every combination of full or diagonal covariance matrices, shared or unshared covariance and the application or non-application of the aforementioned perturbation was iteratively tried when fitting the Gaussian mixture model, for a total of eight attempts with different parameter settings. The output that best preserved the unity of the obviously distinguishable clusters in the t-SNE was chosen for display in all figures and for further analysis. Finally, the aforementioned perturbation was applied to the actual output t-SNE scatterplot displayed in the figures in cases where clusters were so dense as to prevent their individual component members from being readily visualized. The parameters used for each t-SNE and Gaussian mixture clustering are listed in Additional file 1: Table S3. Parametric t-SNE was used to confirm the clusters found with the initial t-SNE-assisted clustering, using the same perplexity as in that initial analysis, where three-quarters of the data were used for training and one quarter was withheld as a test set.
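The eight Gaussian mixture attempts follow from three binary choices (2 × 2 × 2). The short sketch below merely enumerates them; the dictionary keys are hypothetical labels chosen for illustration, and the fitting itself (performed in the study with MATLAB's “fitgmdist”) is not reproduced here.

```python
from itertools import product

# Three binary choices described in the text
covariance_shapes = ("full", "diagonal")   # shape of each covariance matrix
shared_covariance = (True, False)          # one covariance for all components, or one each
apply_perturbation = (True, False)         # jitter the t-SNE output before clustering

settings = [
    {"covariance": shape, "shared": shared, "perturbed": perturbed}
    for shape, shared, perturbed in product(
        covariance_shapes, shared_covariance, apply_perturbation
    )
]

for number, setting in enumerate(settings, start=1):
    print(number, setting)
```

Each setting would be fitted with the number of components fixed to the number of clusters seen in the embedding, and the result that best kept the visually distinct clusters intact would be retained.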
Comparing t-SNE clusters
Clinical and survival data for TCGA cancer cohorts were accessed using the UCSC Xenabrowser under the data heading “Phenotypes”. Kaplan-Meier survival curves for each t-SNE cluster were compared using Mantel-Haenszel (log-rank) methods through the “Matsurv” function on the MATLAB file exchange and confirmed in Graphpad Prism 7.
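For readers unfamiliar with the log-rank comparison, a minimal two-group sketch is given below. It is not the Matsurv or Prism implementation, whose handling of ties, weighting and more than two groups may differ; the function name and the toy survival times are assumptions made for illustration.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Events are 1 for death and 0 for censoring.

    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)
        n_a = sum(1 for s in at_risk if s[2] == 0)
        deaths = sum(1 for s in at_risk if s[0] == t and s[1])
        deaths_a = sum(1 for s in at_risk if s[0] == t and s[1] and s[2] == 0)
        observed_a += deaths_a
        expected_a += deaths * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A
            variance += deaths * (n_a / n) * (1 - n_a / n) * (n - deaths) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Toy survival times in days (illustrative only)
cluster_1 = ([5, 8, 12, 20, 25], [1, 1, 1, 1, 1])
cluster_2 = ([30, 42, 55, 60, 75], [1, 1, 1, 0, 1])
chi2, p_value = logrank_two_groups(*cluster_1, *cluster_2)
```

The test compares, at every death time, the number of deaths observed in one cluster with the number expected if both clusters shared the same hazard; censored patients contribute to the at-risk counts until their last follow-up.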
Random forest analyses
To identify genetic features that differed the most among t-SNE clusters, a random forest classifier model [20, 21] was employed through MATLAB’s ‘TreeBagger’ function in the ‘Statistics and Machine Learning Toolbox’, with ‘NumTrees’ equal to 100, ‘OOBPredictorImportance’ turned on, ‘NumPredictorsToSample’ set to ‘all’, and ‘PredictorSelection’ set to ‘interaction-curvature’. The importance of each transcript in distinguishing the clusters from one another was indicated by the ‘OOBPermutedPredictor’ field of the object returned by the ‘TreeBagger’ function.
Comparing t-SNE clusters with hierarchical groups
To investigate relationships between t-SNE clusters and the entire expressed protein-coding genome, four cancer types were selected for full transcriptome visualization by hierarchically clustered heat maps. RNAseq-based heat maps of the cancers of interest were downloaded from the TCGA Next-Generation Heat Map Compendium. We selected the platform “RNA Expression” and the heat map type “Gene/Probe vs Sample”. The tumors represented in these heat maps overlapped substantially with those used for t-SNE. Samples were pre-divided into hierarchical groups (hereafter referred to as “Dendros” to avoid confusion with the t-SNE clusters). Individual tumors within each Dendro were then identified according to the t-SNE clusters with which they associated. Significance of survival differences between these groups within each Dendro was assessed in Graphpad Prism 7 using log-rank tests.
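Assigning tumors within each Dendro to their t-SNE clusters amounts to a cross-tabulation over the samples common to both analyses. The sketch below shows one way this could be done; the sample identifiers, labels and function name are hypothetical.

```python
from collections import Counter


def cross_tabulate(dendro_by_sample, cluster_by_sample):
    """Count tumors for each (Dendro, t-SNE cluster) pair among shared samples."""
    shared = sorted(set(dendro_by_sample) & set(cluster_by_sample))
    return Counter((dendro_by_sample[s], cluster_by_sample[s]) for s in shared)


# Toy labels (illustrative only); T4 and T5 each appear in just one analysis
dendro = {"T1": "Dendro 1", "T2": "Dendro 1", "T3": "Dendro 3", "T4": "Dendro 3"}
cluster = {"T1": 2, "T2": 2, "T3": 1, "T5": 1}
table = cross_tabulate(dendro, cluster)

for (dendro_label, cluster_label), count in sorted(table.items()):
    print(dendro_label, "t-SNE cluster", cluster_label, count)
```

Survival within each Dendro can then be compared between the t-SNE clusters it contains, which is the comparison reported for the log-rank tests above.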
Transcript patterns from cancer-related pathways correlate with survival
Certain RPTs disproportionately and recurrently shape t-SNE clusters [8, 9]. We therefore applied a Random Forest classifier to identify such key transcripts in each of the above 12 pathways. These were relatively few in number, ranging from as few as one or two to as many as four to six depending both on the tumor type and the specific pathway (Additional file 1: Figures S25–36). Thus, a much smaller subset of the original 212 transcript collection, comprising no more than 50–60 members, contributed disproportionately to the t-SNE profiles of most cancers.
Additional predictive value from sequential t-SNE analysis and whole transcriptome profiling
Related findings were made in clear cell kidney cancer, where whole transcriptome profiling generated 4 dendrograms (Dendro 1–4) with Dendro 1 having particularly unfavorable survival (Additional file 1 Figure S39A&B). Unlike the more random distribution of t-SNE clusters seen in Fig. 5a, the Dendro 1 group was overly populated by Pyrimidine Biosynthetic Pathway t-SNE Cluster 2 tumors (also with unfavorable outcomes, Fig. 2) whereas the Dendro 3 group with a favorable outcome contained a preponderance of t-SNE Cluster 1 tumors, also with more favorable outcomes. However, both t-SNE groups could be further sub-divided into distinct survival cohorts when categorized by their respective Dendro group (Additional file 1 Figure S39 C&D). Additional variations of these general themes were seen with Myc Pathway transcripts in sarcomas and TCA Cycle Pathway transcripts in bladder cancer (Additional file 1: Figures S40 & S41). t-SNE-based analysis is thus comparable and in some cases even superior to whole transcriptome profiling for forecasting long-term survival. As with sequential t-SNE profiling, the two methods can be used in tandem to better define tumor subgroups and long-term survival.
t-SNE complements sub-classification and clinical staging for certain cancers
Triple-negative breast cancer (TNBC), which represents 10–20% of all breast tumors, is defined by the lack of immuno-histochemical staining for the estrogen and progesterone receptors and the cell surface epidermal growth factor receptor HER2. It has the most unfavorable outcome of all breast cancer subtypes due primarily to its propensity for early metastatic recurrence [26, 27]. In contrast, the Luminal A form, representing 50–60% of all cases, has the most favorable long-term survival [28, 29]. Belying the apparent simplicity of this long-standing classification scheme, however, is the fact that TNBC and Luminal A variants have each been recently sub-classified into several distinct molecular entities based on whole transcriptomic profiling [26, 27, 30, 31].
t-SNE-based profiling of breast cancers with Myc Pathway member transcripts did not initially identify groups with significantly different survival (Fig. 3). However, the analysis of Luminal A tumors but not TNBCs with this pathway’s transcripts did further enhance survival prediction (Fig. 6d&e). Taken together, these results demonstrate that, at least in the case of breast cancer, well-defined molecular subtypes could be further categorized by the subsequent interrogation with t-SNE-based transcriptional profiling.
On average, Random Forest classification had shown that approximately three Wnt Pathway transcripts were the major determinants of t-SNE cluster profiles among the 12 different cancer types, including all breast cancers, where differential survival among Clusters was observed (Fig. 3). The most prominent of these transcripts were Sfrp2, Ctnnb1 and Dkk1/3 (Feature Importance > 1, Additional file 1 Figure S26). In the case of TNBC, however, this patterning was determined exclusively by Sfrp2 (Fig. 6f). Consistent with this, Cluster 5 tumors expressed the highest levels of Sfrp2 transcripts (Fig. 6g).
t-SNE clusters generated by Myc Pathway transcripts in 11 relevant tumor types were also determined by an average of three transcripts/tumor type with the most common ones being Myc, N-Myc and Mxd2 (Additional file 1 Figure S34). The t-SNE clusters of Luminal A cancers, in contrast, were more driven by Myc and Mxd2 (Fig. 6h). Interestingly, the Cluster 1 tumors of this subset, which expressed high levels of Myc and Mxd2, were associated with the worst prognosis (Fig. 6i&j).
Similar findings were made in head and neck squamous cell cancers (SI Appendix Table S20) where t-SNE profiling with Myc Pathway transcripts had previously identified four distinct clusters with significant survival differences (Figs. 1 and 2). As with bladder cancers, the primary tumors from 247 Stage IV cancers were randomly distributed among these groups (P = 0.075, Fig. 7d). Among these tumors, however, t-SNE Cluster 4 was associated with a significantly longer median survival (2120 days) than the other clusters (combined median survival = 915 days).
Molecular tests such as MammaPrint™ and ThyroSeq™ have proven highly useful in guiding the diagnosis and prognosis of select cancers [4, 5]. In the former case, a 70-gene expression signature in Stage I and II breast cancer can accurately predict the likelihood of recurrence following surgical extirpation and thus inform the need for adjuvant chemotherapy. ThyroSeq™ utilizes a collection of ~140 gene copy number variations, fusions and transcript expression differences to diagnose and classify malignant thyroid nodules of indeterminate histology. Despite their utility, these tests are relevant only to their respective tumor types or, more specifically, certain stages or subtypes and lack broader applicability.
We have demonstrated here the feasibility of predicting survival in multiple cancer types based on the expression patterns of small, functionally related subsets of a 212-member transcript collection. These were drawn from 12 canonical pathways with well-established roles in cancer cell proliferation, survival and metabolism [11, 12, 13, 14, 15, 25, 32, 33]. However, unlike whole transcriptome profiling where gene expression levels correlate with survival in specific cancers (Fig. 5a, Additional file 1 Figures S39-S41), the value of the analyses reported here lies in the expression patterns of small numbers of transcripts across multiple tumor types. Indeed, in 30 of 34 cancers, these patterns were so highly predictive of survival that transcripts from a single pathway sufficed for this purpose. Examples include the Cell Cycle Pathway (15 transcripts) in acute myelogenous leukemia, the PI3K Pathway (18 transcripts) in low-grade gliomas and any one of nine pathways, each comprising 6–30 transcripts, in clear cell kidney cancer (Fig. 3). Of the 30 cancer types for which t-SNE-assisted profiling was useful, an average of 3.7 pathways/tumor type correlated with survival, thus providing coverage of 91.4% of all cancers archived in TCGA. Our previous t-SNE profiling with transcripts encoding ribosomal proteins and enzymes involved in cholesterol biosynthesis and FAO [7, 8, 9] was prognostic for 17 of the listed cancers but likewise did not include any of the four not covered by the current 12 pathways (Fig. 3). The future addition of new pathways may eventually prove to be of prognostic value in these four cancer types. It is worth considering the possibility that the failure of this approach in testicular germ cell tumor may reflect this cancer’s extraordinarily high cure rate. For these reasons, the current numbers must be considered provisional and likely to expand.
The precise fraction of cancers for which t-SNE profiling will prove useful is also likely to change somewhat given that the TCGA database is biased both for and against particular cancer types (for example, it excludes many rare cancers and most pediatric cancers).
Unsurprisingly, many of the above pathways’ transcripts encode known oncoproteins and tumor suppressors such as Myc, PTEN, p53 and IDH1/2 whose mutation, expression level and/or de-regulation frequently correlate with various individual cancers and their outcomes (SI Appendix Table S1) [11, 35, 36, 37, 38, 39, 40, 41]. However, we show that an additional and more powerful prognostic aspect of these transcripts resides in the patterns they assume relative to other transcripts in their respective pathways. These patterns likely serve as surrogate reporters for the unique transcriptional and post-transcriptional environments that characterize each cancer type and that dictate its relevant behaviors in much the same way as does whole transcriptome hierarchical clustering [42, 43, 44]. Such patterns are undoubtedly determined by numerous interdependent factors including chromatin conformation; the binding and activities of promoter-proximal complexes such as RNA polymerase II and Mediator; the binding and activities of adjacent transcriptional factors; the long-range contribution of protein-bound enhancers and super-enhancers; and the regulation of all these by post-translational modifications, metabolites and additional tissue-specific proteins [42, 45, 46, 47, 48]. Differences in mRNA splicing and stability further influence mature transcript expression levels in tissue- and tumor-specific ways [49, 50]. That transcript patterns may reflect a more complex control than do absolute levels is suggested by the fact that, in at least some cases, these two do not correlate (Fig. 5a). Based on presumably similar regulatory dependencies, it seems likely that t-SNE patterns will also correlate with other important tumor behaviors such as therapeutic responses and their durability.
Also to be emphasized is that the entire 212 transcript repertoire reported here is unnecessary for assessing any individual tumor type. Rather, only those pathways of previously demonstrated predictive value for a particular tumor type need be selected (Additional file 1 Figures S25–36). In the case of low-grade gliomas and clear cell renal cancer, for example, this could be as many as nine pathways or as few as a single one for colo-rectal and prostate cancers (Fig. 3).
In some cases, additional prognostic information was extracted using sequential t-SNE analysis or whole transcriptome profiling (Figs. 4 and 5 and Additional file 1 Figures S37-S41). It is in tumor types such as pancreatic ductal adenocarcinoma where particular t-SNE profiles are more evenly distributed across the entire transcriptome spectrum that the combined advantages of these two independent approaches are likely to have the greatest impact (Fig. 5).
Importantly, more traditional and clinically well-integrated ways of classifying tumors can also be complemented using the t-SNE-based profiling described here so as to allow for the identification of more or less challenging tumor subsets. This was well-illustrated for breast cancer where the TNBC and Luminal A subtypes, already long-known as having distinct outcomes [23, 24, 26, 27, 28, 29], could both be subdivided, albeit with different sets of transcripts (Fig. 6). The high-risk TNBC group was particularly interesting as these patients’ t-SNE profiles and their long-term survival were entirely driven by the expression of Sfrp2 (Fig. 6b,c & g). Sfrp2 (“secreted frizzled-related protein”) is a cell surface protein that is highly expressed by breast cancer-associated endothelial cells and correlates inversely with survival. Monoclonal antibody-mediated inhibition of Sfrp2 has been shown to reduce tumor growth and prolong survival in a mouse model of TNBC. This suggests the intriguing possibility that some transcripts that are predictive of survival may not necessarily be expressed by actual tumor cells but rather by stromal elements that play critical roles in maintaining tumor growth, nutrition and oxygenation.
For both bladder and head and neck squamous cell cancer, t-SNE profiling also complemented and strengthened the well-recognized prognostic power of classical clinical staging, which is largely predicated on well-established clinico-pathologic criteria such as tumor size and location, local invasion, lymphatic involvement and distant metastatic spread, with the latter being indicative of the most advanced, i.e. Stage IV, disease. The fact that the t-SNE clusters of Stage IV tumors were indistinguishable from those of their less advanced counterparts (Fig. 7b&d) argues that, rather than simply being correlated with and perhaps the result of more advanced disease, transcript expression patterns are a fundamental property of their respective tumors that likely precedes the onset of metastatic dissemination. More extensive testing involving additional cancers and transcript pathways will be required to determine how t-SNE-based analyses can best be integrated with these more traditional types of evaluation so as to establish the best clinical practice.
It is important to emphasize that, like all other clinico-pathologic, biochemical and molecular analyses, the results generated by t-SNE profiling must be interpreted cautiously and in light of other factors that are not necessarily accounted for by the analysis itself and that, either individually or together, could affect long-term outcomes. These might include such non-mutually exclusive factors as age and frailty, co-existing organ dysfunction that limits chemotherapy dosing, disparities in the quality of care or the inability to initiate or continue treatment.
Collectively, our results demonstrate that the expression patterns rather than the absolute levels of small, functionally related sets of transcripts can be used to achieve highly accurate projections of long-term survival in the vast majority of cancer patients. In most instances, several different pathways can be selected for the analysis of any particular tumor type. Together, the pathways can be used to predict survival in at least three and as many as 14 different tumor types for which the approach is applicable. Additional versatility is demonstrated by the fact that tandem t-SNE profiling or t-SNE profiling in conjunction with whole transcriptome analysis affords even greater refinement of survival prediction. This remains true when t-SNE profiling is combined with more traditional forms of tumor assessment such as immuno-histochemical staining and clinico-pathologic staging. Future efforts should continue to focus on and improve the benefits offered by such combinatorial analyses. While the prognostic advantages of these sequential approaches may initially be somewhat limited in their statistical power by relatively small patient numbers, this limitation is likely to diminish with the accrual of additional data.
JAM performed analyses and developed software. HW performed analyses. DPN provided statistical analysis and consultation. WC, QY, PCL and PVB performed analyses and interpretation. EVP conceived the study and supervised research. EVP and JAM wrote the manuscript. All authors corrected, read and approved the final manuscript.
Supported by NIH grant CA174713 and by a Hyundai Hope on Wheels Scholar Grant (both to E.V.P.) and NIH Award P30CA047904 to The University of Pittsburgh Hillman Cancer Center. The funding bodies were neither involved in the design of the study, nor in the collection, analysis, or interpretation of data nor in the writing of the manuscript.
Ethics approval and consent to participate
No ethics approval was required for this work. All utilized public omics data sets were generated by others who obtained ethical approval.
Consent for publication
The authors declare that they have no competing interests.
- 6. van der Maaten LJP, Hinton GE. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.
- 9. Wang H, Lu J, Dolezal JM, Kulkarni S, Chen A, Gorka J, et al. Inhibition of hepatocellular carcinoma by metabolic normalization. 2019; (submitted for publication).
- 16. University of California Santa Cruz Xenabrowser. https://xenabrowser.net/. Accessed 15 March 2019.
- 17. Tensorboard. https://www.tensorflow.org/guide/summaries_and_tensorboard. Accessed 15 September 2018.
- 18. van der Maaten L. Learning a parametric embedding by preserving local structure. J Mach Learn Res. 2009;5:384–91.
- 19. Berglund A. Matsurv. https://www.mathworks.com/matlabcentral/fileexchange/64582-matsurv. Accessed 15 March 2019.
- 27. Bauer KR, Brown M, Cress RD, Parise CA, Caggiano V. Descriptive analysis of estrogen receptor (ER)-negative, progesterone receptor (PR)-negative, and HER2-negative invasive breast cancer, the so-called triple-negative phenotype: a population-based study from the California Cancer registry. Cancer. 2007;109:1721–8.
- 54. American Joint Committee on Cancer. https://cancerstaging.org/references-tools/pages/what-is-cancer-staging.aspx. Accessed 15 March 2019.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.