Analyzing the similarity of samples and genes by MG-PCC algorithm, t-SNE-SS and t-SNE-SG maps
Abstract
Background
Clustering and visualizing samples and genes are important methods for analyzing gene expression data sets collected under different samples. However, it is difficult to integrate clustering and visualization techniques when the similarities of samples and genes are defined by the PCC (Pearson correlation coefficient) measure.
Results
Here, for gene expression data sets with few samples, we use the MG-PCC (mini-groups defined by PCC) algorithm to divide the samples into mini-groups, and use t-SNE-SSP maps to display these mini-groups. The idea of the MG-PCC algorithm is that nearest neighbors should be in the same mini-group. A t-SNE-SSP map is selected from a series of t-SNE (t-distributed Stochastic Neighbor Embedding) maps of standardized samples that differ in their perplexity parameter. Moreover, PCC clusters of the much more numerous genes are displayed by a t-SNE-SGI map, which is selected from a series of t-SNE maps of standardized genes that differ in their initialization dimensions. Both t-SNE-SSP and t-SNE-SGI maps are selected by A-value, which is modeled from the areas of clustering projections: the selected map is the t-SNE map with the smallest A-value.
Conclusions
From the analysis of cancer gene expression data sets, we demonstrate that the MG-PCC algorithm is able to put tumor and normal samples into their respective mini-groups, and that t-SNE-SSP (or t-SNE-SGI) maps are able to display the relationships between mini-groups (or PCC clusters) clearly. Furthermore, t-SNE-SS(m) (or t-SNE-SG(n)) maps are able to construct independent tree diagrams of the nearest sample (or gene) neighbors, where each tree diagram corresponds to a mini-group of samples (or genes).
Keywords
PCC, MG-PCC, t-SNE-SSP, t-SNE-SGI, A-value
Abbreviations
- 2D
two dimensional
- A-value
the criterion for quantifying the quality of projection maps
- MG-PCC
the algorithm using PCC to put the nearest neighbors into the same mini-groups
- N-genes
the normalized genes
- O-genes
the original genes
- PCA
the principal component analysis
- PCA-S
PCA of the standardized samples
- PCC
Pearson correlation coefficient
- S-genes
the standardized genes
- the standardized samples
the standardized samples
- t-SNE
t-distributed Stochastic Neighbor Embedding
- t-SNE-N
t-SNE of N-genes
- t-SNE-O
t-SNE of O-genes
- t-SNE-SG
t-SNE of standardized genes
- t-SNE-SGI
the t-SNE-SG map with the smallest A-value
- t-SNE-SS
t-SNE of standardized samples
- t-SNE-SSP
the t-SNE-SS map with the smallest A-value
Background
With the rapid development of high-throughput biotechnologies, large amounts of gene expression data can easily be collected for many subjects in biology and medicine [1]. Here, we focus on gene expression data sets that come from tumor and normal samples. These data sets are often characterized by a large number of genes but a relatively small number of samples, with rows corresponding to genes and columns representing samples [2]. Such data sets usually incorporate several thousand probes that are more or less relevant to cancer [3]. Thus, filtering approaches are applied to each probe before data analysis in order to find differentially expressed genes, such as T-statistics, Significance Analysis, Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering [4, 5]. For samples of gene expression data sets, a major challenge is how to resolve their subtypes and compare them across different disease states [4, 6]. Much work has been done on exploring subtypes of cancers, using methods such as hierarchical clustering, K-means, penalised likelihood methods and random forests [7, 8]. Moreover, to determine the intrinsic dimensionality of genes, clustering analysis is used to search for patterns and group genes into expression clusters, which provide additional insight into the biological function and relevance of genes that show different expression [9, 10, 11, 12, 13]. Furthermore, to display the classification of genes (or samples) in a way that is meaningful for exploration, presentation and comprehension of disease states and normal differentiation, many dimension reduction techniques are used to embed high-dimensional data for visualization in 2D (two-dimensional) spaces [14, 15, 16, 17]. Techniques such as hierarchical clustering dendrograms, PCA (principal component analysis), t-SNE, heat maps and network graphs have been successful in complementing clusters defined by Euclidean distance [14, 15, 16, 17, 18].
For samples of gene expression data sets, high dimensionality often makes different sample types nearly equidistant under Euclidean distance [9]. Thus, PCC is also commonly used in clustering analysis of samples and genes [10, 12, 13]. The simplest way to think about PCC is to plot the expression curves of two genes: PCC tells us how similar the shapes of the two curves are. However, many projection techniques usually give poor visualizations of PCC clusters of gene expression data [16]. To map PCC clusters efficiently, PCC has been defined on transformed genes, such as PCCF (PCC of F-points) and PCC-MCP (PCC of multiple-cumulative probabilities) [19, 20]. PCA-F and t-SNE-MCP-O give good visualizations of PCCF and PCC-MCP clusters, respectively. However, they also give poor visualizations of PCC clusters of the original gene expression points [19, 20].
Here, for samples of gene expression data sets, we used the MG-PCC algorithm to divide them into different mini-groups, where the similarities of samples were defined by the PCC measure. The idea of the MG-PCC algorithm is that nearest neighbors should be in the same mini-group: for any sample in a mini-group, its nearest neighbor is also in that mini-group. Moreover, we used t-SNE-SSP maps to display the relationships between mini-groups. A t-SNE-SSP map was selected from a series of t-SNE maps of standardized samples that had different perplexity parameters and a fixed initialization dimension of thirty. In t-SNE, the perplexity can be viewed as a knob that sets the number of effective nearest neighbors, comparable to the number of nearest neighbors employed in many manifold learners [21, 22].
Furthermore, for gene clusters generated from PCC, we used t-SNE-SGI maps to display them, where t-SNE-SGI maps were selected from a series of t-SNE maps of standardized genes. In contrast to t-SNE-SSP maps, a t-SNE-SGI map was selected from t-SNE maps that had the same perplexity parameter but different initialization dimensions, where the perplexity parameter was set to the dimension of the genes. The reason is that, in gene expression data sets under different samples, genes are numerous and dense, and t-SNE requires a larger perplexity to perform well on such data.
We used the A-value to select the t-SNE-SSP and t-SNE-SGI maps. The A-value was modeled from the areas of clustering projections, and a t-SNE map was selected as the t-SNE-SSP (or t-SNE-SGI) map if its A-value was the smallest among the candidate maps. Consequently, for clusterings with different numbers of clusters, the t-SNE-SGI maps might come from different t-SNE maps.
To evaluate the reliability of MG-PCC and t-SNE-SSP, we applied them to gene expression data sets of lung cancers [23, 24]. Results showed that the MG-PCC algorithm was able to put tumor and normal samples into their respective mini-groups, and that t-SNE-SSP maps gave these mini-groups clear boundaries, which helped us to mine the subtypes of cancers. Moreover, for PCC clusters of genes, t-SNE-SGI maps gave better visualizations than t-SNE maps of the original and normalized genes, which allowed clustering and visualization techniques to be integrated more closely. Furthermore, t-SNE-SS(m) (or t-SNE-SG(n)) maps were able to represent the nearest sample (or gene) neighbors as independent tree diagrams, where each tree diagram corresponded to a mini-group of samples (or genes).
Materials and methods
Data and data source
The first data set, GDS3837, provides insight into potential prognostic biomarkers and therapeutic targets for non-small cell lung carcinoma. It has 54674 genes and contains 60 normal and 60 tumor samples taken from nonsmoking females [23, 24]. The second data set, GDS3257, provides insight into the molecular basis of lung carcinogenesis induced by smoking. It has a total of 22283 genes and contains 107 samples taken from former, current and never smokers [23, 24]. Both GDS3837 and GDS3257 can be downloaded from NCBI’s GEO Database.
The details of A_{k}(k=1, 2, 3, 4 and 5)
A_{k} | No. of genes | No. of samples | Tumor samples (B_{k}) | Control samples (C_{k}) | P-value |
---|---|---|---|---|---|
A_{1} | 1355 | 31 | 16 TN-smokers (GDS3257) | 15 NN-smokers (GDS3257) | < 10^{−5} |
A_{2} | 1129 | 36 | 18 TF-smokers (GDS3257) | 18 NF-smokers (GDS3257) | < 10^{−5} |
A_{3} | 2055 | 40 | 24 TC-smokers (GDS3257) | 16 NC-smokers (GDS3257) | < 10^{−5} |
A_{4} | 817 | 76 | 18 TF-smokers, 24 TC-smokers (GDS3257) | 18 NF-smokers, 16 NC-smokers (GDS3257) | < 10^{−5}, < 10^{−5} ∗ |
A_{5} | 1739 | 120 | 60 TN-smokers (GDS3837) | 60 NN-smokers (GDS3837) | < 10^{−12} |
Methods
S-points
MG-PCC algorithm
where X_{j1},X_{j2},⋯,X_{j(t−1)} belong to the first mini-group. This step is repeated until no sample satisfies Eq. (5), at which point the first mini-group is complete.
The remaining samples repeat the above steps until all mini-groups are built. A mini-group is complete when no remaining sample satisfies Eq. (4) or (5); consequently, a mini-group contains at least two samples. Similarly, the MG-Euclidean algorithm can be used to construct mini-groups, where Euclidean distance is used to define the similarities of samples.
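The stated idea of the algorithm, that every sample ends up in the same mini-group as its nearest PCC neighbor, can be illustrated with a short sketch. This is not necessarily the step-by-step procedure of Eqs. (4) and (5), which is not reproduced here; it assumes that a mini-group can be read as a connected component of the nearest-neighbor graph, and the function name `mini_groups` is hypothetical.

```python
import numpy as np


def mini_groups(samples):
    """Group samples so that each sample shares a group with its nearest PCC neighbor.

    samples: array-like of shape (n_samples, n_genes), one row per sample.
    Returns a list of sorted index lists, one per mini-group.
    Assumption: a mini-group is a connected component of the nearest-neighbor graph.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    corr = np.corrcoef(samples)        # pairwise PCC between rows (samples)
    np.fill_diagonal(corr, -np.inf)    # a sample is not its own neighbor
    nearest = corr.argmax(axis=1)      # index of the most correlated other sample

    parent = list(range(n))            # union-find forest over sample indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Merge every sample with its nearest neighbor.
    for i, j in enumerate(nearest):
        ri, rj = find(i), find(int(j))
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because every sample is merged with its nearest neighbor, each group produced this way has at least two members, which agrees with the property stated above. Replacing `np.corrcoef` with a negated Euclidean distance matrix would give the analogous MG-Euclidean variant.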
The A-value
Here, a_{i} is the A-value of the i-th mini-group, a is the A-value of the data set, and v is the number of mini-groups.
In general, the convex hulls of adjacent mini-groups often overlap to some extent. Thus, the smaller the A-value, the more consistent the projections are with the original points.
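The formula for a_{i} and a is not reproduced in this text, so the sketch below should not be read as the A-value itself. It only illustrates the kind of area-based quantity involved: a hypothetical score, `area_score`, that divides the summed convex hull areas of the groups by the convex hull area of all projected points. Both the score and the function names are assumptions made for illustration.

```python
def hull_area(points):
    """Area of the convex hull of 2D points (Andrew's monotone chain + shoelace)."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        # Build one half of the hull, dropping points that make a non-left turn.
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half(pts) + half(reversed(pts))
    twice = sum(x1 * y2 - x2 * y1
                for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice) / 2.0


def area_score(points, labels):
    """Hypothetical score: summed per-group hull area over the hull area of all points.

    Assumes the points are not all collinear, so the total area is non-zero.
    """
    total = hull_area(points)
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    return sum(hull_area(g) for g in groups.values()) / total
```

Under this assumed score, compact and well-separated groups occupy a small fraction of the map and score low, which matches the direction described above, where a smaller value indicates a better projection.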
The t-SNE-SSP and t-SNE-SGI
Using t-SNE requires tuning some parameters, notably the perplexity and the initialization dimension. Although t-SNE results are fairly robust to parameter settings, in practice we still have to choose parameters interactively by visually comparing results under multiple settings. For mini-groups and clusters of samples generated from PCC, we empirically validate that t-SNE maps of the standardized samples with an appropriate perplexity can display them clearly, where the initialization dimension of these t-SNE maps is thirty. For PCC clusters of genes, in contrast, t-SNE maps of S-genes with an appropriate initialization dimension give good visualizations, where the perplexity parameter of these t-SNE maps is the dimension of the genes.
Here, for mini-groups and clusters of samples generated from PCC, the t-SNE-SSP map is selected by A-value from a series of t-SNE-SS(k) maps, where t-SNE-SS(k) is the t-SNE map of the standardized samples with an initialization dimension of thirty and perplexity k, and k ranges from 3 to 30. That is, t-SNE-SS(t) is selected as the t-SNE-SSP map if its A-value is the smallest among all t-SNE-SS(k). Similarly, for PCC clusters of genes, the t-SNE-SGI map is selected by A-value from a series of t-SNE-SG(i) maps, where t-SNE-SG(i) is the t-SNE map of S-genes with perplexity equal to the dimension of the genes and initialization dimension i, and i ranges from 3 to the dimension of the genes.
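The selection step amounts to a parameter sweep followed by taking the minimum of a score. The sketch below is one way to write it under stated assumptions: `select_map` and `tsne_ss_maps` are hypothetical names, the score is passed in as a callable (smaller is better), and the initialization dimension is assumed to mean a PCA reduction applied before t-SNE, which scikit-learn does not expose as a t-SNE parameter. The source does not say which t-SNE implementation was used.

```python
def select_map(candidates, labels, score):
    """Return (key, embedding, value) for the candidate with the smallest score.

    candidates: dict mapping a parameter value (e.g. perplexity k) to a 2D embedding.
    score: callable(embedding, labels) -> float, where smaller is better.
    """
    scored = {key: score(emb, labels) for key, emb in candidates.items()}
    best = min(scored, key=scored.get)
    return best, candidates[best], scored[best]


def tsne_ss_maps(samples, perplexities=range(3, 31), initial_dims=30, seed=0):
    """Hypothetical helper: one 2D t-SNE map per perplexity value.

    samples: NumPy array of shape (n_samples, n_genes), assumed already standardized.
    Requires scikit-learn; imported lazily so select_map works without it.
    Assumption: the initialization dimension is a PCA reduction before t-SNE.
    """
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    n_dims = min(initial_dims, *samples.shape)
    reduced = PCA(n_components=n_dims).fit_transform(samples)
    return {
        k: TSNE(n_components=2, perplexity=k, random_state=seed).fit_transform(reduced)
        for k in perplexities
        if k < len(samples)   # scikit-learn needs perplexity < n_samples
    }
```

A typical call would be `best_k, best_map, best_value = select_map(tsne_ss_maps(X), labels, score)`, with `score` standing in for the A-value. The analogous sweep for genes would fix the perplexity and vary `initial_dims` instead.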
Accuracy, F-Measure, RI and NMI
Since t-SNE maps give good visualizations of clusters defined by Euclidean distance, they can also complement PCC clusters that are relatively consistent with the Euclidean ones. Here, we use Accuracy, F-Measure, RI (Rand index) and NMI (Normalized mutual information) [25, 26] (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html) to evaluate the consistency between PCC clusters and Euclidean-distance clusters, where the Euclidean-distance clusters are treated as the gold standard for genes. In general, Accuracy is a simple and transparent evaluation measure, RI penalizes both false positive and false negative decisions during clustering, F-Measure additionally supports differential weighting of these two types of errors, and NMI can be interpreted information-theoretically. Detailed explanations of these four criteria are given in [25, 26] and at the URL above, and their MATLAB codes are available in Additional file 1. The higher the values of these four criteria, the more consistent the PCC and Euclidean-distance clusters are.
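As an illustration of two of these criteria, the sketch below computes the unadjusted Rand index and the pair-counting F-Measure as described at the URL above, treating one labeling as the gold standard. It is a minimal Python sketch rather than the MATLAB code of Additional file 1, and the function names are chosen here for illustration.

```python
from collections import Counter
from math import comb


def pair_counts(pred, gold):
    """Count item pairs as (TP, FP, FN, TN) for a clustering against a gold standard."""
    n = len(pred)
    same_both = sum(comb(c, 2) for c in Counter(zip(pred, gold)).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    same_gold = sum(comb(c, 2) for c in Counter(gold).values())
    tp = same_both                  # same cluster, same class
    fp = same_pred - tp             # same cluster, different class
    fn = same_gold - tp             # different cluster, same class
    tn = comb(n, 2) - tp - fp - fn  # different cluster, different class
    return tp, fp, fn, tn


def rand_index(pred, gold):
    """Fraction of pair decisions that are correct (unadjusted Rand index)."""
    tp, fp, fn, tn = pair_counts(pred, gold)
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(pred, gold, beta=1.0):
    """Pair-counting F-Measure; beta > 1 weights recall more than precision."""
    tp, fp, fn, _ = pair_counts(pred, gold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```

For example, with `pred` holding the PCC cluster labels of the genes and `gold` the Euclidean-distance cluster labels, `rand_index(pred, gold)` and `f_measure(pred, gold)` give two of the four consistency values.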
Results
The reliability of mini-groups
Statistics of the mini-groups of 5 sample data sets
data-i | Algorithm | No. of mini-groups | No. of misjudged tumor samples | No. of misjudged normal samples |
---|---|---|---|---|
data-1 | MG-PCC | 4 | 0 | 0 |
data-1 | MG-Euclidean-1 | 3 | 0 | 0 |
data-1 | MG-Euclidean-2 | 4 | 0 | 0 |
data-1 | MG-Euclidean-3 | 4 | 0 | 0 |
data-2 | MG-PCC | 7 | 0 | 3 |
data-2 | MG-Euclidean-1 | 6 | 0 | 3 |
data-2 | MG-Euclidean-2 | 6 | 0 | 2 |
data-2 | MG-Euclidean-3 | 7 | 0 | 0 |
data-3 | MG-PCC | 5 | 0 | 0 |
data-3 | MG-Euclidean-1 | 5 | 0 | 0 |
data-3 | MG-Euclidean-2 | 4 | 0 | 0 |
data-3 | MG-Euclidean-3 | 6 | 0 | 0 |
data-4 | MG-PCC | 13 | 0 | 1 |
data-4 | MG-Euclidean-1 | 13 | 0 | 1 |
data-4 | MG-Euclidean-2 | 12 | 0 | 1 |
data-4 | MG-Euclidean-3 | 10 | 0 | 0 |
data-5 | MG-PCC | 23 | 2 | 2 |
data-5 | MG-Euclidean-1 | 23 | 2 | 3 |
data-5 | MG-Euclidean-2 | 24 | 2 | 2 |
data-5 | MG-Euclidean-3 | 24 | 0 | 4 |
The clustering feature of S-genes
Statistics of Accuracy, F-Measure, RI and NMI of S-genes, N-genes and O-genes
Data | No. of clusters | Genes | Accuracy | F-Measure | RI | NMI |
---|---|---|---|---|---|---|
data-6 | 6 | S-genes | 0.956 | 0.956 | 0.323 | 0.888 |
data-6 | 7 | S-genes | 0.794 | 0.788 | 0.092 | 0.672 |
data-6 | 8 | S-genes | 0.803 | 0.799 | 0.094 | 0.660 |
data-6 | 6 | N-genes | 0.321 | 0.267 | -0.084 | 0.422 |
data-6 | 6 | O-genes | 0.318 | 0.263 | -0.081 | 0.240 |
data-7 | 6 | S-genes | 0.901 | 0.897 | 0.237 | 0.744 |
data-7 | 7 | S-genes | 0.825 | 0.814 | 0.119 | 0.742 |
data-7 | 8 | S-genes | 0.811 | 0.813 | 0.096 | 0.683 |
data-7 | 6 | N-genes | 0.305 | 0.271 | -0.079 | 0.487 |
data-7 | 6 | O-genes | 0.308 | 0.264 | -0.079 | 0.284 |
data-8 | 5 | S-genes | 0.956 | 0.956 | 0.293 | 0.911 |
data-8 | 6 | S-genes | 0.947 | 0.947 | 0.257 | 0.782 |
data-8 | 7 | S-genes | 0.937 | 0.937 | 0.181 | 0.733 |
data-8 | 8 | S-genes | 0.884 | 0.882 | 0.137 | 0.718 |
data-8 | 6 | N-genes | 0.275 | 0.251 | -0.084 | 0.540 |
data-8 | 6 | O-genes | 0.281 | 0.249 | -0.089 | 0.340 |
data-9 | 5 | S-genes | 0.977 | 0.977 | 0.503 | 0.842 |
data-9 | 6 | S-genes | 0.827 | 0.830 | 0.126 | 0.814 |
data-9 | 7 | S-genes | 0.947 | 0.947 | 0.374 | 0.741 |
data-9 | 8 | S-genes | 0.962 | 0.962 | 0.575 | 0.693 |
data-9 | 9 | S-genes | 0.893 | 0.891 | 0.224 | 0.692 |
data-9 | 8 | N-genes | 0.258 | 0.245 | -0.070 | 0.442 |
data-9 | 8 | O-genes | 0.229 | 0.225 | -0.086 | 0.3043 |
data-10 | 3 | S-genes | 0.901 | 0.897 | 0.132 | 0.902 |
data-10 | 4 | S-genes | 0.829 | 0.824 | 0.083 | 0.807 |
data-10 | 5 | S-genes | 0.716 | 0.669 | 0.054 | 0.601 |
data-10 | 6 | S-genes | 0.779 | 0.776 | 0.115 | 0.714 |
data-10 | 7 | S-genes | 0.895 | 0.894 | 0.092 | 0.683 |
data-10 | 8 | S-genes | 0.593 | 0.623 | -0.008 | 0.674 |
data-10 | 3 | N-genes | 0.299 | 0.290 | -0.087 | 0.637 |
data-10 | 3 | O-genes | 0.277 | 0.203 | -0.081 | 0.088 |
In general, for data with a normal distribution, the patterns revealed by clusters under PCC and under Euclidean distance roughly agreed with each other. For O-genes and N-genes of complex gene expression data sets, however, results showed that their PCC and Euclidean clusters differed significantly.
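One way to see why PCC and Euclidean clusters can agree after standardization is a short numerical check. The sketch assumes that standardization means z-scoring each vector with its own mean and population standard deviation; the definition of S-points is not reproduced in this text, so this is an assumption. Under it, the squared Euclidean distance between two standardized vectors of length n equals 2n(1 − PCC), so ranking neighbors by distance is the same as ranking them by PCC, and PCC itself is unchanged by the transformation.

```python
import numpy as np


def standardize(v):
    """Z-score a vector with its own mean and population standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()   # np.std defaults to ddof=0


# Two illustrative expression profiles (made-up numbers, not from the data sets).
x = np.array([2.0, 4.0, 4.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 5.0])

sx, sy = standardize(x), standardize(y)

r = np.corrcoef(x, y)[0, 1]           # PCC of the original vectors
r_std = np.corrcoef(sx, sy)[0, 1]     # PCC of the standardized vectors
d2 = np.sum((sx - sy) ** 2)           # squared Euclidean distance after standardizing

print(np.isclose(r, r_std))                   # PCC is unchanged by standardization
print(np.isclose(d2, 2 * len(x) * (1 - r)))   # distance is a monotone function of PCC
```

No such relation holds for the original or min-max normalized vectors, which is consistent with the large differences between PCC and Euclidean clusters reported for O-genes and N-genes.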
The reliability of A-value
Selecting t-SNE-SSP maps by A-value
Moreover, for the 2 clusters of data-4 and data-5 corresponding to normal and tumor samples, the A-values of the different t-SNE-SS(k) maps were also shown in Fig. 2 (a) and (b), where these A-values were drawn as red lines. Figure 2 (a) and (b) showed that t-SNE-SS(30) was not the optimal 2D map for either data set. For the 2 clusters of data-5, the A-value of t-SNE-SS(30) was 0.58779, while that of t-SNE-SS(18) was 0.52373. That is, t-SNE-SS(18) was more appropriate for displaying the 2 clusters of data-5.
Selecting t-SNE-SGI maps by A-value
Here, for gene clusters of data-7 and data-9, the A-values of the t-SNE-SG(i) maps were shown in Fig. 2 (c) and (d), respectively, where the O-genes of each data set were divided into 3 and 5 clusters by K-means with PCC. Figure 2 (c) and (d) showed that the t-SNE-SG(m) maps were not the optimal 2D maps for any clustering result: t-SNE-SG(4) was the t-SNE-SGI map for the 3 clusters of both data-7 and data-9, t-SNE-SG(7) was the t-SNE-SGI map for the 5 clusters of data-7, and t-SNE-SG(8) was the t-SNE-SGI map for the 5 clusters of data-9.
By Accuracy, F-Measure, RI and NMI, we demonstrated that the PCC and Euclidean clusters of S-genes were relatively consistent, which enabled t-SNE-SG(i) maps to display PCC clusters. However, a t-SNE map with randomly chosen parameters could give a poor visualization of PCC clusters, which could lead to misinterpretation of the clusters. Here, we used the A-value to quantify the quality of t-SNE-SG(i) maps, which enabled t-SNE-SGI maps to project genes of the same cluster together and neighboring clusters into adjacent regions.
The biological reliability of t-SNE-SSP maps
The consistency between MG-PCC algorithm and t-SNE-SSP maps
Comparison of t-SNE-SSP and PCA-S
In fact, the optimization criterion of PCA depicts the relationships between distant points as accurately as possible, while small inter-point distances might be distorted [14]. Moreover, there might be no single linear projection that gives a good view of most gene expression data [14]. Thus, for complex gene expression data sets, many linear projection methods might fail.
The reliability of t-SNE-SGI maps
Compared to K-means clustering analysis, the MG-PCC algorithm does not need to estimate the number of clusters. For genes, however, the MG-PCC algorithm generates a large number of mini-groups, which can place genes with similar biological function into different mini-groups. Thus, the MG-PCC algorithm is not appropriate for clustering genes.
Comparison of t-SNE-SGI, t-SNE-N and t-SNE-O maps
For PCC clusters of data-6, data-9 and data-10, t-SNE-N and t-SNE-O maps also gave poor visualizations. The reason was that the PCC and Euclidean clusters of O-genes and N-genes differed significantly.
Constructing the nearest sample neighbor map by t-SNE-SS(m)
Figure 8 showed that the sample neighbors formed several independent tree diagrams, each of which corresponded to a mini-group of samples. Thus, the combination of the t-SNE-SS(m) map and the MG-PCC algorithm was able to help us search for subtypes of samples.
Constructing the nearest gene neighbor map by t-SNE-SG(n)
Based on GDS3837, GDS3257 and GDS3054, nine differentially expressed genes associated with lung cancer had been extracted. These nine genes were independent of smoking: AGER, CA4, EDNRB, FAM107A, GPM6A, NPR1, PECAM1, RASIP1 and TGFBR3 [16]. Here, we used the t-SNE-SG(n) map to display the nine mini-groups that contained these nine genes (Fig. 9(b)). The nine independent tree diagrams in Fig. 9(b) might help us search for correlated genes.
Discussion
For samples of cancer gene expression data sets, there is usually no clear boundary between subtypes of samples [7]. The reason is that the high dimensionality of samples often makes the different subtypes nearly equidistant [9]. Here, we use the MG-PCC algorithm to divide samples into mini-groups, and results show that the algorithm can put tumor and normal samples into their respective mini-groups. Because the MG-PCC algorithm puts nearest neighbors in the same mini-group, it can distinguish inconspicuous differences between subtypes of samples. However, when the MG-PCC algorithm is applied to genes, it generates a large number of mini-groups. That is, genes with similar expression patterns may be put into different mini-groups, which makes it difficult to group genes with similar biological function together. The reason is that the MG-PCC algorithm does not presuppose the number of mini-groups, and similar genes are not necessarily nearest neighbors. Moreover, with a large number of mini-groups, any dimension reduction technique may give a messy visualization of the entire data set. Thus, the MG-PCC algorithm is not appropriate for dividing genes.
To display mini-groups of samples generated by the MG-PCC algorithm efficiently, we first verify that PCC and Euclidean clusters of the standardized samples are more consistent than those of the original and normalized samples, and that the PCC of the standardized samples is the same as that of the original and normalized ones. Since t-SNE maps have been successful in displaying clusters defined by Euclidean distance, t-SNE maps of the standardized samples can also give good visualizations of mini-groups. However, t-SNE maps of the standardized samples differ significantly under different parameters, and most of them give poor visualizations of mini-groups. To select the optimal t-SNE map of mini-groups, we then construct t-SNE-SSP maps, which are selected from t-SNE maps of the standardized samples with different perplexity parameters. Results show that t-SNE-SSP maps give good visualizations of both mini-groups and PCC clusters of samples. However, when t-SNE-SSP maps are used to display PCC clusters of genes, they give fuzzy visualizations. The reason may be that the dimension of the samples is far larger than that of the genes. To map PCC clusters of genes efficiently, we construct t-SNE-SGI maps, which are selected from t-SNE maps of the standardized genes with different initialization dimensions. Using several cancer gene expression data sets, we verify that t-SNE-SGI maps give good visualizations of PCC clusters of genes. Furthermore, we use t-SNE-SS(m) and t-SNE-SG(n) maps to display the nearest neighbors of samples and genes, respectively, which makes the relationships between samples (or genes) easy to visualize and understand. Overall, these four types of t-SNE maps make cancer gene expression data sets easy and intuitive to interpret.
Conclusion
In this article, we use the MG-PCC algorithm to divide samples of gene expression data sets into mini-groups, and t-SNE-SSP maps to display the relationships between these mini-groups. Moreover, we provide t-SNE-SGI maps to display PCC clusters of genes, and t-SNE-SS(m) and t-SNE-SG(n) maps to display the nearest neighbors of samples and genes, respectively. Used together, the MG-PCC algorithm and these four types of t-SNE maps can help us understand entire gene expression data sets.
Notes
Acknowledgements
This work rests almost entirely on open data, and we gratefully acknowledge the contributors. Moreover, we deeply thank Mrs Xianchun Sun (Haidian district, Beijing garrison district, the fourth leaving cadre rehabilitation center) and Miss Tian wei (Nanjing NO.9 High School, PR China) for carefully reviewing our manuscript.
Funding
This work was supported by Major Program of National Natural Science Foundation of China (2016YFA0501600).
Availability of data and materials
The data sets were collected from the NCBI database. A more detailed description of the data sets is included in the “Materials and methods” section of the article.
Authors’ contributions
XJ analyzed and discussed the model, and wrote the manuscript. QH performed a portion of the model. ZL supervised the study. All co-authors actively commented on and improved the manuscript, and read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declared that they had no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Brazma A, Vilo J. Gene expression data analysis. FEBS Lett. 2000; 480(1):17–24.
- 2. Yu X, Yu G, Wang J. Clustering cancer gene expression data by projective clustering ensemble. PLoS ONE. 2017; 12(2):e171429.
- 3. Grimes ML, Lee WJ, van der Maaten L, Shannon P. Wrangling phosphoproteomic data to elucidate cancer signaling pathways. PLoS ONE. 2013; 8(1):e52884.
- 4. Shaik JS, Yeasin M. A unified framework for finding differentially expressed genes from microarray experiments. BMC Bioinformatics. 2007; 8:347.
- 5. Kong X, Mas V, Archer KJ. A non-parametric meta-analysis approach for combining independent microarray datasets: application using two microarray datasets pertaining to chronic allograft nephropathy. BMC Genomics. 2008; 9:98.
- 6. Cavalli F, Hubner JM, Sharma T, Luu B, Sill M, Zapotocky M, Mack SC, Witt H, Lin T, Shih D, et al. Heterogeneity within the PF-EPN-B ependymoma subgroup. Acta Neuropathol. 2018; 136(2):227–37.
- 7. Tishchenko I, Milioli HH, Riveros C, Moscato P. Extensive Transcriptomic and Genomic Analysis Provides New Insights about Luminal Breast Cancers. PLoS ONE. 2016; 11(6):e158259.
- 8. Zucknick M, Richardson S, Stronach EA. Comparing the characteristics of gene expression profiles derived by univariate and multivariate classification methods. Stat Appl Genet Mol Biol. 2008; 7(1):e7.
- 9. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999; 286(5439):531–7.
- 10. Yao J, Chang C, Salmi ML, Hung YS, Loraine A, Roux SJ. Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient. BMC Bioinformatics. 2008; 9:288.
- 11. Roche KE, Weinstein M, Dunwoodie LJ, Poehlman WL, Feltus FA. Sorting Five Human Tumor Types Reveals Specific Biomarkers and Background Classification Genes. Sci Rep. 2018; 8(1):8180.
- 12. Jaskowiak PA, Campello RJ, Costa IG. On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics. 2014; 15(Suppl 2):S2.
- 13. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998; 95(25):14863–8.
- 14. Bushati N, Smith J, Briscoe J, Watkins C. An intuitive graphical visualization technique for the interrogation of transcriptome data. Nucleic Acids Res. 2011; 39(17):7380–9.
- 15. Gehlenborg N, O’Donoghue SI, Baliga NS, Goesmann A, Hibbs MA, Kitano H, Kohlbacher O, Neuweger H, Schneider R, Tenenbaum D, et al. Visualization of omics data for systems biology. Nat Methods. 2010; 7(3 Suppl):S56–68.
- 16. Sanguinetti G. Dimensionality reduction of clustered data sets. IEEE Trans Pattern Anal Mach Intell. 2008; 30(3):535–40.
- 17. Huisman S, van Lew B, Mahfouz A, Pezzotti N, Hollt T, Michielsen L, Vilanova A, Reinders M, Lelieveldt B. BrainScope: interactive visual exploration of the spatial and temporal human brain transcriptome. Nucleic Acids Res. 2017; 45(10):e83.
- 18. Tzeng WP, Frey TK. Mapping the rubella virus subgenomic promoter. J Virol. 2002; 76(7):3189–201.
- 19. Jia X, Zhu G, Han Q, Lu Z. The biological knowledge discovery by PCCF measure and PCA-F projection. PLoS ONE. 2017; 12(4):e175104.
- 20. Jia X, Liu Y, Han Q, Lu Z. Multiple-cumulative probabilities used to cluster and visualize transcriptomes. FEBS Open Bio. 2017; 7(12):2008–20.
- 21. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008; 9:2579–605.
- 22. Xu W, Jiang X, Hu X, Li G. Visualization of genetic disease-phenotype similarities by multiple maps t-SNE with Laplacian regularization. BMC Med Genomics. 2014; 7(Suppl 2):S1.
- 23. Lu TP, Tsai MH, Lee JM, Hsu CP, Chen PC, Lin CW, Shih JY, Yang PC, Hsiao CK, Lai LC, et al. Identification of a novel biomarker, SEMA5A, for non-small cell lung carcinoma in nonsmoking women. Cancer Epidemiol Biomarkers Prev. 2010; 19(10):2590–7.
- 24. Hasan AN, Ahmad MW, Madar IH, Grace BL, Hasan TN. An in silico analytical study of lung cancer and smokers datasets from gene expression omnibus (GEO) for prediction of differentially expressed genes. Bioinformation. 2015; 11(5):229–35.
- 25. Milligan GW, Cooper MC. A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. Multivariate Behav Res. 1986; 21(4):441–58.
- 26. Estevez PA, Tesmer M, Perez CA, Zurada JM. Normalized mutual information feature selection. IEEE Trans Neural Netw. 2009; 20(2):189–201.
- 27. Bruse JL, Zuluaga MA, Khushnood A, McLeod K, Ntsinjana HN, Hsia TY, Sermesant M, Pennec X, Taylor AM, Schievano S. Detecting Clinically Meaningful Shape Clusters in Medical Image Data: Metrics Analysis for Hierarchical Clustering Applied to Healthy and Pathological Aortic Arches. IEEE Trans Biomed Eng. 2017; 64(10):2373–83.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.