Interactive visual exploration and refinement of cluster assignments
With ever-increasing amounts of data produced in biology research, scientists are in need of efficient data analysis methods. Cluster analysis, combined with visualization of the results, is one such method that can be used to make sense of large data volumes. At the same time, cluster analysis is known to be imperfect and depends on the choice of algorithms, parameters, and distance measures. Most clustering algorithms don’t properly account for ambiguity in the source data, as records are often assigned to discrete clusters, even if an assignment is unclear. While there are metrics and visualization techniques that allow analysts to compare clusterings or to judge cluster quality, there is no comprehensive method that allows analysts to evaluate, compare, and refine cluster assignments based on the source data, derived scores, and contextual data.
In this paper, we introduce a method that explicitly visualizes the quality of cluster assignments, allows comparisons of clustering results and enables analysts to manually curate and refine cluster assignments. Our methods are applicable to matrix data clustered with partitional, hierarchical, and fuzzy clustering algorithms. Furthermore, we enable analysts to explore clustering results in context of other data, for example, to observe whether a clustering of genomic data results in a meaningful differentiation in phenotypes.
Our methods are integrated into Caleydo StratomeX, a popular, web-based, disease subtype analysis tool. We show in a usage scenario that our approach can reveal ambiguities in cluster assignments and produce improved clusterings that better differentiate genotypes and phenotypes.
Keywords: Cluster analysis, Visualization, Biology visualization, Omics data
The Cancer Genome Atlas
Rapid improvement of data acquisition technologies and the fast growth of data collections in the biological sciences increase the need for advanced analysis methods and tools to extract meaningful information from the data. Cluster analysis is a method that can help make sense of large data and has played an important role in data mining for many years. Its purpose is to divide large datasets into meaningful subsets (clusters) of elements. The clusters then can be used for aggregation, ordering, or, in biology, to describe samples in terms of subtypes and to derive biomarkers. Clustering is ubiquitous in biological data analysis and applied to gene expression, copy number, and epigenetic data, as well as biological networks or text documents, to name just a few application areas.
A cluster is a group of similar items, where similarity is based on comparing data items using a measure of similarity. Cluster analysis is part of the standard toolbox for biology researchers, and there is a myriad of different algorithms designed for various purposes and with differing strengths and weaknesses. For example, clustering can be used to identify functionally related genes based on gene expression, or to categorize samples into disease subtypes. Since Eisen et al. introduced cluster analysis for gene expression in 1998, it has been widely used to classify both genes and samples in a variety of biological datasets [2, 3, 4, 5].
However, while clustering is useful, it is not always simple to use. Scientists have to deal with several challenges: the choice of an algorithm for a particular dataset, the parameters for these algorithms (e.g., the number of expected clusters), and the choice of a suitable similarity metric. All of these choices depend on the dataset and on the goals of the analysis. Also, methods generally suitable for a dataset can be sensitive to noise and outliers in the data and produce poor results for a high number of dimensions.
Several (semi)automated cluster validation, optimization, and evaluation techniques have been introduced to address the basic challenges of clustering and to determine the amount of concordance among certain outcomes (e.g., [6, 7, 8]). These methods try to examine the robustness of clustering results and guess the actual number of clusters. This task is often accompanied by visualizations of these measures by histograms or line graphs. Consensus clustering  addresses the task of detecting the number of clusters and attaining confidence in cluster assignments. It applies clustering algorithms to multiple perturbed subsamples of datasets and computes a consensus and correlation matrix from these results to measure concordance among them, and explores the stability of different techniques. These matrices are plotted both as histograms and two-dimensional graphs to assist scientists in the examination process.
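The consensus matrix at the core of consensus clustering can be illustrated with a short sketch. It assumes that each clustering run is given as a dictionary mapping the sampled item identifiers to cluster labels; the function name `consensus_matrix` and the toy data are hypothetical and not taken from any of the cited tools.

```python
from itertools import combinations

def consensus_matrix(runs, items):
    """For each pair of items, return the fraction of runs in which both
    items were assigned to the same cluster, counting only the runs in
    which both items were part of the subsample."""
    together = {}  # pair -> number of runs where the pair shares a cluster
    sampled = {}   # pair -> number of runs where both items were sampled
    for labels in runs:
        present = [i for i in items if i in labels]
        for a, b in combinations(present, 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            if labels[a] == labels[b]:
                together[(a, b)] = together.get((a, b), 0) + 1
    return {pair: together.get(pair, 0) / n for pair, n in sampled.items()}

# Three hypothetical runs on perturbed subsamples of four samples.
runs = [
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 1, "s2": 1, "s4": 0},
    {"s1": 0, "s3": 0, "s4": 1},
]
consensus = consensus_matrix(runs, ["s1", "s2", "s3", "s4"])
```

Values close to 1 or 0 indicate stable pairwise assignments, whereas intermediate values point to ambiguous ones.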
Although cluster validation is a useful method to examine clustering algorithms, it is not guaranteed to recover the actual or desired number of clusters for each data type. In particular, cluster validation cannot compensate for the weaknesses of a clustering algorithm that is not suitable for a given dataset.
While knowledge about clustering algorithms and their strengths and weaknesses, as well as automated validation methods, is helpful in picking a good initial configuration, trying out various algorithms and parametrizations is critical in the analysis process. For that reason, scientists usually conduct multiple runs of clustering algorithms with different parameters and compare the varying results while examining the concordance or discordance among them.
In this paper we introduce methods to evaluate and compare clustering results. We focus on revealing specificity or ambiguity of cluster assignments and embed our contributions in StratomeX [10, 11], a framework for stratification and disease subtype analysis that is also well suited to cluster comparison. Furthermore, we enable analysts to manually refine clusters and the underlying cluster assignments to improve ambiguous clusters. They can transfer entities to better fit clusters, merge similar clusters, and exclude groups of elements assumed to be outliers. An important aspect of this interactive process is that these operations can be informed by considering data that was not used to run the clustering: when considering cluster refinements, we can immediately show the impact on, for example, average patient survival.
In our tool, users are able to conduct multiple runs of clustering algorithms with full control over parametrization, and to simultaneously examine conspicuous patterns in heatmaps and quantify the quality and confidence of cluster assignments. Our measures of cluster fit are independent of the underlying stratification/clustering technique and allow investigators to set thresholds to classify parts of a cluster as either reliable, uncertain, or a bad fit. We apply our methods to matrices of genomic datasets, which cover a large and important class of datasets and clustering applications.
We evaluate our tool based on a usage scenario with gene expression data from The Cancer Genome Atlas and demonstrate how visual inspection and manual refinement can be used to identify new clusters.
In the following, we briefly introduce clustering algorithms and their properties, as well as StratomeX, the framework we used and extended for this research, and other relevant related work.
Clustering algorithms assign data to groups of similar elements. The two most common classes of algorithms are partitional and hierarchical clustering algorithms; less frequently used are probabilistic or fuzzy clustering algorithms.
Partitional algorithms decompose data into non-overlapping partitions that optimize a distance function, for example by reducing the sum of squared error metric with respect to Euclidean distance. Based on that, they either attempt to iteratively create a user-specified number of clusters, as in k-Means, or they utilize advanced methods to guess the number of clusters implicitly, such as Affinity Propagation.
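The iterative scheme behind k-Means can be sketched in a few lines. This is a minimal illustration and not the implementation used in our tool; in particular, initializing the centroids with the first k points is a simplifying assumption made to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, iterations=100):
    # Assumption: naive initialization with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: recompute each centroid from its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

points = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
labels, centroids = k_means(points, k=2)
```

Each assignment and update step cannot increase the sum of squared errors, which is why the procedure converges to a local optimum that depends on the initialization.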
In contrast to that, hierarchical clustering algorithms generate a tree of similar records by either merging smaller clusters into larger ones (agglomerative approach) or splitting groups into smaller clusters (divisive). In the resulting binary tree, commonly represented with a dendrogram, each leaf node represents a record, each inner node represents a cluster as the union of its children. Inner nodes commonly also store a measure of similarity among their children. By cutting the tree at a threshold, we are able to obtain discrete clusters from the similarity tree.
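As an illustration of how a threshold turns a similarity tree into discrete clusters, the following sketch performs agglomerative clustering and stops merging once the closest pair of clusters is farther apart than the threshold. Single linkage and Euclidean distance are assumptions chosen for brevity; for single linkage, stopping early in this way yields the same groups as cutting the full dendrogram at that height. The function name `single_linkage_cut` is hypothetical.

```python
def single_linkage_cut(points, threshold):
    """Agglomerative single-linkage clustering stopped at a distance threshold.
    Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    def linkage(c1, c2):
        # Single linkage: distance between the closest members.
        return min(dist(i, j) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > threshold:
            break  # all remaining merges lie above the cut
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]
fine = single_linkage_cut(points, threshold=2.0)
coarse = single_linkage_cut(points, threshold=10.0)
```

Raising the threshold merges more records into fewer clusters, which corresponds to moving the cut toward the root of the dendrogram.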
These approaches use a deterministic cluster assignment, i.e., elements are assigned exclusively to one cluster and are not in other clusters. In contrast, fuzzy clustering uses a probabilistic assignment approach and allows entities to belong to multiple clusters. The degree of membership is described by weights, with values between 0 (no membership at all) and 1 (unique membership to one cluster). These weights, which are commonly called probabilities, capture the likelihood of an element belonging to a certain partition. A prominent example algorithm is Fuzzy c-Means .
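To make the notion of membership weights concrete, the sketch below computes the weights of a single element with the standard Fuzzy c-Means membership formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to centroid i and m > 1 is the fuzzifier. The centroids are assumed to be given, and the function name `fuzzy_memberships` is hypothetical.

```python
def fuzzy_memberships(point, centroids, m=2.0):
    """Membership weights of one point with respect to fixed centroids,
    following the Fuzzy c-Means membership update (fuzzifier m > 1)."""
    dists = [
        sum((x - c) ** 2 for x, c in zip(point, centroid)) ** 0.5
        for centroid in centroids
    ]
    # A point that coincides with a centroid belongs entirely to it.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]

centroids = [(0.0, 0.0), (4.0, 0.0)]
clear = fuzzy_memberships((1.0, 0.0), centroids)      # close to the first centroid
ambiguous = fuzzy_memberships((2.0, 0.0), centroids)  # halfway between both
```

The weights of an element sum to 1; an element halfway between two centroids receives equal weights, which is exactly the kind of ambiguity that a discrete assignment hides.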
Clustering algorithms make use of a measure of similarity or dissimilarity between pairs of elements. They aim to maximize pair-wise similarity or minimize pair-wise dissimilarity by using either geometrical distances or correlation measures. A popular way to define similarity is a measure of geometric distance based on, for example, squared Euclidean or Manhattan distance. These measures work well for “spherical” and “isolated” groups in the data  but are less well suited for other shapes and overlapping clusters. More sophisticated methods measure the cross-correlation or statistical relationship between two vectors. They compute correlation coefficients that denote the type of concordance and dependence among pairs of elements. The coefficients range from -1 (opposite or negative correlation) to 1 (perfect or positive correlation), whereas zero values denote that there is no relationship between two elements. The most commonly used coefficient in that context is the Pearson product-moment correlation coefficient that measures the linear relationship by means of the covariance of two variables. Spearman’s rank correlation coefficient is another approach to estimate concordance similar to Pearson’s but uses ranks or scores for data to compute covariances.
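The difference between the two coefficients can be seen in a small sketch. It implements Pearson's coefficient directly from its definition and Spearman's coefficient as Pearson's coefficient applied to ranks, with tied values sharing their average rank; the helper names are hypothetical.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]  # monotone but non-linear in x
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

For this monotone but non-linear relationship, the rank-based coefficient is 1 while Pearson's coefficient stays slightly below 1, because only the latter is sensitive to the deviation from linearity.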
The choice of distance measure has an important impact on the clustering results, as it drives an algorithm’s determination of similarity between elements. At the same time, we can also use distance measures to identify the fit of an element to a cluster, by, for example, measuring the distance of an element to the cluster centroid. In doing so, we do not necessarily need to use the same measure that was used for the clustering in the first place. In our technique, we visualize this information for all elements in a cluster, to communicate the quality of fit to a cluster.
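A minimal sketch of such a fit measure is given below. It computes the distance of every element to the centroid of its cluster and applies two thresholds to label elements as reliable, uncertain, or a bad fit. The function name `classify_fit`, the threshold parameters, and the toy values are assumptions made for illustration; the distance function is a parameter because it need not be the one used for the clustering.

```python
def centroid(rows):
    """Component-wise mean of the rows of one cluster."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify_fit(rows, good_below, bad_above, distance=euclidean):
    """Return (distance to centroid, label) for every row of a cluster.
    The two thresholds are chosen by the analyst (hypothetical parameters)."""
    center = centroid(rows)
    result = []
    for row in rows:
        d = distance(row, center)
        if d <= good_below:
            label = "reliable"
        elif d >= bad_above:
            label = "bad fit"
        else:
            label = "uncertain"
        result.append((d, label))
    return result

cluster_rows = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (11.0, 11.0)]
fits = classify_fit(cluster_rows, good_below=3.5, bad_above=10.0)
fit_labels = [label for _, label in fits]
```

Note that a single distant element shifts the centroid and thereby also worsens the apparent fit of the remaining elements, which is one reason to inspect the distances together with the underlying data.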
StratomeX visualizes stratifications of samples (patients) as rows (records) based on various attributes, such as clinical variables like gender or tumor staging, bins of numerical vectors, such as binned values of copy number alterations, or clusters of matrices/heat maps. Within these heat maps, the columns correspond to e.g., differentially expressed genes. StratomeX combines the visual metaphor used in parallel sets , with visualizations of the underlying data . Each dataset is shown as a column. A header block at the top shows the distribution of the whole dataset, while groups of patients are shown as blocks in the columns. Relationships between blocks are visualized by ribbons whose thickness represents the number of patients shared across two bricks. This method can be used to visualize relationships between groupings and clusterings of different data, but can equally be used to compare multiple clusterings of the same dataset.
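The quantity encoded by the ribbons can be computed as a contingency count between two stratifications. The sketch below assumes that each stratification is given as a dictionary from patient identifier to group name; the function name `ribbon_widths` and the example data are hypothetical and do not reflect the StratomeX code base.

```python
from collections import Counter

def ribbon_widths(left, right):
    """Count the patients shared by each pair of groups in two neighboring
    columns. Patients missing from either column are ignored."""
    shared = left.keys() & right.keys()
    return Counter((left[p], right[p]) for p in shared)

# Two hypothetical stratifications of the same four patients.
left = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
right = {"p1": "X", "p2": "Y", "p3": "Y", "p4": "Y"}
widths = ribbon_widths(left, right)
```

A cluster that is stable across two clusterings shows up as a single thick ribbon, whereas a cluster that is split shows up as several thinner ones.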
StratomeX also integrates the visualization of “dependent data” by using the stratification of a neighboring column for a different dataset. This is commonly used to visualize survival data in Kaplan-Meier plots for a particular stratification, or to visualize expression of a patient cluster in a particular biological pathway.
There are several tools to analyze clustering results and assess the quality of clustering algorithms. A common approach to evaluate clustering results is to visualize the underlying data: heatmaps , for example, enable users to judge how consistent a pattern is within a cluster for high-dimensional data.
Seo et al. introduced the hierarchical clustering explorer (HCE) to visualize hierarchical clustering results. It combines several visualization techniques such as scattergrams, histograms, heatmaps and dendrogram views. In addition to that, it supports dynamic partitioning of clusters by cutting the dendrogram interactively. HCE also enables the comparison of different clustering results while showing the relationship between two clusters with connecting links. Mayday [21, 22] is a similar tool that, in contrast to HCE, provides a wide variety of clustering options.
CComViz  is a cluster comparison application that uses the parallel sets technique to compare clustering results on the same data, and hence is related to the original StratomeX. In contrast to our proposed technique it does not allow for internal evaluation, cluster refinement, or the visualization of cluster fits.
Lex et al. introduced Matchmaker, a method that enables both comparisons of clustering algorithms and clustering and visualization of homogeneous subsets, with the intention of producing better clustering results. Matchmaker uses a hybrid heatmap and a parallel sets or parallel coordinates layout to show relationships between columns, similar to StratomeX. VisBricks is an extension of this idea and provides multiform visualization for the data represented by clusters: users can choose which visualization technique to use for which cluster.
In contrast to these techniques, Domino  provides a completely flexible arrangement of data subsets that can be used to create a wide range of visual representations, including the Matchmaker representation. It is, however, less suitable for cluster evaluation and comparison.
A tool that addresses the interactive exploration of fuzzy clustering in combination with biclustering results is FURBY . It uses a force-directed node-link layout, representing clusters as nodes and the relationship between them as links. The distance between nodes encodes the (approximate) similarity of two nodes. FURBY also allows users to refine or improve fuzzy clusterings by choosing a threshold that transforms fuzzy clusters into discrete ones.
Tools such as ClustVis  and Clustrophile  take a more traditional approach to cluster visualization by using scatterplots based on dimensionality reduction (e.g., using PCA) and/or heat maps to visualize clustering results. While these tools are well suited to evaluate a particular clustering result, they are less powerful with regards to comparison between clusterings.
A tool that is more closely related to our work is XCluSim. It focuses on visual exploration and validation of different clustering algorithms and the concordance or discordance among them. It combines several small sub-views to form a multiview layout for cluster evaluation. It contains dendrogram and force-directed graph views to show concordance among different clustering results and uses colors to represent clusters, without showing the underlying data. It offers a parallel sets view where each row represents one clustering result and thick dark ribbons depict which groups are stable, i.e., consistent throughout all clustering results. In contrast to XCluSim, our method integrates cluster metrics with the data more closely and can also bring in other, related data sources to evaluate clusters. Also, XCluSim does not support cluster refinement.
Table 1: Comparison of our technique to the most important existing tools with respect to basic data-processing and visualization features, clustering options, cluster visualization features, and software properties.
Our methods are also related to silhouette plots, which visualize the tightness and separation of the elements in a cluster. Silhouette plots, however, work best for geometric distances and clearly separated and spherical clusters, whereas our approach is more flexible in terms of supporting a variety of different measures of cluster fit. Also, silhouette plots are typically static; however, we could conceivably integrate the metrics used for silhouette plots in our approach. iGPSe, for example, is a system similar to StratomeX that integrates silhouette plots.
Based on our experience in designing multiple tools for visualizing clustered biomolecular data [10, 11, 19, 24, 25, 32], conversations with bioinformaticians, and a literature review, we elicited a list of requirements that a tool for the analysis of clustered matrices from the biomolecular domain should address.
R I: Provide representative algorithms with control over parametrization. A good cluster analysis tool should enable investigators to flexibly run various clustering algorithms on the data. Users should have control over all parameters and should be able to choose from various similarity metrics.
R II: Work with discrete, hierarchical and probabilistic cluster assignments. Visualization tools that deal with the analysis of cluster assignments should be able to work with all important types of clustering, namely discrete/partitional, hierarchical, and fuzzy clustering. The visualization of hierarchical and fuzzy clusterings is usually more challenging: to deal with hierarchical clusterings a tool needs to enable dendrogram cuts, and to address the properties of fuzzy clusterings, it must support the analysis of ambiguous and/or redundant assignments.
R III: Enable comparison of cluster assignments. Given the ability to run multiple clustering algorithms, it is essential to enable the comparison of the clustering results. This will allow analysts to judge similarities and differences between algorithms, parametrizations, and similarity measures. It will also enable them to identify stable clusters, i.e., those that are robust to changes in parameters and algorithms.
R IV: Visualize fit of records to their cluster. For the assessment of confidence in cluster assignments, a tool should show the quality of cluster assignments for its records and the overall quality for the cluster. This enables analysts to judge whether a record is a good fit to a cluster or whether it’s an outlier or a bad fit.
R V: Visualize fit of records to other clusters. Clustering algorithms commonly don’t find the perfect fit for a record. Hence, it is useful to enable analysts to investigate if particular records are good fits for other clusters, or whether they are very specific to their assigned clusters. This allows users to consider whether records should be moved to other clusters, whether a group of records should be split off into a separate cluster, and more generally, to evaluate whether the number of clusters in a clustering result is correct.
R VI: Enable refinement of clusters. To enable the improvement of clusters, users should be able to interactively modify clusters. This includes shifting of elements to better fitting clusters based on similarity, merging clusters considered to be similar, and excluding non-fitting groups from individual groups or the whole dataset.
R VII: Visualize context for clusters. It is important to explore evidence for clusters in other data sources. In molecular biology applications in particular, datasets rarely stand alone but are connected to a wealth of other (meta)data. Judging clusters based on effects in other data sources can indicate practical relevance of a clustering, or can reveal dependencies between data sets and hence is important for validation and interpretation of the results.
Based on these requirements, our tool extends StratomeX with new clustering features for cluster evaluation and cluster improvement. Table 1 illustrates how our tool differs from existing clustering tools by comparing their set of features with our work.
We designed our methods to address the aforementioned requirements while taking into account usability and good visualization design practices. Our design was influenced by our decision to integrate the methods into Caleydo StratomeX as StratomeX is a well-established tool for subtype analysis. A prototype of our methods is available at http://caleydo.org/publications/2017_bmc_clustering/. Please also refer to the Additional file 1: supplementary video for an introduction and to observe the interaction.
Additional file 1: We provide a Supplementary Video showing an interactive demonstration of the technique. (MP4 44,339 kb)
We now introduce a set of techniques to address our proposed requirements within this workflow.
Visualizing probabilities for fuzzy clustering
Once scientists have explored the cluster assignments, the next step is to improve the cluster assignments if necessary (R VI).
Splitting only based on distances, however, does not guarantee that the resulting groups are as homogeneous as they could be: all they have in common is a certain distance range from the original centroid, yet these distances could be in opposite “directions”. To improve the homogeneity of split clusters, we can dynamically shift the elements between the clusters, so that the elements are in the cluster that is closest to them using an approach similar to the k-Means algorithm. Shifting is based on the same similarity metric that was used to produce the original stratification.
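The shifting step can be sketched as a k-Means-style reassignment that starts from the groups produced by a split. This is an illustrative sketch under the assumption that cluster centroids are component-wise means; the function name `shift_elements` and the one-dimensional toy data are hypothetical, and the distance function is passed in so that it can match the metric of the original stratification.

```python
def shift_elements(rows, labels, distance, max_rounds=10):
    """Move every row to the group with the nearest centroid and repeat
    until the assignments no longer change (or max_rounds is reached)."""
    labels = list(labels)
    for _ in range(max_rounds):
        # Recompute the centroid of every group that still has members.
        centroids = {}
        for group in set(labels):
            members = [r for r, l in zip(rows, labels) if l == group]
            centroids[group] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [
            min(sorted(centroids), key=lambda g: distance(row, centroids[g]))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical split in which the row 2.0 ended up in the wrong group.
rows = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,)]
after_split = ["a", "a", "b", "b", "b"]
after_shift = shift_elements(rows, after_split, euclidean)
```

In this example the misplaced row is moved to the group whose centroid is closest to it, after which both groups are homogeneous.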
Merging and exclusion
Our application also has the option to merge clusters. Especially when several clusters are split first, it is likely that some of the new clusters exhibit a similar pattern, and that their distances also indicate that they could belong together. This problem of too many clusters for the data can be addressed using a merge operation. We also support cluster exclusion since there might be groups or individual records that are outliers and shouldn’t belong to any cluster.
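On the level of cluster assignments, both operations amount to simple relabeling and filtering, as the following sketch shows. The function names and the group and patient identifiers are hypothetical and serve only to illustrate the operations.

```python
def merge_clusters(labels, sources, target):
    """Relabel every record of the source clusters with the target name."""
    return [target if label in sources else label for label in labels]

def exclude_cluster(records, labels, excluded):
    """Drop all records assigned to the excluded cluster."""
    kept = [(r, l) for r, l in zip(records, labels) if l != excluded]
    return [r for r, _ in kept], [l for _, l in kept]

records = ["p1", "p2", "p3", "p4", "p5"]
labels = ["g0", "g1", "g2", "g1", "g3"]

# Merge two similar clusters, then exclude a cluster considered an outlier group.
merged = merge_clusters(labels, {"g0", "g1"}, "g0+g1")
kept_records, kept_labels = exclude_cluster(records, merged, "g3")
```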
Integration with StratomeX
A common question in clustering is how to determine the appropriate number of clusters in the data. While there are algorithmic approaches, such as the cophenetic correlation coefficient, to estimate the number of clusters, visual inspection is often the initial step in confirming that a clustering algorithm has separated the elements appropriately. In this usage scenario we use our approach to inspect and refine a clustering result provided by an external clustering algorithm and to confirm our results with an integrated clustering algorithm.
Manual cluster refinement
Using the sliders in the within-cluster distance visualization and the cluster splitting function, we separated the aforementioned patients from the clusters named Group 0 and Group 1. Because their profiles are very similar, we merged them into a single cluster using the cluster merging function (see Fig. 10e). The expression profiles in the resulting new cluster look homogeneous and are visibly different from the expression profiles in the other four clusters. We examined patient survival times (Fig. 10f) across the five clusters and did not observe any notable differences in the new cluster. Since the web-based prototype of StratomeX is currently still lacking the guided exploration features of the original standalone application, we were unable to identify a meaningful correlation between the new cluster and mutation and copy number calls or to identify significantly overlapping clusters in other data types.
However, we also compared the five clusters derived from the original four-cluster CNMF result with other clustering results computed on the same gene expression matrix (Fig. 10g). We found, for example, that three-, four-, and five-cluster k-Means clustering results using Euclidean distance include almost exactly the same cluster that we identified in the CNMF clustering results using visual inspection and manual refinement.
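In our scenario this comparison was made visually. As an assumption for illustration only, the degree to which one cluster reappears in another clustering could also be quantified with the Jaccard index, as in the following sketch; the function names and the patient identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets of record identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_match(cluster, clustering):
    """Return (score, name) of the cluster in `clustering` that overlaps
    most with the given cluster."""
    return max((jaccard(cluster, members), name)
               for name, members in clustering.items())

# A manually refined cluster and a hypothetical second clustering.
refined = {"p1", "p2", "p3"}
other_clustering = {
    "k0": {"p1", "p2", "p3", "p4"},
    "k1": {"p5", "p6"},
}
score, name = best_match(refined, other_clustering)
```

A score close to 1 indicates that nearly the same group of records was found by both approaches.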
Our methods are limited by the inherent limitations of StratomeX: when working with a large number of clusters, ribbons between the individual columns can result in clutter. We observe that 10-15 clusters can be used without too much clutter. Also, the number of columns is limited to about ten on typical displays. In terms of computational scalability, we found that even the computationally complex clustering algorithms such as affinity propagation execute almost interactively for a dataset with about 500×1500 entries, and complete within one to two minutes for a genomic dataset with about 500×12,000 entries on our t2.micro Amazon EC2 instance with 1 CPU and 1 GB memory. We find that the performance of our technique is in line with or superior to related techniques (see Table 1).
Our implementation currently cannot appropriately compare columns clustered with fuzzy algorithms, as the ribbons connecting the columns assume that every row exists only once. We plan on addressing this limitation in the future, either by allowing overlapping ribbons, or by using a separate visualization optimized to visualize set overlaps, such as UpSet .
Clustering is an important yet inherently imperfect process. In this paper we have introduced methods to evaluate and refine clustering results for the application to matrix data, as it is commonly used in molecular biology. In contrast to previous approaches, we combine visualization of the data directly with visualization of cluster quality and enable the comparison of multiple clustering results. We also allow interactive refinement of clusters while associating the updated clusters with contextual data, which allows analysts to judge clusters not only by the data used for clustering, but also based on effects observable in related datasets. We argue that our tool is thus the most comprehensive technique to refine, create, evaluate, compare, and contextualize clustering results.
In the future, we plan on adding additional clustering algorithms, as different algorithms have complementary strengths and weaknesses, and on exploring the possibility of using distributed clustering algorithms to scale to even bigger datasets. Also, density-based clustering algorithms, which treat outliers separately, would be valuable to integrate and would mandate an extension of our visualization method. We also plan on addressing cases with large numbers of clusters, a current limitation of our approach, which, however, will likely require a different visualization approach.
We plan on enabling analysts to cluster individual blocks, i.e., to run a clustering algorithm on the subset of records that were previously assigned to a cluster. This approach could be used to identify groups of outliers in clusters, which could then be split off and re-integrated with other clusters.
Finally, we will extend our work to datasets that are not in matrix form. This will require novel visual representations, as there is no equivalent to the well-defined borders of cluster blocks when clustering graphs or textual data.
Availability and requirements
Project name: Caleydo StratomeX
Project home page: http://caleydo.org/publications/2017_bmc_clustering/
Operating system(s): web-based
Other requirements: none
License: The 3-Clause BSD License
We thank Samuel Gratzl for his help with the implementation.
This work was funded in part by the US National Institutes of Health (U01 CA198935, P41 GM103545-17, R00 HG007583) and supported by a fellowship of the FITweltweit program of the German Academic Exchange Service (DAAD). The funding agencies played no role in the design or the conclusion of this study.
Availability of data and materials
The source code is released under the BSD license and is available at http://caleydo.org/publications/2017_bmc_clustering/. We also host a prototype that is accessible through that link. The data we use to demonstrate the technique has been downloaded from the NIH GDC Data Portal at https://portal.gdc.cancer.gov/.
MK implemented the software and contributed to the design of the study and the write-up. AL, NG, and CRJ contributed to the design of the study and the writing. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 13. Macqueen JB. Some methods for classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Berkeley, California, USA: University of California Press: 1967. p. 281–97.
- 28. Demiralp C. Clustrophile: A Tool for Visual Clustering Analysis. In: KDD 2016 workshop on Interactive Data Exploration and Analytics (IDEA’16) August 14th, 2016, San Francisco, CA, USA: 2016.
- 29. L’Yi S, Ko B, Shin D, Cho YJ, Lee J, Kim B, Seo J. XCluSim: A visual analytics tool for interactively comparing multiple clustering results of bioinformatics data. BMC Bioinforma. 2015; 16(11):1–15.
- 33. Gratzl S, Gehlenborg N, Lex A, Strobelt H, Partl C, Streit M. Caleydo Web: An Integrated Visual Analysis Platform for Biomedical Data. In: Poster Compendium of the IEEE Conference on Information Visualization (InfoVis ’15). Chicago, IL, USA: IEEE: 2015.
- 36. Broad Institute TCGA Genome Data Analysis Center. Clustering of mRNA Expression: Consensus NMF. 2013. doi:10.7908/C16W983Z.
- 38. Ester M, Kriegel HP, Sander J, Xu X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: The second international conference on Knowledge Discovery and Data Mining (KDD-96) August 2–4, 1996, Portland, Oregon. Association for the Advancement of Artificial Intelligence: 1996. p. 226–31.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.