Suboptimal Comparison of Partitions
The distinction between classification and clustering is often based on a priori knowledge of classification labels. However, in the purely theoretical situation where a data-generating model is known, the optimal solutions for clustering do not necessarily correspond to optimal solutions for classification. Exploring this divergence leads us to conclude that no standard measure of either internal or external validation can guarantee a correspondence with optimal clustering performance. We provide recommendations for the suboptimal evaluation of clustering performance. Such suboptimal approaches can provide valuable insight to researchers hoping to add a post hoc interpretation to their clusters. Indices based on pairwise linkage provide the clearest probabilistic interpretation, while a triplet-based index yields information on higher-level structures in the data. Finally, a graphical examination of receiver operating characteristics generated from hierarchical clustering dendrograms can convey information that would be lost in any single-number summary.
Keywords: Classification, Clustering, Sensitivity, Specificity, Triplet index, Hierarchical receiver operating characteristic
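To make the idea of a pairwise-linkage index concrete, the following is a minimal sketch rather than the authors' definition. It assumes that a pair of items counts as "positive" when both share a reference label, and as "predicted positive" when both fall in the same cluster; sensitivity and specificity are then computed over all pairs. The function name `pairwise_sensitivity_specificity` and the toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_sensitivity_specificity(reference, clustering):
    """Compare two partitions of the same items over all unordered pairs.

    Assumption (not taken from the paper): a pair is a true positive when
    the two items share a reference label AND share a cluster.
    Returns (sensitivity, specificity); a value is NaN when undefined.
    """
    if len(reference) != len(clustering):
        raise ValueError("partitions must label the same number of items")

    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_clu = clustering[i] == clustering[j]
        if same_ref and same_clu:
            tp += 1  # linked in both partitions
        elif same_ref:
            fn += 1  # linked in the reference, split by the clustering
        elif same_clu:
            fp += 1  # split in the reference, linked by the clustering
        else:
            tn += 1  # split in both partitions

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical toy data: six items, two reference classes, two clusters.
reference_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [0, 0, 1, 1, 1, 1]
sens, spec = pairwise_sensitivity_specificity(reference_labels, cluster_labels)
print(f"pairwise sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Under these assumptions, each value reads as a conditional probability over randomly drawn pairs: sensitivity is the chance that two items from the same reference class are clustered together, and specificity is the chance that two items from different classes are kept apart.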
The authors thank the National Cancer Institute for supporting this research through the training grant “Biostatistics for Research in Genomics and Cancer,” NCI grant 5T32CA106209-07 (T32), and the National Institute of Environmental Health Sciences for supporting it through the training grant T32ES007018.