Suboptimal Comparison of Partitions

  • Jonathon J. O’Brien
  • Michael T. Lawson
  • Devin K. Schweppe
  • Bahjat F. Qaqish

Abstract

The distinction between classification and clustering is often based on a priori knowledge of classification labels. However, in the purely theoretical situation where a data-generating model is known, the optimal solutions for clustering do not necessarily correspond to optimal solutions for classification. Exploring this divergence leads us to conclude that no standard measure of either internal or external validation can guarantee a correspondence with optimal clustering performance. We provide recommendations for the suboptimal evaluation of clustering performance. Such suboptimal approaches can provide valuable insight to researchers hoping to add a post hoc interpretation to their clusters. Indices based on pairwise linkage provide the clearest probabilistic interpretation, while a triplet-based index yields information on higher-level structures in the data. Finally, a graphical examination of receiver operating characteristics generated from hierarchical clustering dendrograms can convey information that would be lost in any single-number summary.
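
As an illustration of why pairwise-linkage indices read naturally as probabilities, the following is a minimal sketch, not the authors' implementation. It assumes the usual pair-counting convention: every pair of items is scored by whether the reference partition and the clustering each place the two items together. The function names and the example labels are hypothetical.

```python
from itertools import combinations


def pairwise_counts(reference, clustering):
    """Classify every item pair by whether each partition links it.

    Returns (tp, fp, fn, tn), where a "positive" is a pair that the
    clustering places in the same cluster.
    """
    if len(reference) != len(clustering):
        raise ValueError("label sequences must have equal length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_clu = clustering[i] == clustering[j]
        if same_ref and same_clu:
            tp += 1  # linked in both partitions
        elif same_clu:
            fp += 1  # linked only by the clustering
        elif same_ref:
            fn += 1  # linked only by the reference
        else:
            tn += 1  # separated in both partitions
    return tp, fp, fn, tn


def pairwise_sensitivity_specificity(reference, clustering):
    """Pairwise sensitivity and specificity of a clustering.

    Sensitivity: share of reference-linked pairs that the clustering links.
    Specificity: share of reference-separated pairs that it separates.
    """
    tp, fp, fn, tn = pairwise_counts(reference, clustering)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical labels for six items (assumed for illustration only).
reference = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]
sens, spec = pairwise_sensitivity_specificity(reference, clustering)
```

Under this convention, sensitivity estimates the probability that two items sharing a reference label are clustered together, and specificity the probability that two items with different reference labels are kept apart. Evaluating the pair of values at each cut height of a dendrogram would trace out one point per cut of a receiver operating characteristic.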

Keywords

Classification; Clustering; Sensitivity; Specificity; Triplet index; Hierarchical receiver operating characteristic

Notes

Acknowledgments

The authors thank the National Cancer Institute for supporting this research through the training grant “Biostatistics for Research in Genomics and Cancer,” NCI grant 5T32CA106209-07 (T32), and the National Institute of Environmental Health Sciences for supporting it through the training grant T32ES007018.

Copyright information

© The Classification Society 2019

Authors and Affiliations

  1. Department of Cell Biology, Harvard Medical School, Boston, USA
  2. Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
