Core Partitioning: K-means and Similarity Clustering
Abstract
K-means is arguably the most popular cluster-analysis method. The method's output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is rather intuitive and usually requires just a few pages to present. In contrast, this text includes a number of less popular subjects that are quite important when using K-means for real-world data analysis:

Data standardization, especially at nominal or mixed scales

Innate and other tools for interpretation of clusters

Analysis of examples of K-means at work, and of its failures

Initialization: the choice of the number of clusters and the location of centers.
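The twofold output above, a partition plus cluster centers, is produced by alternating two steps: assigning each entity to its nearest center, then recomputing each center as the mean of its cluster. A minimal sketch in Python (the function name, toy data, and seeding are illustrative, not the book's implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Batch K-means: returns (labels, centers), the method's twofold output."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at k distinct entities
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each entity goes to its nearest center.
        labels = [min(range(k),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(x, centers[j])))
                  for x in points]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [x for x, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:  # keep the old center for an empty cluster
                new_centers.append(centers[j])
        if new_centers == centers:  # no change: the process has converged
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups: K-means recovers them as the two clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, 2)
```

The sketch also shows why initialization matters: the sampled starting centers determine which local optimum the alternating process reaches.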
Versions of K-means such as incremental K-means, nature-inspired K-means, and entity-center "medoid" methods are presented. Three modifications of K-means for different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen's self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid. An equivalent reformulation of the K-means criterion is described, yielding what we call the complementary criterion. This criterion allows the method to be reinterpreted as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the Principal Component Analysis criterion to the case in which the scoring factors are assumed to be binary. This allows us to address a haunting issue in K-means, finding the "right" number of clusters K, by building Anomalous clusters one by one. Section 4.6 is devoted to partitioning over similarity data. First of all, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized by a consecutive merger process, referred to as SA-Agglomeration clustering, to produce clusters that are provably tight on average. Provided that the data have been preprocessed by zeroing the similarities of objects to themselves, this method stops merging clusters when the criterion no longer increases. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two preprocessing options are considered: a popular "modularity" clustering option, based on subtraction of random interactions, and "uniform" partitioning, based on a scale shift, a.k.a. soft thresholding. With either preprocessing option, summary clustering also leads to an automated determination of the number of clusters. The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept.
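The complementary criterion mentioned above can be illustrated numerically: for any partition, the data scatter T decomposes as T = W + B, where W is the within-cluster summary squared error minimized by K-means and B = Σ_k N_k ⟨c_k, c_k⟩ is the complementary criterion to be maximized. A small Python check (function names and toy data are illustrative; clusters are assumed non-empty):

```python
def scatter(points):
    """Data scatter T: the sum of all squared entries of the data matrix."""
    return sum(p ** 2 for x in points for p in x)

def decomposition(points, labels, k):
    """Return (W, B): within-cluster squared error W and the complementary
    criterion B = sum_k N_k * ||c_k||^2, so that W + B equals the scatter T.
    Assumes every cluster j = 0, ..., k-1 is non-empty."""
    W = B = 0.0
    for j in range(k):
        members = [x for x, l in zip(points, labels) if l == j]
        c = [sum(col) / len(members) for col in zip(*members)]  # cluster center
        W += sum(sum((p - q) ** 2 for p, q in zip(x, c)) for x in members)
        B += len(members) * sum(q ** 2 for q in c)
    return W, B

# Toy data: two clusters of two points each.
pts = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0), (7.0, 5.0)]
W, B = decomposition(pts, [0, 0, 1, 1], 2)
```

Since T is constant for a given data set, minimizing W over partitions is the same as maximizing B, which is the reinterpretation of K-means as a search for big anomalous clusters.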
In the context of finding a central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix. The consensus similarity is defined, for any two objects, as the number of clusters in the ensemble to which both objects belong. This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
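The consensus matrix just described is straightforward to compute; a minimal sketch (the function name and toy ensemble are illustrative):

```python
def consensus_matrix(partitions, n):
    """A[i][j] = number of partitions in the ensemble that place objects
    i and j in the same cluster (the consensus similarity)."""
    A = [[0] * n for _ in range(n)]
    for labels in partitions:  # each partition given as a label vector
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    A[i][j] += 1
    return A

# An ensemble of three partitions of four objects.
P = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
A = consensus_matrix(P, 4)
```

The resulting matrix is a similarity matrix over the objects, so the similarity-clustering machinery of the preceding section (for example, zeroing the diagonal before applying the semi-average criterion) can be applied to it directly.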