Core Partitioning: K-means and Similarity Clustering

  • Boris Mirkin
Part of the Undergraduate Topics in Computer Science book series (UTICS)


K-means is arguably the most popular cluster-analysis method. Its output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is intuitive and is usually presented in just a few pages. In contrast, this text covers a number of less popular subjects that are of much importance when applying K-means to real-world data analysis:

  • Data standardization, especially at nominal or mixed scales

  • Innate and other tools for interpretation of clusters

  • Analysis of examples of K-means at work and of its failures

  • Initialization: the choice of the number of clusters and the location of centers.
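Before turning to those subjects, the two-fold output mentioned above can be illustrated with a minimal sketch of batch K-means, alternating the assignment and center-update steps. This is a generic illustration, not the book's own code; the random initialization at data points is just one of the initialization options the chapter discusses.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal batch K-means sketch: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers
```

On well-separated data the alternating process typically converges in a handful of iterations, but only to a local optimum of the square-error criterion, which is why initialization matters.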

Versions of K-means such as incremental K-means, nature-inspired K-means, and entity-center "medoid" methods are presented. Three modifications of K-means for different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen's self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid. An equivalent reformulation of the K-means criterion is described, yielding what we call the complementary criterion. This criterion allows one to reinterpret the method as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the Principal Component Analysis criterion to the case in which the scoring factors are constrained to be binary. This makes it possible to address a long-standing issue of K-means, finding the "right" number of clusters K, by building Anomalous clusters one by one.

Section 4.6 is devoted to partitioning over similarity data. First of all, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized by a consecutive merger process, referred to as SA-Agglomeration clustering, which produces clusters that are provably tight on average. The method stops merging clusters when the criterion no longer increases, provided that the data have been pre-processed by zeroing the similarities of the objects to themselves. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two pre-processing options are considered: a popular "modularity" clustering option, based on the subtraction of random interactions, and "uniform" partitioning, based on a scale shift, a.k.a. soft thresholding. With either pre-processing option, summary clustering also leads to an automated determination of the number of clusters. The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept.
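The merge-until-no-gain behavior described above can be sketched as follows. This is a toy illustration under stated assumptions, not the book's implementation: the semi-average criterion is taken here as the sum, over clusters, of the within-cluster similarity total divided by the cluster size, with self-similarities zeroed as the pre-processing step requires.

```python
import numpy as np

def semi_average(A, clusters):
    """Sum over clusters of (within-cluster similarity sum) / (cluster size).
    Assumes the diagonal of A has already been zeroed."""
    return sum(A[np.ix_(c, c)].sum() / len(c) for c in clusters)

def sa_agglomeration(A):
    """Greedy agglomeration: merge the pair of clusters giving the largest
    criterion increase; stop when no merge increases the criterion."""
    A = A.copy()
    np.fill_diagonal(A, 0.0)  # zero self-similarities (pre-processing)
    clusters = [[i] for i in range(len(A))]
    current = semi_average(A, clusters)
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                trial = (clusters[:p] + clusters[p + 1:q] + clusters[q + 1:]
                         + [clusters[p] + clusters[q]])
                gain = semi_average(A, trial) - current
                if gain > best_gain:
                    best_gain, best_pair = gain, (p, q)
        if best_pair is None:
            break  # no merge increases the criterion: clusters are final
        p, q = best_pair
        clusters = (clusters[:p] + clusters[p + 1:q] + clusters[q + 1:]
                    + [clusters[p] + clusters[q]])
        current += best_gain
    return clusters
```

Because merging everything into one cluster would lower the criterion on block-structured data, the stopping rule itself determines the number of clusters, which is the point made above.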
In the context of finding a central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix. The consensus similarity between any two objects is defined as the number of clusters in the ensemble to which both objects belong. This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
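The consensus matrix defined above is straightforward to compute: for each partition in the ensemble, count the object pairs placed in the same cluster. A short sketch (illustrative, not the book's code), with partitions given as label vectors:

```python
import numpy as np

def consensus_matrix(partitions, n):
    """Entry (i, j) counts the partitions in the ensemble that place
    objects i and j in the same cluster."""
    C = np.zeros((n, n), dtype=int)
    for labels in partitions:
        labels = np.asarray(labels)
        # Pairwise co-membership indicator for this partition
        C += (labels[:, None] == labels[None, :]).astype(int)
    return C
```

The resulting matrix can then be fed to a similarity-clustering criterion, such as the semi-average or the uniform partitioning criterion, exactly as the text indicates.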


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Data Analysis and Artificial Intelligence, Faculty of Computer Science, National Research University Higher School of Economics, Moscow, Russia
  2. Professor Emeritus, Department of Computer Science and Information Systems, Birkbeck University of London, London, UK