K-Means and Related Clustering Methods

  • Boris Mirkin
Part of the Undergraduate Topics in Computer Science book series (UTICS)


K-Means is arguably the most popular data analysis method. The method outputs a partition of the entity set into clusters, together with centroids representing them. It is very intuitive and can usually be presented in just a few pages. This text includes a number of less popular subjects that are important when using K-Means for real-world data analysis:

  • Data standardization, especially at mixed scales
  • Innate tools for interpretation of clusters
  • Analysis of examples of K-Means working and of its failures
  • Initialization: the choice of the number of clusters and the location of centroids

Versions of K-Means such as incremental K-Means, nature-inspired K-Means, and entity-centroid "medoid" methods are presented. Three modifications of K-Means for different cluster structures are given: Fuzzy K-Means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid. Equivalent reformulations of the K-Means criterion are described; they can yield different algorithms for K-Means. One of these is explained at length: K-Means extends principal component analysis to the case of binary scoring factors, which yields the so-called Anomalous cluster method, a key to an intelligent version of K-Means with automated choice of the number of clusters and their initialization.
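As a rough illustration of the basic method summarized above (a partition of the entities plus one centroid per cluster), the following is a minimal sketch of batch K-Means in plain Python. It is not the chapter's own code: the function names `k_means`, `squared_distance`, and `mean_point` are hypothetical, and starting from randomly sampled data points is an assumption made only to keep the example short.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean (gravity center) of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Alternate nearest-centroid assignment and centroid update.

    Returns (labels, centroids). Initial centroids are k data points
    sampled at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each entity joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # partition is stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids


data = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11), (12, 12)]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

On this toy data set the two groups are far apart, so the sketch separates them; on real data the result depends on the starting centroids and on how the features are standardized.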


Fuzzy Cluster, Data Scatter, Cluster Centroid, Gravity Center, Company Data



Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Boris Mirkin (1, 2)
  1. Research University – Higher School of Economics, School of Applied Mathematics and Informatics, Moscow, Russia
  2. Department of Computer Science, Birkbeck University of London, London, UK
