Abstract
K-means is arguably the most popular cluster-analysis method. Its output is twofold: (1) a partition of the entity set into clusters and (2) centers representing the clusters. The method is rather intuitive and usually takes only a few pages to present. In contrast, this text covers a number of less popular subjects that are of great importance when K-means is used for real-world data analysis:
- Data standardization, especially at nominal or mixed scales
- Innate and other tools for the interpretation of clusters
- Analysis of examples of K-means at work, including its failures
- Initialization: the choice of the number of clusters and the location of centers.
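To fix the notation, the basic alternating scheme behind K-means can be sketched in a few lines. This is a minimal illustration, not the book's own code; the function name `k_means` and the explicit initial centers are ours, and real uses need the standardization and initialization care discussed above.

```python
import numpy as np

def k_means(X, centers, max_iter=100):
    """Batch K-means: alternate (1) assigning each entity to its nearest
    center and (2) recomputing each center as its cluster's mean.
    Returns the twofold output: a partition (labels) and the centers."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Euclidean distance from every entity to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers; keep a center unchanged if its cluster is empty
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):   # converged
            break
        centers = new
    return labels, centers
```

On two well-separated point groups, the scheme converges in a couple of iterations; its sensitivity to the initial centers is exactly why initialization gets its own treatment in the chapter.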
Versions of K-means, such as incremental K-means, nature-inspired K-means, and entity-center "medoid" methods, are presented. Three modifications of K-means to different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen's self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid. An equivalent reformulation of the K-means criterion is described to yield what we call the complementary criterion. This criterion allows the method to be reinterpreted as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the Principal Component Analysis criterion to the case in which the scoring factors are supposed to be binary. This makes it possible to address a haunting issue of K-means, finding the "right" number of clusters K, by building Anomalous clusters one by one. Section 4.6 is devoted to partitioning over similarity data. First of all, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized by a consecutive merger process, referred to as SA-Agglomeration clustering, which produces clusters that are provably tight on average. Provided the data have been pre-processed by zeroing the similarities of the objects to themselves, this method stops merging clusters as soon as the criterion ceases to increase. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two pre-processing options are considered: a popular "modularity" clustering option, based on the subtraction of random interactions, and "uniform" partitioning, based on a scale shift, a.k.a. soft thresholding. With either pre-processing option, summary clustering also leads to an automated determination of the number of clusters. The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept.
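The merging idea can be illustrated with a greedy sketch. This is only one plausible reading of the semi-average scheme, not the chapter's SA-Agglomeration algorithm itself: here the criterion is taken as the sum, over clusters, of the within-cluster similarity total divided by cluster size, with self-similarities zeroed as the pre-processing step, and merging stops as soon as no merge increases it.

```python
import numpy as np

def sa_agglomerate(A):
    """Greedy agglomeration under a semi-average-style criterion:
    sum over clusters of (within-cluster similarity sum) / cluster size.
    Self-similarities are zeroed; merging stops when no merge helps."""
    A = np.asarray(A, dtype=float).copy()
    np.fill_diagonal(A, 0.0)                 # zero objects' self-similarities
    clusters = [[i] for i in range(len(A))]  # start from singletons

    def score(c):
        return A[np.ix_(c, c)].sum() / len(c)

    while len(clusters) > 1:
        best, gain = None, 0.0
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                merged = clusters[p] + clusters[q]
                g = score(merged) - score(clusters[p]) - score(clusters[q])
                if g > gain:                 # strict increase required
                    best, gain = (p, q), g
        if best is None:                     # criterion no longer increases
            break
        p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```

Note that the stopping rule falls out of the criterion itself, so the number of clusters is determined automatically, in line with the abstract's claim for the similarity criteria.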
In the context of finding a central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix. The consensus similarity between any two objects is defined as the number of clusters, across the ensemble's partitions, to which both objects belong. This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
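The consensus matrix itself is straightforward to compute from the ensemble's label vectors; the following sketch (our illustration, with a hypothetical helper name) counts, for each pair of objects, how many partitions place them in the same cluster.

```python
import numpy as np

def consensus_matrix(partitions, n):
    """Consensus similarity: entry (i, j) is the number of partitions in
    the ensemble in which objects i and j fall into the same cluster."""
    C = np.zeros((n, n), dtype=int)
    for labels in partitions:
        labels = np.asarray(labels)
        # boolean co-membership matrix for this partition, added up
        C += (labels[:, None] == labels[None, :]).astype(int)
    return C
```

The resulting matrix is a similarity matrix over the objects, so the semi-average or uniform partitioning criteria of Sect. 4.6 apply to it directly.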
References
R.K. Ahuja, T.L. Magnanti, J.B. Orlin, Network Flows (Pearson Education, London, 2014)
J. Bezdek, J. Keller, R. Krisnapuram, N.R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing (Kluwer Academic Publishers, 1999)
B. Efron, T. Hastie, Computer Age Statistical Inference (Cambridge University Press, Cambridge, 2016)
S.B. Green, N.J. Salkind, Using SPSS for Windows and Macintosh: Analyzing and Understanding Data (Prentice Hall, Upper Saddle River, 2003)
J.A. Hartigan, Clustering Algorithms (Wiley, Hoboken, 1975)
C. Hennig, M. Meila, F. Murtagh, R. Rocci (eds.), Handbook of Cluster Analysis (CRC Press, Boca Raton, 2015)
R. Johnsonbaugh, M. Schaefer, Algorithms (Pearson Prentice Hall, 2004). ISBN 0-13-122853-6
L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (Wiley, Hoboken, 1990)
M.G. Kendall, A. Stuart, Advanced Statistics: Inference and Relationship, 3rd edn. (Griffin, London, 1973). ISBN 0852642154
T. Kohonen, Self-organizing Maps (Springer-Verlag, Berlin, 1995)
A. Kryshtanowski, Analysis of Sociology Data with SPSS (Higher School of Economics Publishers, Moscow, 2008). (in Russian)
B. Mirkin, Mathematical Classification and Clustering (Kluwer Academic Press, 1996)
B. Mirkin, Clustering: A Data Recovery Approach (Chapman & Hall/CRC Press, 2012)
S. Nascimento, Fuzzy Clustering via Proportional Membership Model (ISO Press, 2005)
B. Polyak, Introduction to Optimization (Optimization Software, Los Angeles, 1987). ISBN 0911575144
Articles
I.E. Allen, C.A. Seaman, Likert scales and data analyses. Qual. Prog. 40(7), 64 (2007)
S. Bandyopadhyay, U. Maulik, An evolutionary technique based on K-means algorithm for optimal clustering in R^N. Inf. Sci. 146, 221–237 (2002)
R. Cangelosi, A. Goriely, Component retention in principal component analysis with application to cDNA microarray data. Biol. Direct 2, 2 (2007). http://www.biology-direct.com/content/2/1/2
B.J. Frey, D. Dueck, Clustering by passing messages between data points. Science 315(5814), 972–976 (2007)
M. Girolami, Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 13(3), 780–784 (2002)
J. Kettenring, The practice of cluster analysis. J. Classif. 23, 3–30 (2006)
Y. Lu, S. Lu, F. Fotouhi, Y. Deng, S. Brown, Incremental genetic algorithm and its application in gene expression data analysis. BMC Bioinformatics 5, 172 (2004)
M. Ming-Tso Chiang, B. Mirkin, Intelligent choice of the number of clusters in K-means clustering: an experimental study with different cluster spreads. J. Classif. 27(1), 3–40 (2010)
B. Mirkin, T.I. Fenner, Distance and consensus for preference relations corresponding to ordered partitions. J. Classif. 36(2) (2019) (see also https://publications.hse.ru/en/preprints/210704630, HSE Working Paper, 2016, no. 8)
C.A. Murthy, N. Chowdhury, In search of optimal clusters using genetic algorithm, Pattern Recognit. Lett., 17(8), 825–832 (1996)
S. Nascimento, P. Franco, Unsupervised fuzzy clustering for the segmentation and annotation of upwelling regions in sea surface temperature images, in Discovery Science, ed. by J. Gama. LNCS, vol. 5808 (Springer-Verlag, 2009), pp. 212–224
M.E. Newman, Modularity and community structure in networks. Proc. Natl. Acad. Sci. U.S.A. 103(23), 8577–8582 (2006)
S. Paterlini, T. Krink, Differential evolution and PSO in partitional clustering. Comput. Stat. Data Anal. 50, 1220–1247 (2006)
P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
R. Stanforth, B. Mirkin, E. Kolossov, A measure of domain of applicability for QSAR modelling based on intelligent K-means clustering. QSAR Comb. Sci. 26(7), 837–844 (2007)
D. Steinley, M.J. Brusco, Initializing K-means batch clustering: a critical evaluation of several techniques. J. Classif. 24(1), 99–121 (2007)
© 2019 Springer Nature Switzerland AG
Mirkin, B. (2019). Core Partitioning: K-means and Similarity Clustering. In: Core Data Analysis: Summarization, Correlation, and Visualization. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-00271-8_4
Print ISBN: 978-3-030-00270-1
Online ISBN: 978-3-030-00271-8