Core Partitioning: K-means and Similarity Clustering

A chapter in Core Data Analysis: Summarization, Correlation, and Visualization, part of the book series Undergraduate Topics in Computer Science (UTICS).

Abstract

K-means is arguably the most popular cluster-analysis method. Its output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is rather intuitive and can usually be presented in just a few pages; a minimal sketch of the base algorithm is given after the list below. In contrast, this text covers a number of less popular subjects that are highly important when applying K-means to real-world data analysis:

  • Data standardization, especially at nominal or mixed scales

  • Innate and other tools for the interpretation of clusters

  • Analysis of examples of K-means at work, including its failures

  • Initialization: the choice of the number of clusters and the location of centers.
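
As a point of reference, here is a minimal sketch of the base batch K-means (Lloyd-style alternating steps), assuming a NumPy data matrix X whose rows are the entities and a user-chosen number of clusters K; all names here are illustrative, not taken from the chapter:

    import numpy as np

    def k_means(X, K, n_iter=100, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        # Initialization: K entities drawn at random serve as initial centers.
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each entity goes to its nearest center
            # (squared Euclidean distance).
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its cluster.
            new_centers = centers.copy()
            for k in range(K):
                members = X[labels == k]
                if len(members):  # keep the old center if a cluster empties
                    new_centers[k] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break  # converged: centers, and hence the partition, are stable
            centers = new_centers
        return labels, centers

Neither step can increase the summary within-cluster squared error, which is why the iterations stabilize; the result, however, depends on the initial centers, which is exactly the initialization issue listed above.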

Versions of K-means such as incremental K-means, nature-inspired K-means, and entity-center ("medoid") methods are presented. Three modifications of K-means to different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen's self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid.

An equivalent reformulation of the K-means criterion is described, yielding what we call the complementary criterion. This criterion allows the method to be reinterpreted as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the principal component analysis criterion to the case in which the scoring factors are required to be binary. This makes it possible to address a long-standing issue in K-means, finding the "right" number of clusters K, by building Anomalous clusters one by one.

Section 4.6 is devoted to partitioning over similarity data. First, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized by a consecutive merger process, referred to as SA-Agglomeration clustering, which produces clusters that are provably tight on average. The method stops merging clusters when the criterion no longer increases, provided the data have been pre-processed by zeroing the similarities of the objects to themselves. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two pre-processing options are considered: a popular "modularity" clustering option, based on the subtraction of random interactions, and "uniform" partitioning, based on a scale shift, a.k.a. soft thresholding. Under either pre-processing option, the summary clustering also leads to an automated determination of the number of clusters.

The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept. In the context of finding a central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix: the consensus similarity of any two objects is defined as the number of clusters in the ensemble to which both objects belong (a short computational sketch of this matrix is given below). This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
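
As an illustration of the consensus matrix just described, the following hedged sketch computes it for an ensemble of partitions, each given as a list of integer cluster labels over the same objects; entry (i, j) counts the partitions whose clusters contain both i and j (function and variable names are illustrative, not from the chapter):

    import numpy as np

    def consensus_matrix(partitions):
        # partitions: shape (n_partitions, n_objects), integer labels.
        partitions = np.asarray(partitions)
        n = partitions.shape[1]
        C = np.zeros((n, n), dtype=int)
        for labels in partitions:
            # Objects i and j co-occur in this partition iff their labels match.
            C += (labels[:, None] == labels[None, :]).astype(int)
        return C

    # Example: three partitions of five objects.
    ensemble = [[0, 0, 1, 1, 1],
                [0, 0, 0, 1, 1],
                [1, 1, 0, 0, 0]]
    print(consensus_matrix(ensemble))

Treated as a similarity matrix, the result can then be clustered with either the semi-average criterion or the uniform partitioning criterion, as the abstract notes.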

Author information

Correspondence to Boris Mirkin.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mirkin, B. (2019). Core Partitioning: K-means and Similarity Clustering. In: Core Data Analysis: Summarization, Correlation, and Visualization. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-00271-8_4

  • DOI: https://doi.org/10.1007/978-3-030-00271-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00270-1

  • Online ISBN: 978-3-030-00271-8

  • eBook Packages: Computer Science (R0)
