Core Partitioning: K-means and Similarity Clustering

A chapter in Core Data Analysis: Summarization, Correlation, and Visualization, part of the book series Undergraduate Topics in Computer Science (UTICS).

Abstract

K-means is arguably the most popular cluster-analysis method. Its output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is rather intuitive and can usually be presented in just a few pages; a minimal sketch of the base algorithm is given after the list below. In contrast, this text covers a number of less popular subjects that are highly important when applying K-means to real-world data analysis:

  • Data standardization, especially at nominal or mixed scales

  • Innate and other tools for the interpretation of clusters

  • Analysis of examples of K-means at work, including its failures

  • Initialization: the choice of the number of clusters and the location of centers.
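
As a point of reference, here is a minimal sketch of the base batch K-means (Lloyd-style alternating steps), assuming a NumPy data matrix X whose rows are the entities and a user-chosen number of clusters K; all names here are illustrative, not taken from the chapter:

    import numpy as np

    def k_means(X, K, n_iter=100, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        # Initialization: K entities drawn at random serve as initial centers.
        centers = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each entity goes to its nearest center
            # (squared Euclidean distance).
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its cluster.
            new_centers = centers.copy()
            for k in range(K):
                members = X[labels == k]
                if len(members):  # keep the old center if a cluster empties
                    new_centers[k] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break  # converged: centers, and hence the partition, are stable
            centers = new_centers
        return labels, centers

Neither step can increase the summary within-cluster squared error, which is why the iterations stabilize; the result, however, depends on the initial centers, which is exactly the initialization issue listed above.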

Versions of K-means such as incremental K-means, nature-inspired K-means, and entity-center ("medoid") methods are presented. Three modifications of K-means to different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen's self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid.

An equivalent reformulation of the K-means criterion is described, yielding what we call the complementary criterion. This criterion allows the method to be reinterpreted as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the principal component analysis criterion to the case in which the scoring factors are required to be binary. This makes it possible to address a long-standing issue in K-means, finding the "right" number of clusters K, by building Anomalous clusters one by one.

Section 4.6 is devoted to partitioning over similarity data. First, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized by a consecutive merger process, referred to as SA-Agglomeration clustering, which produces clusters that are provably tight on average. The method stops merging clusters when the criterion no longer increases, provided the data have been pre-processed by zeroing the similarities of the objects to themselves. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two pre-processing options are considered: a popular "modularity" clustering option, based on the subtraction of random interactions, and "uniform" partitioning, based on a scale shift, a.k.a. soft thresholding. Under either pre-processing option, the summary clustering also leads to an automated determination of the number of clusters.

The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept. In the context of finding a central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix: the consensus similarity of any two objects is defined as the number of clusters in the ensemble to which both objects belong (a short computational sketch of this matrix is given below). This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
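
As an illustration of the consensus matrix just described, the following hedged sketch computes it for an ensemble of partitions, each given as a list of integer cluster labels over the same objects; entry (i, j) counts the partitions whose clusters contain both i and j (function and variable names are illustrative, not from the chapter):

    import numpy as np

    def consensus_matrix(partitions):
        # partitions: shape (n_partitions, n_objects), integer labels.
        partitions = np.asarray(partitions)
        n = partitions.shape[1]
        C = np.zeros((n, n), dtype=int)
        for labels in partitions:
            # Objects i and j co-occur in this partition iff their labels match.
            C += (labels[:, None] == labels[None, :]).astype(int)
        return C

    # Example: three partitions of five objects.
    ensemble = [[0, 0, 1, 1, 1],
                [0, 0, 0, 1, 1],
                [1, 1, 0, 0, 0]]
    print(consensus_matrix(ensemble))

Treated as a similarity matrix, the result can then be clustered with either the semi-average criterion or the uniform partitioning criterion, as the abstract notes.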

Author information

Correspondence to Boris Mirkin.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mirkin, B. (2019). Core Partitioning: K-means and Similarity Clustering. In: Core Data Analysis: Summarization, Correlation, and Visualization. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-00271-8_4

  • DOI: https://doi.org/10.1007/978-3-030-00271-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00270-1

  • Online ISBN: 978-3-030-00271-8

  • eBook Packages: Computer Science (R0)
