Partition-Based Clustering with k-Means

Introduction to HPC with MPI for Data Science

Part of the book series: Undergraduate Topics in Computer Science ((UTICS))

Abstract

Nowadays, huge data sets are commonly available to the public, and it becomes increasingly important to process them efficiently in order to discover worthwhile structures (or “patterns”) in those seas of data. Exploratory data analysis is concerned with the challenge of finding such structural information without any prior knowledge: techniques that learn from data without prior knowledge are generically called unsupervised machine learning.
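As a concrete anchor for the chapter's subject, Lloyd's batch heuristic for k-means can be sketched in a few lines. This is an illustrative, sequential sketch (function name and the random seeding are our own choices), not the book's MPI implementation:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's heuristic: alternate nearest-center assignment and centroid updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # naive seeding: k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # fixed point reached: stop early
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data this recovers the cluster means; in general Lloyd's heuristic only reaches a local minimum of the k-means cost, which is why careful seeding schemes such as k-means++ are used in practice.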


Notes

  1. Available online at https://archive.ics.uci.edu/ml/datasets.html.

  2. http://en.wikipedia.org/wiki/Iris_flower_data_set.

  3. Otherwise, we can choose the “smallest” x that yields the minimum value according to some lexicographic order on \({\mathbb {X}}\).

  4. http://en.wikipedia.org/wiki/Geometric_median.
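The geometric median referenced in this note has no closed form in general, but Weiszfeld's classical fixed-point iteration approximates it: each step re-weights the points by their inverse distances to the current estimate. The sketch below is illustrative (function name, iteration count, and tolerance are our own choices):

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld's iteration for the point minimizing the sum of distances."""
    x = [sum(c) / len(points) for c in zip(*points)]   # start at the centroid
    for _ in range(iters):
        num = [0.0] * len(x)
        den = 0.0
        for p in points:
            d = math.dist(p, x)
            if d < eps:              # iterate landed on a data point; stop there
                return tuple(x)
            w = 1.0 / d              # inverse-distance weight
            for i, c in enumerate(p):
                num[i] += w * c
            den += w
        x = [n / den for n in num]   # weighted average of the points
    return tuple(x)
```

Unlike the centroid used by k-means, this point minimizes the sum of (non-squared) Euclidean distances, which makes it robust to outliers.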

  5. Freely available online at https://www.r-project.org/.

  6. The 3-SAT problem consists in answering whether a boolean formula with n clauses of 3 literals each is satisfiable or not. 3-SAT is a famous NP-complete problem (Cook’s theorem, 1971), a cornerstone of theoretical computer science.

  7. The number of distinct partitions of a set of n elements into k non-empty subsets is given by the Stirling number of the second kind: \(\left\{ \begin{matrix} n \\ k \end{matrix}\right\} = \frac{1}{k!}\sum_{j=0}^{k} (-1)^{k-j}\binom{k}{j} j^n\).
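The inclusion–exclusion formula in this note can be cross-checked against the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1); the small sketch below does both (helper names are our own):

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind via the recurrence
    S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if n == k:
        return 1                      # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def stirling2_formula(n, k):
    """The inclusion-exclusion formula quoted in the note."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(k + 1)) // factorial(k)
```

For example, the 4-element set splits into 2 non-empty parts in S(4, 2) = 7 ways, and both computations agree on that count.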

  8. Vertical partitioning means that each entity has only a block of attributes.

  9. It is mathematically always possible to separate data by lifting the features into higher dimensions using a kernel mapping.
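As a toy illustration of this note: the XOR-labelled point pattern is not linearly separable in the plane, but an explicit degree-2 feature map (a stand-in for a kernel lift; the map and labels below are illustrative choices, not from the text) makes it separable by a hyperplane:

```python
def lift(p):
    """Explicit degree-2 feature map: (x, y) -> (x, y, x*y)."""
    x, y = p
    return (x, y, x * y)

# XOR pattern: the label is the sign of x*y, inseparable by any line in 2D.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1, -1, -1, 1]

# In the lifted 3D space the hyperplane z = 0 (normal w = (0, 0, 1)) separates them.
w = (0.0, 0.0, 1.0)
preds = [1 if sum(a * b for a, b in zip(w, lift(p))) > 0 else -1 for p in points]
```

Kernel k-means exploits exactly this idea while computing inner products in the lifted space implicitly, without materializing the features.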


Author information


Correspondence to Frank Nielsen.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Nielsen, F. (2016). Partition-Based Clustering with k-Means. In: Introduction to HPC with MPI for Data Science. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-21903-5_7

  • DOI: https://doi.org/10.1007/978-3-319-21903-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21902-8

  • Online ISBN: 978-3-319-21903-5

  • eBook Packages: Computer Science (R0)
