Abstract
Nowadays, huge datasets are commonly publicly available, and it becomes increasingly important to process them efficiently to discover worthwhile structures (or “patterns”) in these seas of data. Exploratory data analysis is concerned with the challenge of finding such structural information without any prior knowledge: techniques that learn from data without prior knowledge are generically called unsupervised machine learning.
Notes
- 1.
Available online at https://archive.ics.uci.edu/ml/datasets.html.
- 2.
- 3.
Otherwise, we can choose the “smallest” x that yields the minimum value according to some lexicographic order on \({\mathbb {X}}\).
- 4.
- 5.
Freely available online at https://www.r-project.org/.
- 6.
The 3-SAT problem consists in deciding whether a boolean formula with n clauses of 3 literals is satisfiable or not. 3-SAT is a famous NP-complete problem (Cook’s theorem, 1971), a cornerstone of theoretical computer science.
- 7.
The number of distinct partitions of a set of n elements into k non-empty subsets is given by the Stirling number of the second kind: \(\left\{ \begin{matrix} n \\ k \end{matrix}\right\} = \frac{1}{k!}\sum _{j=0}^{k} (-1)^{k-j}\left( {\begin{array}{c}k\\ j\end{array}}\right) j^n\).
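This closed-form sum is straightforward to evaluate directly; a minimal sketch (the function name `stirling2` is our own) that transcribes the formula above:

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind: the number of partitions
    of a set of n elements into k non-empty subsets, computed via
    (1/k!) * sum_{j=0}^{k} (-1)^(k-j) * C(k, j) * j^n."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(k + 1)) // factorial(k)

# For example, a 4-element set admits 7 partitions into 2 non-empty subsets.
print(stirling2(4, 2))  # 7
```

The sum is always divisible by k!, so integer division is exact; these numbers grow fast (e.g. stirling2(10, 3) = 9330), which illustrates why exhaustive search over all partitions is hopeless for clustering.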
- 8.
Vertical partitioning means that each site stores only a block (subset) of the attributes for every entity.
- 9.
It is mathematically always possible to separate data by lifting the features into higher dimensions using a kernel mapping.
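A minimal illustration of this lifting idea (our own toy example, not from the chapter): three points on a line whose labels no single threshold can separate become linearly separable after the quadratic feature map \(x \mapsto (x, x^2)\).

```python
# 1-D data: class 0 = {-2, 2} (outer points), class 1 = {0} (inner point).
# No threshold on x alone separates the classes, since class 0 lies on both sides.

def lift(x):
    """Explicit quadratic feature map x -> (x, x^2)."""
    return (x, x * x)

points = [(-2, 0), (2, 0), (0, 1)]          # (value, label) pairs
lifted = [(lift(x), label) for x, label in points]

# After lifting, class 0 maps to (-2, 4) and (2, 4), class 1 to (0, 0):
# the horizontal line y = 1 now separates the two classes.
for (x, y), label in lifted:
    assert (y > 1) == (label == 0)
```

Kernel methods achieve the same effect implicitly, replacing inner products by a kernel function instead of computing the lifted coordinates explicitly.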
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Nielsen, F. (2016). Partition-Based Clustering with k-Means. In: Introduction to HPC with MPI for Data Science. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-21903-5_7
Print ISBN: 978-3-319-21902-8
Online ISBN: 978-3-319-21903-5