Partition-Based Clustering with k-Means

Introduction to HPC with MPI for Data Science

Part of the book series: Undergraduate Topics in Computer Science ((UTICS))

Abstract

Nowadays, huge data sets are commonly available to the public, and it becomes increasingly important to process them efficiently in order to discover worthwhile structures (or “patterns”) in those seas of data. Exploratory data analysis is concerned with the challenge of finding such structural information without any prior knowledge: techniques that learn from data without prior knowledge are generically called unsupervised machine learning.
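As a concrete anchor for the chapter's subject, Lloyd's batch heuristic for k-means can be sketched in a few lines. This is an illustrative, sequential sketch (function name and the random seeding are our own choices), not the book's MPI implementation:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's heuristic: alternate nearest-center assignment and centroid updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # naive seeding: k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # fixed point reached: stop early
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data this recovers the cluster means; in general Lloyd's heuristic only reaches a local minimum of the k-means cost, which is why careful seeding schemes such as k-means++ are used in practice.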


Notes

  1. Available online at https://archive.ics.uci.edu/ml/datasets.html.

  2. http://en.wikipedia.org/wiki/Iris_flower_data_set.

  3. Otherwise, we can choose the “smallest” x that yields the minimum value according to some lexicographic order on \({\mathbb {X}}\).

  4. http://en.wikipedia.org/wiki/Geometric_median.
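The geometric median referenced in this note has no closed form in general, but Weiszfeld's classical fixed-point iteration approximates it: each step re-weights the points by their inverse distances to the current estimate. The sketch below is illustrative (function name, iteration count, and tolerance are our own choices):

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld's iteration for the point minimizing the sum of distances."""
    x = [sum(c) / len(points) for c in zip(*points)]   # start at the centroid
    for _ in range(iters):
        num = [0.0] * len(x)
        den = 0.0
        for p in points:
            d = math.dist(p, x)
            if d < eps:              # iterate landed on a data point; stop there
                return tuple(x)
            w = 1.0 / d              # inverse-distance weight
            for i, c in enumerate(p):
                num[i] += w * c
            den += w
        x = [n / den for n in num]   # weighted average of the points
    return tuple(x)
```

Unlike the centroid used by k-means, this point minimizes the sum of (non-squared) Euclidean distances, which makes it robust to outliers.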

  5. Freely available online at https://www.r-project.org/.

  6. The 3-SAT problem consists in answering whether a boolean formula with n clauses of 3 literals each is satisfiable or not. 3-SAT is a famous NP-complete problem (Cook’s theorem, 1971), a cornerstone of theoretical computer science.

  7. The number of distinct partitions of a set of n elements into k non-empty subsets is given by the Stirling number of the second kind: \(\left\{ \begin{matrix} n \\ k \end{matrix}\right\} = \frac{1}{k!}\sum_{j=0}^{k} (-1)^{k-j}\binom{k}{j} j^n\).
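The inclusion–exclusion formula in this note can be cross-checked against the standard recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1); the small sketch below does both (helper names are our own):

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind via the recurrence
    S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if n == k:
        return 1                      # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def stirling2_formula(n, k):
    """The inclusion-exclusion formula quoted in the note."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(k + 1)) // factorial(k)
```

For example, the 4-element set splits into 2 non-empty parts in S(4, 2) = 7 ways, and both computations agree on that count.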

  8. Vertical partitioning means that each entity has only a block of attributes.

  9. It is mathematically always possible to separate data by lifting the features into higher dimensions using a kernel mapping.
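As a toy illustration of this note: the XOR-labelled point pattern is not linearly separable in the plane, but an explicit degree-2 feature map (a stand-in for a kernel lift; the map and labels below are illustrative choices, not from the text) makes it separable by a hyperplane:

```python
def lift(p):
    """Explicit degree-2 feature map: (x, y) -> (x, y, x*y)."""
    x, y = p
    return (x, y, x * y)

# XOR pattern: the label is the sign of x*y, inseparable by any line in 2D.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1, -1, -1, 1]

# In the lifted 3D space the hyperplane z = 0 (normal w = (0, 0, 1)) separates them.
w = (0.0, 0.0, 1.0)
preds = [1 if sum(a * b for a, b in zip(w, lift(p))) > 0 else -1 for p in points]
```

Kernel k-means exploits exactly this idea while computing inner products in the lifted space implicitly, without materializing the features.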


Author information


Correspondence to Frank Nielsen.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Nielsen, F. (2016). Partition-Based Clustering with k-Means. In: Introduction to HPC with MPI for Data Science. Undergraduate Topics in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-319-21903-5_7

  • DOI: https://doi.org/10.1007/978-3-319-21903-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21902-8

  • Online ISBN: 978-3-319-21903-5

  • eBook Packages: Computer Science (R0)
