Skip to main content

A Fast and Efficient Grid-Based K-means++ Clustering Algorithm for Large-Scale Datasets

  • Conference paper
  • First Online:

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 891))

Abstract

In the k-means clustering algorithm, the selection of the initial clustering center affects the clustering efficiency. Currently widely used k-means++ can effectively improve the speed and accuracy of k-means. But k-means cluster algorithm does not scale well to massive datasets, as it needs to traverse the data set multiple times. In this paper, based on k-means++ clustering algorithm and grid clustering algorithm, a fast and efficient grid-based k-means++ clustering algorithm was proposed, which can efficiently process large-scale data. First, the N-dimensional space is granulated into disjoint rectangular grid cells. Then, the dense grid cell is marked by statistical gird cell information. Finally, the modified k-means++ clustering algorithm is applied to the meshed datasets. The experimental results on the simulation dataset show that compared with the original k-means++ clustering algorithm, the proposed algorithm can quickly obtain the clustering center and can effectively deal with the clustering problem of large-scale datasets.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   169.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   219.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Chen, Y.S., Chen, B.T.: Efficient fuzzy c-means clustering for image data. J. Electron. Imaging 14(1), 013017 (2005). https://doi.org/10.1117/1.1879012

    Article  Google Scholar 

  2. Lavrač, N.: Selected techniques for data mining in medicine. Artif. Intell. Med. 16(1), 3–23 (1999). https://doi.org/10.1016/S0933-3657(98)00062-1

    Article  MathSciNet  Google Scholar 

  3. Nazeri, Z., Bloedorn, E., Ostwald, P.: Experiences in mining aviation safety data. In: ACM SIGMOD Record, vol. 30, No. 2, pp. 562–566. ACM (2001). https://doi.org/10.1145/376284.375743

    Article  Google Scholar 

  4. Lynch, C.: Big data: How do your data grow? Nature 455(7209), 28 (2008). https://doi.org/10.1038/455028a

    Article  Google Scholar 

  5. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a k-means clustering algorithm. J. Roy. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 100–108 (1979). https://doi.org/10.2307/2346830

    Article  Google Scholar 

  6. Arthur, D., Vassilvitskii, S.: k-means ++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia (2007). https://doi.org/10.1145/1283383.1283494

  7. Anusha, M., Sathiaseelan, J.G.R.: Feature selection using k-means genetic algorithm for multi-objective optimization. Procedia Comput. Sci. 57, 1074–1080 (2015). https://doi.org/10.1016/j.procs.2015.07.387

    Article  Google Scholar 

  8. Li, M.J., Ng, M.K., Cheung, Y.M., Huang, J.Z.: Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters. IEEE Trans. Knowl. Data Eng. 20(11), 1519–1534 (2008). https://doi.org/10.1109/TKDE.2008.88

    Article  Google Scholar 

  9. Berger, M., Rigoutsos, I.: An algorithm for point clustering and grid generation. IEEE Trans. Syst. Man Cybern. 21(5), 1278–1286 (1991). https://doi.org/10.1109/21.120081

    Article  Google Scholar 

  10. Bhatnagar, V., Kaur, S., Chakravarthy, S.: Clustering data streams using grid-based synopsis. Knowl. Inf. Syst. 41(1), 127–152 (2014). https://doi.org/10.1007/s10115-013-0659-1

    Article  Google Scholar 

  11. Park, N.H., Lee, W.S.: Statistical grid-based clustering over data streams. ACM Sigmod Record 33(1), 32–37 (2004). https://doi.org/10.1145/974121.974127

    Article  Google Scholar 

  12. Yue, S., Wei, M., Wang, J.S., Wang, H.: A general grid-clustering approach. Pattern Recogn. Lett. 29(9), 1372–1384 (2008). https://doi.org/10.1016/j.patrec.2008.02.019

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yang Yang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Yang, Y., Zhu, Z. (2019). A Fast and Efficient Grid-Based K-means++ Clustering Algorithm for Large-Scale Datasets. In: Krömer, P., Zhang, H., Liang, Y., Pan, JS. (eds) Proceedings of the Fifth Euro-China Conference on Intelligent Data Analysis and Applications. ECC 2018. Advances in Intelligent Systems and Computing, vol 891. Springer, Cham. https://doi.org/10.1007/978-3-030-03766-6_57

Download citation

Publish with us

Policies and ethics