A Fast and Efficient Grid-Based K-means++ Clustering Algorithm for Large-Scale Datasets
In the k-means clustering algorithm, the choice of initial cluster centers affects clustering efficiency. The widely used k-means++ algorithm effectively improves both the speed and the accuracy of k-means. However, k-means does not scale well to massive datasets, because it must traverse the dataset multiple times. In this paper, we propose a fast and efficient grid-based k-means++ clustering algorithm that combines k-means++ with grid-based clustering and can process large-scale data efficiently. First, the N-dimensional space is partitioned into disjoint rectangular grid cells. Then, dense grid cells are marked using per-cell statistics. Finally, a modified k-means++ clustering algorithm is applied to the gridded dataset. Experimental results on a simulated dataset show that, compared with the original k-means++ clustering algorithm, the proposed algorithm obtains the cluster centers quickly and handles the clustering of large-scale datasets effectively.
Keywords: K-means, K-means++, Grid-based clustering algorithm, Large-scale datasets
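The three steps named in the abstract (gridding, dense-cell marking, k-means++ on the gridded data) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not specify how k-means++ is modified, so the sketch assumes each dense cell is represented by its centroid and weighted by its point count. The function names and the parameters `cell_size` and `min_count` are hypothetical, chosen only for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def grid_cells(points, cell_size):
    """Step 1: assign each point to a hyper-rectangular cell.

    Returns {cell_index: (count, centroid)}, computed in one pass.
    """
    acc = {}
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        entry = acc.setdefault(key, [0, [0.0] * len(p)])
        entry[0] += 1
        for i, x in enumerate(p):
            entry[1][i] += x
    return {k: (n, tuple(s / n for s in tot)) for k, (n, tot) in acc.items()}


def dense_cells(cells, min_count):
    """Step 2: keep cells holding at least min_count points (assumed rule)."""
    return [(n, c) for n, c in cells.values() if n >= min_count]


def weighted_kmeanspp_seeds(reps, k, rng):
    """Step 3a: k-means++ seeding over (count, centroid) representatives.

    Assumption: sampling probability is count * squared distance to the
    nearest center chosen so far.
    """
    weights = [n for n, _ in reps]
    pts = [c for _, c in reps]
    centers = [rng.choices(pts, weights=weights, k=1)[0]]
    d2 = [sq_dist(p, centers[0]) for p in pts]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:  # fewer distinct representatives than k
            break
        nxt = rng.choices(pts, weights=probs, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for p, d in zip(pts, d2)]
    return centers


def weighted_kmeans(reps, centers, iters=20):
    """Step 3b: Lloyd iterations on the representatives, weighted by count."""
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in centers]
        totals = [0] * len(centers)
        for n, p in reps:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            totals[j] += n
            for d, x in enumerate(p):
                sums[j][d] += n * x
        # An empty cluster keeps its previous center.
        centers = [
            tuple(s / t for s in row) if t else c
            for row, t, c in zip(sums, totals, centers)
        ]
    return centers
```

Under these assumptions, the full dataset is traversed once to build the grid, and every later iteration touches only the dense-cell representatives, which is where a speed-up over plain k-means++ would come from. A call sequence such as `weighted_kmeans(reps, weighted_kmeanspp_seeds(reps, k, random.Random(0)))`, with `reps = dense_cells(grid_cells(points, cell_size), min_count)`, returns the estimated cluster centers.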
- 6. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, Philadelphia (2007). https://doi.org/10.1145/1283383.1283494