Journal of Computer Science and Technology, Volume 34, Issue 1, pp 77–93

Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight

  • Min Li
  • Chao Yang (Email author)
  • Qiao Sun
  • Wen-Jing Ma
  • Wen-Long Cao
  • Yu-Long Ao
Regular Paper


With the advent of the big data era, both the volume of sampled data and the dimensionality of data features are growing rapidly. Fast and efficient clustering of unlabeled samples based on feature similarity is therefore highly desirable. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention. To achieve high-performance k-means computations on modern multi-core/many-core systems, we propose a matrix-based fused framework that achieves high performance by conducting computations on a distance matrix and, at the same time, improves memory reuse by fusing the distance-matrix computation with the nearest-centroid reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, which provides the main computing power of Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint, and a register blocking strategy to increase data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Block-size tuning and performance modeling are also discussed. Experiments on both randomly generated and real-world datasets show that our parallel implementation of k-means on SW26010 can sustain a double-precision performance of over 348.1 Gflops on a single core group, which is 46.9% of the peak performance and 84% of the theoretical performance upper bound, and can achieve nearly ideal scalability across the four core groups of the whole SW26010 processor. Performance comparisons with the previous state of the art on both CPU and GPU are also provided to show the superiority of our optimized k-means kernel.
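The idea of fusing the distance-matrix computation with the nearest-centroid reduction can be illustrated with a minimal sketch. It is not the SW26010 implementation described in the paper; it is a plain Python illustration under stated assumptions. The names `assign_fused`, `assign_naive`, and `tile_rows` are hypothetical, and the tile size is arbitrary rather than tuned. The sketch relies on the identity ‖x − c‖² = ‖x‖² − 2x·c + ‖c‖²: since ‖x‖² is the same for every centroid, only ‖c‖² − 2x·c needs to be compared, and each tile of that matrix is reduced to labels before the next tile is computed.

```python
from typing import List, Sequence

Vector = Sequence[float]


def assign_fused(samples: Sequence[Vector],
                 centroids: Sequence[Vector],
                 tile_rows: int = 2) -> List[int]:
    """Return the index of the nearest centroid for every sample.

    Illustrative sketch only: the score c_norm - 2 * dot(x, c) equals the
    squared Euclidean distance minus ||x||^2, which does not change the
    argmin over centroids.
    """
    # Squared centroid norms are computed once and reused for all samples.
    c_norms = [sum(v * v for v in c) for c in centroids]
    labels: List[int] = []
    for start in range(0, len(samples), tile_rows):
        # Compute one tile (tile_rows x k) of the score matrix.
        tile = [
            [c_norms[j] - 2.0 * sum(a * b for a, b in zip(x, c))
             for j, c in enumerate(centroids)]
            for x in samples[start:start + tile_rows]
        ]
        # Fused step: reduce the tile to labels right away, so the full
        # n x k distance matrix is never held in memory.
        for row in tile:
            labels.append(min(range(len(row)), key=row.__getitem__))
    return labels


def assign_naive(samples: Sequence[Vector],
                 centroids: Sequence[Vector]) -> List[int]:
    """Reference version: explicit squared distances, no tiling."""
    labels: List[int] = []
    for x in samples:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        labels.append(min(range(len(dists)), key=dists.__getitem__))
    return labels


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, -0.1), (5.0, 5.0), (4.9, 5.2), (0.2, 0.1)]
    ctr = [(0.0, 0.0), (5.0, 5.0)]
    print(assign_fused(pts, ctr))
```

On a real many-core processor the tile would be sized to fit fast local memory and the inner products would be computed with a matrix-multiply-style kernel; here the tiling only shows the order in which the work is done.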


parallel k-means, performance optimization, SW26010 processor, Sunway TaihuLight





We would like to thank Ms. Li-Juan Jiang from the Institute of Software, Chinese Academy of Sciences, for valuable discussions.

Supplementary material

11390_2019_1900_MOESM1_ESM.pdf (384 kb)
ESM 1 (PDF 383 kb)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Min Li (1, 2)
  • Chao Yang (3, 4, 5), Email author
  • Qiao Sun (1)
  • Wen-Jing Ma (1)
  • Wen-Long Cao (1, 2)
  • Yu-Long Ao (3, 4, 5)

  1. Institute of Software, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. School of Mathematical Sciences, Peking University, Beijing, China
  4. Center for Data Science, Peking University, Beijing, China
  5. Peng Cheng Laboratory, Shenzhen, China
