Research on large data set clustering method based on MapReduce



This paper describes in detail the similarities and differences between the MapReduce implementations of the K-means and Canopy algorithms, and analyzes the possibility of combining the two into an algorithm better suited to clustering large data sets. Unlike previous improvements to the K-means algorithm in the literature, it proposes a new approach to sampling and analyzes the selection of the relevant thresholds. It then introduces a MapReduce implementation framework for a K-means algorithm based on Canopy partitioning and filtering, and analyzes some of its pseudocode. Finally, it briefly analyzes the time complexity of the algorithm.
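The general idea of seeding K-means with Canopy centers can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: it uses the textbook Canopy pass with a loose threshold `t1` and a tight threshold `t2`, imitates the map, shuffle, and reduce phases with plain Python functions, and omits the sampling and filtering steps the paper proposes. All function names and parameters here are hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2):
    """Textbook Canopy pass with loose threshold t1 and tight threshold t2.

    Returns a list of (center, members) pairs; canopies may overlap.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    candidates = list(points)
    result = []
    while candidates:
        center = candidates[0]
        # Every point within t1 of the center belongs to this canopy.
        members = [p for p in points if dist(p, center) <= t1]
        result.append((center, members))
        # Points within t2 of the center can no longer start a new canopy.
        candidates = [p for p in candidates if dist(p, center) > t2]
    return result


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(assigned_points):
    """Reduce step: average the points assigned to one centroid."""
    n = len(assigned_points)
    return tuple(sum(coords) / n for coords in zip(*assigned_points))


def canopy_kmeans(points, t1, t2, max_iter=20):
    """Run K-means with the Canopy centers as initial centroids."""
    centroids = [center for center, _ in canopies(points, t1, t2)]
    for _ in range(max_iter):
        groups = defaultdict(list)  # stands in for the shuffle phase
        for p in points:
            key, value = kmeans_map(p, centroids)
            groups[key].append(value)
        updated = [
            kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))
        ]
        if updated == centroids:  # assignments are stable, so stop early
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(canopy_kmeans(data, t1=5.0, t2=3.0))
```

In this sketch the number of clusters is not supplied by the caller: it equals the number of canopies, so it is determined by the thresholds, which is one reason threshold selection matters for this family of methods.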






This work was supported by the Chongqing Big Data Engineering Laboratory for Children, the Chongqing Electronics Engineering Technology Research Center for Interactive Learning, the Science and Technology Research Project of Chongqing Municipal Education Commission of China (No. KJ1601401), the Science and Technology Research Project of Chongqing University of Education (No. KY201725C), the Basic Research and Frontier Exploration of Chongqing Science and Technology Commission (No. CSTC2014jcyjA40019), and the Project of Science and Technology Research Program of Chongqing Education Commission of China (No. KJZD-K201801601).

Author information

Correspondence to Fangcheng He.


About this article


Cite this article

Wei, P., He, F., Li, L. et al. Research on large data set clustering method based on MapReduce. Neural Comput & Applic 32, 93–99 (2020).



  • MapReduce
  • Large data
  • Set clustering method