The similarities and differences between the K-means algorithm and the Canopy algorithm’s MapReduce implementation are described in detail, and the possibility of combining the two to design a better algorithm suitable for clustering analysis of large data sets is analyzed in this paper. Different from the previous literature’s improvement ideas for K-means algorithm, it proposes new ideas for sampling and analyzes the selection of relevant thresholds in this paper. Finally, it introduces the MapReduce implementation framework based on Canopy partitioning and filtering K-means algorithm and analyzes some pseudocode in this chapter. Finally, it briefly analyzes the time complexity of the algorithm in this paper.
This is a preview of subscription content, log in to check access.
Buy single article
Instant unlimited access to the full article PDF.
Price includes VAT for USA
Subscribe to journal
Immediate online access to all issues from 2019. Subscription will auto renew annually.
This is the net price. Taxes to be calculated in checkout.
Alexey B, Dmytro I, Oleg R et al (2018) Constraints on decaying dark matter from XMM-Newton observations of M31. Mon Not R Astron Soc 387(4):1361–1373
Treu T, Dutton AA, Auger MW et al (2018) The SWELLS survey-I. A large spectroscopically selected sample of edge-on late-type lens galaxies. Mon Not R Astron Soc 417(3):1601–1620
Efstathiou G, Gratton S, Paci F (2018) Impact of Galactic polarized emission on B-mode detection at low multipoles. Mon Not R Astron Soc 397(3):1355–1373
Driver SP, Robotham ASG (2018) Quantifying cosmic variance. Mon Not R Astron Soc 407(4):2131–2140
Humphrey PJ, Buote DA, Brighenti F et al (2018) Reconciling stellar dynamical and hydrostatic X-ray mass measurements of an elliptical galaxy with gas rotation, turbulence and magnetic fields. Mon Not R Astron Soc 430(3):1516–1528
Barentsen G, Vink JS, Drew JE et al (2018) Bayesian inference of T Tauri star properties using multi-wavelength survey photometry. Mon Not R Astron Soc 429(3):1981–2000
Littlefair SP, Naylor T, Mayne NJ et al (2018) Rotation of young stars in Cepheus OB3b. Mon Not R Astron Soc 403(2):545–557
Clark CD (2017) Emergent drumlins and their clones: from till dilatancy to flow instabilities. J Glaciol 51(200):1011–1025
Peng H, Li B, Ling H et al (2017) Salient object detection via structured matrix decomposition. IEEE Trans Pattern Anal Mach Intell 39(4):818–832
Mukherjee AP, Tirthapura S (2017) Enumerating maximal bicliques from a large graph using MapReduce. IEEE Trans Serv Comput 10(5):771–784
Kim Y, Shim K, Kim MS et al (2014) DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce. Inf Syst 42(2):15–35
Río SD, López V, Benítez JM et al (2015) A MapReduce approach to address big data classification problems based on the fusion of linguistic fuzzy rules. Int J Comput Intell Syst 8(3):422–437
Nagwani NK (2015) Summarizing large text collection using topic modeling and clustering based on MapReduce framework. J Big Data 2(1):1–18
Xiaoshan YU, Yangyang WU (2014) Parallel text hierarchical clustering based on MapReduce. J Comput Appl 34(6):1595–1599
Fan T (2017) Research and implementation of user clustering based on MapReduce in multimedia big data. Multimed Tools Appl 1:1–15
Leng YL, Zhang QC (2014) A big graph clustering algorithm based on MapReduce. Adv Mater Res 1049–1050:1467–1470
Xia D, Wang B, Li Y et al (2015) An efficient MapReduce-based parallel clustering algorithm for distributed traffic subarea division. Discrete Dyn Nat Soc 2015(6018):1–18
Lamari Y, Slaoui SC (2017) Clustering categorical data based on the relational analysis approach and MapReduce. J Big Data 4(1):28
Hajkacem MAB, N’Cir CEB, Essoussi N (2017) One-pass MapReduce-based clustering method for mixed large scale data. J Intell Inf Syst 2:1–18
Sun Z, Fox G, Gu W et al (2014) A parallel clustering method combined information bottleneck theory and centroid-based clustering. J Supercomput 69(1):452–467
This work was supported by Chongqing Big Data Engineering Laboratory for Children, Chongqing Electronics Engineering Technology Research Center for Interactive Learning, the Science and Technology Research Project of Chongqing Municipal Education Commission of China (No. KJ1601401), the Science and Technology Research Project of Chongqing University of Education (No. KY201725C), Basic research and Frontier Exploration of Chongqing Science and Technology Commission (CSTC2014jcyjA40019), Project of Science and Technology Research Program of Chongqing Education Commission of China (No. KJZD-K201801601).
About this article
Cite this article
Wei, P., He, F., Li, L. et al. Research on large data set clustering method based on MapReduce. Neural Comput & Applic 32, 93–99 (2020). https://doi.org/10.1007/s00521-018-3780-y
- Large data
- Set clustering method