An Accelerated MapReduce-Based K-prototypes for Big Data

  • Mohamed Aymen Ben HajKacem
  • Chiheb-Eddine Ben N’cir
  • Nadia Essoussi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9946)


Big data are often characterized by huge volume and by a variety of attribute types, namely numerical and categorical. To cluster such mixed data at scale, this paper proposes an accelerated MapReduce-based k-prototypes method. The proposed method uses a pruning strategy to accelerate the clustering process by reducing unnecessary distance computations between cluster centers and data points. Experiments performed on huge synthetic and real data sets show that the proposed method is scalable and is more efficient than the existing MapReduce-based k-prototypes method.
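The abstract does not spell out the pruning rule, so the following is only a rough illustration of how distance computations between a data point and cluster centers can be cut short. It assumes the usual k-prototypes dissimilarity (squared Euclidean distance on numerical attributes plus a weight gamma times the number of categorical mismatches) and swaps in a generic early-abandoning test, which is not necessarily the authors' strategy. The function names, the data layout, and the value of gamma are hypothetical.

```python
from typing import List, Sequence, Tuple

# A center is a pair: (numerical part, categorical part). Assumed layout.
Center = Tuple[Sequence[float], Sequence[str]]


def mixed_distance(x_num: Sequence[float], x_cat: Sequence[str],
                   c_num: Sequence[float], c_cat: Sequence[str],
                   gamma: float, bound: float = float("inf")) -> float:
    """Mixed dissimilarity with early abandoning.

    Returns the exact distance when it is below `bound`; otherwise stops
    as soon as the partial sum reaches `bound` and returns that partial sum.
    Every term is non-negative, so abandoning never changes the winner.
    """
    total = 0.0
    for a, b in zip(x_num, c_num):
        total += (a - b) ** 2
        if total >= bound:
            return total  # pruned: this center cannot be the closest
    for a, b in zip(x_cat, c_cat):
        if a != b:
            total += gamma
            if total >= bound:
                return total  # pruned on the categorical part
    return total


def assign(x_num: Sequence[float], x_cat: Sequence[str],
           centers: List[Center], gamma: float) -> Tuple[int, float]:
    """Return (index, distance) of the closest center for one data point."""
    best_idx, best_dist = -1, float("inf")
    for idx, (c_num, c_cat) in enumerate(centers):
        d = mixed_distance(x_num, x_cat, c_num, c_cat, gamma, bound=best_dist)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist


# Hypothetical data: two centers, one point with two numerical and two
# categorical attributes.
centers = [([0.0, 0.0], ["a", "x"]), ([10.0, 10.0], ["b", "y"])]
closest = assign([1.0, 1.0], ["a", "y"], centers, gamma=0.5)
```

In a MapReduce setting, an assignment step of this kind would typically run inside the map function for each data point, with the reduce function recomputing the centers; that placement is an assumption here rather than something the abstract states.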


Keywords: K-prototypes, MapReduce, Big data, Mixed data



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Mohamed Aymen Ben HajKacem (1, email author)
  • Chiheb-Eddine Ben N’cir (1)
  • Nadia Essoussi (1)

  1. LARODEC, Université de Tunis, Institut Supérieur de Gestion de Tunis, Le Bardo, Tunisia
