A fuzzy data reduction cluster method based on boundary information for large datasets

  • Gustavo R. L. Silva
  • Paulo C. Neto
  • Luiz C. B. Torres
  • Antônio P. Braga
WSOM 2017


The fuzzy c-means (FCM) algorithm computes the membership degree of each data point with respect to every cluster center, which requires calculating the distance matrix between all data points and all cluster centers. Computing this membership matrix over the full dataset is the main bottleneck of FCM. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means), based on the original FCM proposal and adapted to detect and remove the boundary regions of clusters. Our implementation effort is directed at two goals: processing large datasets in less time and reducing the data volume while maintaining cluster quality. Experiments on large real-world datasets (> 10⁶ records) show that the bdrFCM implementation scales well to datasets with millions of data points.
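The membership computation that the abstract identifies as the bottleneck can be sketched in a few lines of NumPy. This is a minimal illustration of the standard FCM membership update (with fuzzifier m, using the usual formula u_ik = 1 / Σ_j (d_ik / d_ij)^(2/(m−1))), not the authors' bdrFCM reduction step; the function name and parameters are our own for illustration.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard FCM membership matrix U (n points x c clusters).

    u[i, k] = 1 / sum_j (d(x_i, c_k) / d(x_i, c_j)) ** (2 / (m - 1)).
    The all-pairs point-to-center distance matrix computed here is the
    cost that grows prohibitively for datasets with millions of points.
    """
    # distances of every point to every cluster center, shape (n, c);
    # eps avoids division by zero when a point coincides with a center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    power = 2.0 / (m - 1.0)
    # ratio tensor r[i, k, j] = d[i, k] / d[i, j], summed over j
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** power, axis=2)
    return u
```

Each row of the returned matrix sums to 1, and the memory/time cost of the distance tensor is what motivates reduction strategies such as discarding points far from cluster boundaries.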


Keywords: Fuzzy c-means · Large dataset · Boundary information



This work has been supported by the Brazilian agencies CAPES, CNPq, and FAPEMIG.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Graduate Program in Electrical Engineering, Federal University of Minas Gerais, Belo Horizonte, Brazil
