Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8583)

Abstract

Clustering is an essential data mining tool for analyzing big data. Applying clustering techniques to big data is difficult, however, because of the new challenges that big data raises. Because big data refers to terabytes and petabytes of data, and clustering algorithms come with high computational costs, the question is how to cope with this problem: how to deploy clustering techniques on big data and obtain results in a reasonable time. This study reviews the trend and progress of clustering algorithms in coping with big data challenges, from the very first proposed algorithms to today’s novel solutions. The algorithms, and the challenges they target in producing improved clustering algorithms, are introduced and analyzed, and the possible future path toward more advanced algorithms is then outlined based on today’s available technologies and frameworks.

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T. (2014). Big Data Clustering: A Review. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2014. ICCSA 2014. Lecture Notes in Computer Science, vol 8583. Springer, Cham. https://doi.org/10.1007/978-3-319-09156-3_49

  • DOI: https://doi.org/10.1007/978-3-319-09156-3_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09155-6

  • Online ISBN: 978-3-319-09156-3

  • eBook Packages: Computer Science, Computer Science (R0)
