Big Data Clustering: A Review

Shirkhorshidi, Ali Seyed; Aghabozorgi, Saeed; Wah, Teh Ying; Herawan, Tutut

doi:10.1007/978-3-319-09156-3_49

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8583)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

Clustering is an essential data mining tool for analyzing big data. Applying clustering techniques to big data is difficult because of the new challenges that big data raises. Because big data refers to terabytes and petabytes of data, and clustering algorithms come with high computational costs, the question is how to cope with this problem and how to deploy clustering techniques on big data and obtain results in a reasonable time. This study aims to review the trend and progress of clustering algorithms in coping with big data challenges, from the very first proposed algorithms to today’s novel solutions. The algorithms and the challenges they target in producing improved clustering algorithms are introduced and analyzed, and afterward a possible future path toward more advanced algorithms is outlined based on today’s available technologies and frameworks.

Author information

Authors and Affiliations

Department of Information Systems Faculty of Computer Science and Information Technology, University of Malaya, 50603, Pantai Valley, Kuala Lumpur, Malaysia
Ali Seyed Shirkhorshidi, Saeed Aghabozorgi, Teh Ying Wah & Tutut Herawan
AMCS Research Center, Yogyakarta, Indonesia
Tutut Herawan

Editor information

Editors and Affiliations

School of Engineering, University of Basilicata, 85100, Potenza, Italy
Beniamino Murgante
Department of Computer and Information Sciences, Covenant University, Ota, Nigeria
Sanjay Misra
Department of Production and Systems, University of Minho, 4710-057, Braga, Portugal
Ana Maria A. C. Rocha
DICAR, Politecnico di Bari, 70125, Bari, Italy
Carmelo Torre
University of Minho, Braga, Portugal
Jorge Gustavo Rocha & Maria Irene Falcão
Monash University, 3800, Clayton, VIC, Australia
David Taniar
Department of Intelligent Informatics, Kyushu Sangyo University, 2-3-1 Matsukadai, 813-8503, Higashi-ku, Fukuoka, Japan
Bernady O. Apduhan
Department of Mathematics and Computer Science, University of Perugia, Via Vanvitelli, 1, 06123, Perugia, Italy
Osvaldo Gervasi

About this paper

Cite this paper

Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T. (2014). Big Data Clustering: A Review. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2014. ICCSA 2014. Lecture Notes in Computer Science, vol 8583. Springer, Cham. https://doi.org/10.1007/978-3-319-09156-3_49

DOI: https://doi.org/10.1007/978-3-319-09156-3_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09155-6
Online ISBN: 978-3-319-09156-3
eBook Packages: Computer Science, Computer Science (R0)
