Abstract
As a critical research topic for the new era of big data, the development of high-performance data analytics systems has received significant research attention from different disciplines since the 2000s. Many recent works have attempted to develop such a system to handle the large amount of data (i.e., volume) from different information systems (i.e., variety) that is typically created in a short time (i.e., velocity). In particular, several recent studies have shown that metaheuristic algorithms can be applied to many data mining optimization problems and can find higher-quality results than traditional deterministic algorithms. This paper presents a high-performance clustering algorithm for big data analytics systems. The proposed algorithm is based on a recent metaheuristic algorithm, coral reef optimization with substrate layers (CRO-SL), to obtain a better clustering result. To improve effectiveness and efficiency, the proposed CRO-SL scheme has also been applied to a cloud computing platform to reduce the response time of a data analytics system. The simulation results show that, in terms of the sum of squared errors, the proposed algorithm provides a better clustering result than the other clustering algorithms compared in this research: k-means, the genetic k-means algorithm, particle swarm optimization, and the simple coral reef optimization algorithm.
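The comparison criterion named in the abstract, the sum of squared errors (SSE), can be sketched as follows. This is a minimal illustration that assumes squared Euclidean distance and assignment of each point to its nearest centroid; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sum_of_squared_errors(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """Sum, over all points, of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Toy data (assumed for illustration): two well-separated groups in 2D.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centroids = [(0.0, 1.0), (10.0, 1.0)]
one_centroid = [(5.0, 1.0)]

sse_two = sum_of_squared_errors(points, two_centroids)
sse_one = sum_of_squared_errors(points, one_centroid)
print(sse_two, sse_one)
```

A lower SSE means the points lie closer to their assigned centroids, which is why the two-centroid solution scores better than the single-centroid one on this toy data.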
Notes
CRO-SL, presented in Salcedo-Sanz et al. (2016), is an extended version of the coral reefs optimization algorithm.
References
Agrawal D, Das S, El Abbadi A (2011) Big data and cloud computing: current state and future opportunities. In: Proceedings of the international conference on extending database technology, pp 530–533
Ashish T, Kapil S, Manju B (2018) Parallel bat algorithm-based clustering using MapReduce. In: Proceedings of the networking communication and data knowledge engineering. Springer Singapore, pp 73–82
Bandyopadhyay S, Maulik U (2002) An evolutionary technique based on K-means algorithm for optimal clustering in \(R^N\). Inf Sci 146(1):221–237
Baraniuk RG (2011) More is less: signal processing and the data deluge. Science 331(6018):717–719
Blum C, Roli A (2003) Metaheuristics in combinatorial optimization: overview and conceptual comparison. ACM Comput Surv 35(3):268–308
Bryan K, Cunningham P, Bolshakova N (2005) Biclustering of expression data using simulated annealing. In: Proceedings of the IEEE symposium on computer-based medical systems (CBMS’05), pp 383–388
Daoudi M, Hamena S, Benmounah Z, Batouche M (2014) Parallel differential evolution clustering algorithm based on MapReduce. In: Proceedings of the international conference of soft computing and pattern recognition, pp 337–341
Debuse JC, Rayward-Smith VJ (1997) Feature subset selection within a simulated annealing data mining algorithm. J Intell Inf Syst 9(1):57–81
Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Fang W, Lau KK, Lu M, Xiao X, Lam CK, Yang PY, He B, Luo Q, Sander PV, Yang K (2008) Parallel data mining on graphics processors. Tech. Rep., The Hong Kong University of Science and Technology
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery in databases. AI Mag 17:37–54
Ficco M, Esposito C, Palmieri F, Castiglione A (2018) A coral-reefs and game theory-based approach for optimizing elastic cloud resource allocation. Future Gener Comput Syst 78:343–352
Glover F, Kochenberger GA (eds) (2003) Handbook of metaheuristics. Springer, Berlin
Handl J, Meyer B (2007) Ant-based and swarm-based clustering. Swarm Intell 1(2):95–113
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco. ISBN 0123814790, 9780123814791
Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU (2015) The rise of “big data” on cloud computing: review and open research issues. Inf Syst 47:98–115
Hoffman P, Grinstein G, Pinkney D (1999) Dimensional anchors: a graphic primitive for multidimensional multivariate information visualizations. In: Proceedings of the workshop on new paradigms in information visualization and manipulation in conjunction with the ACM international conference on information and knowledge management, pp 9–16
Huang DW, Lin J (2010) Scaling populations of a genetic algorithm for job shop scheduling problems using MapReduce. In: Proceedings of the IEEE second international conference on cloud computing technology and science, pp 780–785
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of international conference on neural networks, vol 4, pp 1942–1948
Krishna K, Murty MN (1999) Genetic \(k\)-means algorithm. IEEE Trans Syst Man Cybern Part B 29(3):433–439
Lai JZC, Liaw Y-C, Liu J (2008) A fast VQ codebook generation algorithm using codeword displacement. Pattern Recognit Lett 41(1):315–319
Laney D (2001) 3D data management: controlling data volume, velocity, and variety. Tech. Rep., META Group
Liu B (2009) Web data mining: exploring hyperlinks, contents, and usage data. Springer, Berlin
Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727
Lu Y, Cao B, Rego C, Glover F (2018) A Tabu search based clustering algorithm and its parallel implementation on Spark. Appl Soft Comput 63:97–109
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1: statistics, pp 281–297
Maimon O (2009) Soft computing for knowledge discovery and data mining. Springer, Berlin. ISBN 144194351X, 9781441943514
Medeiros IG, Xavier JC, Canuto AMP (2015) Applying the coral reefs optimization algorithm to clustering problems. In: Proceedings of the international joint conference on neural networks, pp 1–8
Mitra S, Pal SK, Mitra P (2002) Data mining in soft computing framework: a survey. IEEE Trans Neural Netw 13(1):3–14
Ostfeld A, Salomons S (2005) A hybrid genetic-instance based learning algorithm for CE-QUAL-W2 calibration. J Hydrol 310(1):122–142
Parpinelli RS, Lopes HS, Freitas AA (2002) Data mining with an ant colony optimization algorithm. IEEE Trans Evolut Comput 6(4):321–332
Radviz (2018) https://cran.r-project.org/web/packages/Radviz/vignettes/single_cell_projections.html
Raghupathi W, Raghupathi V (2014) Big data analytics in healthcare: promise and potential. Health Inf Sci Syst 2(3):1–10
Sagiroglu S, Sinanc D (2013) Big data: a review. In: Proceedings of the international conference on collaboration technologies and systems (CTS), pp 42–47
Salcedo-Sanz S, Ser JD, Gil-López S, Landa-Torres I, Portilla-Figueras JA (2013a) The coral reefs optimization algorithm: an efficient meta-heuristic for solving hard optimization problems. In: Proceedings of the applied stochastic models and data analysis international conference, pp 751–758
Salcedo-Sanz S, Pastor-Sánchez A, Gallo-Marazuela D, Portilla-Figueras A (2013b) A novel coral reefs optimization algorithm for multi-objective problems. In: Proceedings of the intelligent data engineering and automated learning, pp 326–333
Salcedo-Sanz S, Ser JD, Landa-Torres I, Gil-López S, Portilla-Figueras JA (2014a) The coral reefs optimization algorithm: a novel metaheuristic for efficiently solving optimization problems. Sci World J 2014:1–15
Salcedo-Sanz S, García-Díaz P, Portilla-Figueras J, Ser JD, Gil-López S (2014b) A coral reefs optimization algorithm for optimal mobile network deployment with electromagnetic pollution control criterion. Appl Soft Comput 24:239–248
Salcedo-Sanz S, Gallo-Marazuela D, Pastor-Sánchez A, Carro-Calvo L, Portilla-Figueras A, Prieto L (2014c) Offshore wind farm design with the coral reefs optimization algorithm. Renew Energy 63:109–115
Salcedo-Sanz S, Casanova-Mateo C, Pastor-Sánchez A, Sánchez-Girón M (2014d) Daily global solar radiation prediction based on a hybrid coral reefs optimization—extreme learning machine approach. Sol Energy 105:91–98
Salcedo-Sanz S, Pastor-Sánchez A, Ser JD, Prieto L, Geem Z (2015) A coral reefs optimization algorithm with harmony search operators for accurate wind speed prediction. Renew Energy 75:93–101
Salcedo-Sanz S, Camacho-Gómez C, Molina D, Herrera F (2016) A coral reefs optimization algorithm with substrate layers and local search for large scale global optimization. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp 3574–3581
Sarazin T, Azzag H, Lebbah M (2014) SOM clustering using Spark-MapReduce. In: Proceedings of the IEEE international parallel distributed processing symposium workshops, pp 1727–1734
Selim SZ, Alsultan K (1991) A simulated annealing algorithm for the clustering problem. Pattern Recognit 24(10):1003–1008
Shmueli G, Bruce PC, Yahav I, Patel NR, Lichtendahl KC Jr (2017) Data mining for business analytics: concepts, techniques, and applications in R. Wiley, Hoboken
Teijeiro D, Pardo XC, González P, Banga JR, Doallo R (2016) Implementing parallel differential evolution on Spark. In: Proceedings of the applications of evolutionary computation. Springer, pp 75–90
Tsai C, Lai C, Chiang M, Yang LT (2014) Data mining for internet of things: a survey. IEEE Commun Surv Tutor 16(1):77–97
Tsai C-W, Huang K-W, Yang C-S, Chiang M-C (2015) A fast particle swarm optimization for clustering. Soft Comput 19(2):321–338
Tsai C-W, Chang H-C, Hu K-C, Chiang M-C (2016) Parallel coral reef algorithm for solving JSP on Spark. In: Proceedings of the IEEE international conference on systems, man, and cybernetics, pp 1872–1877
Tsai C-W, Liu S-J, Wang Y-C (2018) A parallel metaheuristic data clustering framework for cloud. J Parallel Distrib Comput 116:39–49
Tseng L-Y, Chen C (2008) Multiple trajectory search for large scale global optimization. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp 3052–3059
User locations until 2012 (FINLAND) (2018). http://cs.uef.fi/mopsi/data/
van der Merwe DW, Engelbrecht AP (2003) Data clustering using particle swarm optimization. Proc Evolut Comput 1:215–220
Wang Y-C, Tsai C-W (2008) An efficient coral reef optimization with substrate layers for clustering problem on Spark. In: Proceedings of IEEE international conference on systems, man and cybernetics
Wang B, Yin J, Hua Q, Wu Z, Cao J (2016) Parallelizing \(k\)-means-based clustering on Spark. In: Proceedings of the international conference on advanced cloud and big data, pp 31–36
Wu R, Zhang B, Hsu M (2009) Clustering billions of data points using GPUs. In: Proceedings of the combined workshops on unconventional high performance computing workshop plus memory access workshop, pp 1–6
Wu B, Wu G, Yang M (2012) A MapReduce based ant colony optimization approach to combinatorial optimization problems. In: Proceedings of the international conference on natural computation, pp 728–732
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Zhou J, Yu K-M, Wu B-C (2010) Parallel frequent patterns mining algorithm on GPU. In: Proceedings of the IEEE international conference on systems, man and cybernetics, pp 435–440
Güngör Z, Ünler A (2008) K-harmonic means data clustering with tabu-search method. Appl Math Model 32(6):1115–1125
Funding
This work was supported in part by the Ministry of Science and Technology of Taiwan, R.O.C., under Contracts MOST106-2221-E-005-094, MOST107-2221-E-005-029, MOST107-2221-E-005-022 and MOST107-2218-E-005-018.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Communicated by A.K. Sangaiah, H. Pham, M.-Y. Chen, H. Lu, F. Mercaldo.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Tsai, CW., Chang, WY., Wang, YC. et al. A high-performance parallel coral reef optimization for data clustering. Soft Comput 23, 9327–9340 (2019). https://doi.org/10.1007/s00500-019-03950-3
DOI: https://doi.org/10.1007/s00500-019-03950-3