Implementing Scalable Machine Learning Algorithms for Mining Big Data: A State-of-the-Art Survey

  • Chapter
  • First Online:
Big Data in Engineering Applications

Part of the book series: Studies in Big Data (SBD, volume 44)

Abstract

The growing trend of Big Data drives additional demand for novel solutions and specifically designed algorithms that perform efficient Big Data filtering and processing, increasingly in real time. Thus, the need to scale Machine Learning algorithms to larger datasets and more complex methods should be addressed through distributed parallelism. This book chapter conducts a thorough literature review of distributed, parallel, data-intensive Machine Learning algorithms applied to Big Data to date. The selected algorithms fall into various Machine Learning categories, including (i) unsupervised learning, (ii) supervised learning, (iii) semi-supervised learning and (iv) deep learning. The most popular programming frameworks well suited for parallelizing Machine Learning algorithms, such as MapReduce, PLANET, DryadLINQ, the IBM Parallel Machine Learning Toolbox (PML) and the Compute Unified Device Architecture (CUDA), are cited throughout the review. However, the review focuses mainly on the performance and implementation traits of scalable Machine Learning algorithms, rather than on broad framework choices and their trade-offs.
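To make the MapReduce-style decomposition mentioned above concrete, the following minimal sketch, written for illustration only and not drawn from the chapter or from any of the reviewed implementations, shows how a single k-means iteration splits into a map step (assigning each point to its nearest centroid) and a reduce step (recomputing each centroid from its assigned points). All function names and the toy data are assumptions of this sketch.

```python
# Illustrative sketch: one k-means iteration expressed as map and reduce steps,
# the decomposition used by MapReduce-style parallel k-means implementations.
# Not taken from the chapter; all names and data here are hypothetical.
from collections import defaultdict
import math


def mapper(point, centroids):
    """Map step: emit (nearest_centroid_index, (point, 1)) for one point."""
    distances = [math.dist(point, c) for c in centroids]
    nearest = distances.index(min(distances))
    return nearest, (point, 1)


def reducer(assigned_values):
    """Reduce step: average all points assigned to a single centroid."""
    dim = len(assigned_values[0][0])
    sums = [0.0] * dim
    count = 0
    for point, n in assigned_values:
        count += n
        for i, x in enumerate(point):
            sums[i] += x
    return [s / count for s in sums]


def kmeans_iteration(points, centroids):
    """Group the mapper output by centroid index, then reduce per group."""
    groups = defaultdict(list)
    for p in points:
        key, value = mapper(p, centroids)
        groups[key].append(value)
    # Keep a centroid unchanged if no point was assigned to it.
    return [reducer(groups[k]) if k in groups else centroids[k]
            for k in range(len(centroids))]


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
    centroids = [(1.0, 1.0), (9.0, 9.0)]
    print(kmeans_iteration(points, centroids))
```

In a real distributed setting the grouping loop is replaced by the framework's shuffle phase, and the iteration is repeated until the centroids converge; the sketch only shows the map/reduce factoring itself.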



Author information

Corresponding author

Correspondence to Marjana Prifti Skënduli.


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Skënduli, M.P., Biba, M., Ceci, M. (2018). Implementing Scalable Machine Learning Algorithms for Mining Big Data: A State-of-the-Art Survey. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_4


  • DOI: https://doi.org/10.1007/978-981-10-8476-8_4


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8475-1

  • Online ISBN: 978-981-10-8476-8

  • eBook Packages: Engineering, Engineering (R0)
