Implementing Scalable Machine Learning Algorithms for Mining Big Data: A State-of-the-Art Survey

  • Chapter
  • First Online:
Big Data in Engineering Applications

Part of the book series: Studies in Big Data (SBD, volume 44)

Abstract

The growing trend of Big Data drives additional demand for novel solutions and specifically designed algorithms that perform efficient Big Data filtering and processing, increasingly in real time. Thus, the need to scale Machine Learning algorithms to larger datasets and more complex methods should be addressed through distributed parallelism. This book chapter conducts a thorough literature review of distributed, parallel, data-intensive Machine Learning algorithms applied to Big Data to date. The selected algorithms fall into various Machine Learning categories, including (i) unsupervised learning, (ii) supervised learning, (iii) semi-supervised learning and (iv) deep learning. The most popular programming frameworks well suited for parallelizing Machine Learning algorithms, such as MapReduce, PLANET, DryadLINQ, the IBM Parallel Machine Learning Toolbox (PML) and the Compute Unified Device Architecture (CUDA), are cited throughout the review. However, the review focuses mainly on the performance and implementation traits of scalable Machine Learning algorithms, rather than on broad framework choices and their trade-offs.
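To make the MapReduce-style decomposition mentioned above concrete, the following minimal sketch, written for illustration only and not drawn from the chapter or from any of the reviewed implementations, shows how a single k-means iteration splits into a map step (assigning each point to its nearest centroid) and a reduce step (recomputing each centroid from its assigned points). All function names and the toy data are assumptions of this sketch.

```python
# Illustrative sketch: one k-means iteration expressed as map and reduce steps,
# the decomposition used by MapReduce-style parallel k-means implementations.
# Not taken from the chapter; all names and data here are hypothetical.
from collections import defaultdict
import math


def mapper(point, centroids):
    """Map step: emit (nearest_centroid_index, (point, 1)) for one point."""
    distances = [math.dist(point, c) for c in centroids]
    nearest = distances.index(min(distances))
    return nearest, (point, 1)


def reducer(assigned_values):
    """Reduce step: average all points assigned to a single centroid."""
    dim = len(assigned_values[0][0])
    sums = [0.0] * dim
    count = 0
    for point, n in assigned_values:
        count += n
        for i, x in enumerate(point):
            sums[i] += x
    return [s / count for s in sums]


def kmeans_iteration(points, centroids):
    """Group the mapper output by centroid index, then reduce per group."""
    groups = defaultdict(list)
    for p in points:
        key, value = mapper(p, centroids)
        groups[key].append(value)
    # Keep a centroid unchanged if no point was assigned to it.
    return [reducer(groups[k]) if k in groups else centroids[k]
            for k in range(len(centroids))]


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
    centroids = [(1.0, 1.0), (9.0, 9.0)]
    print(kmeans_iteration(points, centroids))
```

In a real distributed setting the grouping loop is replaced by the framework's shuffle phase, and the iteration is repeated until the centroids converge; the sketch only shows the map/reduce factoring itself.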



Author information

Corresponding author

Correspondence to Marjana Prifti Skënduli.


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Skënduli, M.P., Biba, M., Ceci, M. (2018). Implementing Scalable Machine Learning Algorithms for Mining Big Data: A State-of-the-Art Survey. In: Roy, S., Samui, P., Deo, R., Ntalampiras, S. (eds) Big Data in Engineering Applications. Studies in Big Data, vol 44. Springer, Singapore. https://doi.org/10.1007/978-981-10-8476-8_4


  • DOI: https://doi.org/10.1007/978-981-10-8476-8_4


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8475-1

  • Online ISBN: 978-981-10-8476-8

  • eBook Packages: Engineering, Engineering (R0)
