Distributed Platforms and Cloud Services: Enabling Machine Learning for Big Data

Pop, Daniel; Iuhasz, Gabriel; Petcu, Dana

doi:10.1007/978-3-319-31861-5_7

Abstract

Applying popular machine learning algorithms to large amounts of data has raised new challenges for machine learning practitioners. Traditional libraries do not properly support the processing of huge data sets, so new approaches are needed. Novel machine learning libraries have been developed using modern distributed computing paradigms, such as MapReduce or in-memory processing. At the same time, the machine learning community could not ignore the advance of cloud computing over the past 10 years, and the rise of cloud-based platforms has been significant. This chapter presents an overview of novel platforms, libraries, and cloud services that data scientists can use to extract knowledge from large unstructured and semi-structured data sets. The overview covers several popular packages that enable distributed computing in popular machine learning environments, distributed platforms for machine learning, and cloud services for machine learning, known as the machine-learning-as-a-service approach. We also provide a number of recommendations for data scientists considering a machine learning approach for their problem.

Acknowledgments

This work was supported by the European Commission H2020 co-funded project DICE (GA 644869).

Author information

Authors and Affiliations

Institute e-Austria Timisoara, West University of Timisoara, Blvd. Vasile Parvan, nr. 4, 300223, Timișoara, Romania
Daniel Pop, Gabriel Iuhasz & Dana Petcu

Corresponding author

Correspondence to Daniel Pop.

Editor information

Editors and Affiliations

Department of Computing and Mathematics, University of Derby, Derby, United Kingdom
Zaigham Mahmood

About this chapter

Cite this chapter

Pop, D., Iuhasz, G., Petcu, D. (2016). Distributed Platforms and Cloud Services: Enabling Machine Learning for Big Data. In: Mahmood, Z. (eds) Data Science and Big Data Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-31861-5_7

DOI: https://doi.org/10.1007/978-3-319-31861-5_7
Published: 06 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31859-2
Online ISBN: 978-3-319-31861-5
eBook Packages: Computer Science, Computer Science (R0)
