Abstract
Applying popular machine learning algorithms to large amounts of data has raised new challenges for machine learning practitioners. Traditional libraries do not properly support the processing of huge data sets, so new approaches are needed. Novel machine learning libraries have been developed using modern distributed computing paradigms, such as MapReduce or in-memory processing. At the same time, the machine learning community could not ignore the advance of cloud computing over the past 10 years, and the resulting rise of cloud-based platforms has been significant. This chapter presents an overview of novel platforms, libraries, and cloud services that data scientists can use to extract knowledge from large unstructured and semi-structured data sets. The overview covers several popular packages that enable distributed computing in popular machine learning environments, distributed platforms for machine learning, and cloud services for machine learning, known as the machine-learning-as-a-service approach. We also provide a number of recommendations for data scientists choosing a machine learning approach for their problem.
Acknowledgments
This work was supported by the European Commission through the H2020 co-funded project DICE (GA 644869).
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Pop, D., Iuhasz, G., Petcu, D. (2016). Distributed Platforms and Cloud Services: Enabling Machine Learning for Big Data. In: Mahmood, Z. (eds) Data Science and Big Data Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-31861-5_7
DOI: https://doi.org/10.1007/978-3-319-31861-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31859-2
Online ISBN: 978-3-319-31861-5
eBook Packages: Computer Science, Computer Science (R0)