Distributed Platforms and Cloud Services: Enabling Machine Learning for Big Data

Data Science and Big Data Computing

Abstract

Applying popular machine learning algorithms to large amounts of data has raised new challenges for machine learning practitioners. Traditional libraries do not properly support the processing of huge data sets, so new approaches are needed. Novel machine learning libraries have been developed using modern distributed computing paradigms, such as MapReduce or in-memory processing. At the same time, the machine learning community could not ignore the advance of cloud computing over the past 10 years, and cloud-based platforms have risen to significance. This chapter presents an overview of novel platforms, libraries, and cloud services that data scientists can use to extract knowledge from unstructured and semi-structured large data sets. The overview covers several popular packages that enable distributed computing in popular machine learning environments, distributed platforms for machine learning, and cloud services for machine learning, known as the machine-learning-as-a-service approach. We also provide a number of recommendations for data scientists to consider when choosing a machine learning approach for their problem.

Acknowledgments

This work was supported by the European Commission H2020 co-funded project DICE (GA 644869).

Author information

Corresponding author

Correspondence to Daniel Pop.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Pop, D., Iuhasz, G., Petcu, D. (2016). Distributed Platforms and Cloud Services: Enabling Machine Learning for Big Data. In: Mahmood, Z. (eds) Data Science and Big Data Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-31861-5_7

  • DOI: https://doi.org/10.1007/978-3-319-31861-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31859-2

  • Online ISBN: 978-3-319-31861-5

  • eBook Packages: Computer Science (R0)
