Encyclopedia of Database Systems

2018 Edition
| Editors: Ling Liu, M. Tamer Özsu

Distributed Machine Learning

  • Alex Galakatos
  • Andrew Crotty
  • Tim Kraska
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_80647


Data mining; Large-scale learning; Machine learning


Distributed machine learning refers to machine learning algorithms and systems that run across multiple nodes in order to improve performance, increase accuracy, and scale to larger input data sizes. For many algorithms, increasing the input data size can significantly reduce the learning error and is often more effective than using more complex methods [8]. Distributed machine learning allows companies, researchers, and individuals to make informed decisions and draw meaningful conclusions from large amounts of data.
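As an illustration of the multi-node idea in the definition, the following is a minimal single-process sketch of one common pattern, data-parallel gradient descent. It is not taken from this entry or from any of the systems it covers. The names `partial_gradient`, `split_into_shards`, `distributed_gradient`, and `train`, the round-robin partitioning, and the one-parameter linear model are all assumptions made for the example; in a real deployment each shard would live on a separate node and the per-shard work would run in parallel.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (feature x, label y)


def partial_gradient(shard: Sequence[Example], w: float) -> Tuple[float, int]:
    """Work for one hypothetical node: summed squared-error gradient on its shard."""
    grad_sum = sum(2.0 * (w * x - y) * x for x, y in shard)
    return grad_sum, len(shard)


def split_into_shards(data: Sequence[Example], num_nodes: int) -> List[Sequence[Example]]:
    """Round-robin partitioning of the input across nodes (an assumed scheme)."""
    return [data[i::num_nodes] for i in range(num_nodes)]


def distributed_gradient(shards: Sequence[Sequence[Example]], w: float) -> float:
    """Coordinator step: combine per-node sums into the mean gradient over all data."""
    # Simulated sequentially here; on a cluster these calls would run in parallel.
    results = [partial_gradient(shard, w) for shard in shards]
    total = sum(g for g, _ in results)
    count = sum(n for _, n in results)
    return total / count


def train(data: Sequence[Example], num_nodes: int = 4,
          lr: float = 0.01, steps: int = 200) -> float:
    """Fit y = w * x by gradient descent using the sharded gradient."""
    shards = split_into_shards(data, num_nodes)
    w = 0.0
    for _ in range(steps):
        w -= lr * distributed_gradient(shards, w)
    return w


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 11)]  # synthetic data, true w = 3
    print(train(data))
```

Because each node returns a gradient sum together with its example count, the combined value equals the gradient computed over the full data set on a single node, so in this sketch partitioning changes where the work happens but not the result.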

Many systems exist for performing machine learning tasks in a distributed environment. These systems fall into three primary categories: database, general, and purpose-built systems. Each type of system has distinct advantages and disadvantages, and all are used in practice; the choice depends on the individual use case, performance requirements, input data size, and the implementation effort required.



Recommended Reading

  1. Apache Hadoop. http://hadoop.apache.org.
  2. Apache Mahout. http://mahout.apache.org.
  3. Crotty A, Galakatos A, Dursun K, Kraska T, Binnig C, Çetintemel U, Zdonik S. An architecture for compiling UDF-centric workflows. Proc VLDB Endow. 2015;8(12):1466–77.
  4. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation; 2004.
  5. Feng X, Kumar A, Recht B, Ré C. Towards a unified architecture for in-RDBMS analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data; 2012. p. 325–36.
  6. Message Passing Forum. MPI: a message-passing interface standard. Technical report, Knoxville; 1994.
  7. Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S. SystemML: declarative machine learning on MapReduce. In: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering; 2011. p. 231–42.
  8. Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intell Syst. 2009;24(2):8–12.
  9. Hellerstein JM, Ré C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A. The MADlib analytics library: or MAD skills, the SQL. Proc VLDB Endow. 2012;5(12):1700–11.
  10. Konda P, Kumar A, Ré C, Sashikanth V. Feature selection in enterprise analytics: a demonstration using an R-based data analytics system. Proc VLDB Endow. 2013;6(12):1306–9.
  11. Kraska T, Talwalkar A, Duchi JC, Griffith R, Franklin MJ, Jordan MI. MLbase: a distributed machine-learning system. In: Proceedings of the 6th Biennial Conference on Innovative Data Systems Research; 2013.
  12. Niu F, Recht B, Ré C, Wright SJ. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems 24, Proceedings of the 25th Annual Conference on Neural Information Processing Systems; 2011.
  13. Sujeeth AK, Lee H, Brown KJ, Chafi H, Wu M, Atreya AR, Olukotun K, Rompf T, Odersky M. OptiML: an implicitly parallel domain-specific language for machine learning. In: Proceedings of the 28th International Conference on Machine Learning; 2011.
  14. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation; 2008. p. 1–14.
  15. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation; 2012. p. 2.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Database Group, Brown University, Providence, USA
  2. Department of Computer Science, Brown University, Providence, USA