Synonyms
Data mining; Large-scale learning; Machine learning
Definition
Distributed machine learning refers to multi-node machine learning algorithms and systems that are designed to improve performance, increase accuracy, and scale to larger input data sizes. For many algorithms, increasing the input data size can significantly reduce the learning error, and doing so is often more effective than using more complex methods [8]. Distributed machine learning allows companies, researchers, and individuals to make informed decisions and draw meaningful conclusions from large amounts of data.
Many systems exist for performing machine learning tasks in a distributed environment. These systems fall into three primary categories: database systems, general systems, and purpose-built systems. Each type of system has distinct advantages and disadvantages, but all are used in practice, with the choice depending upon the individual use case, performance requirements, input data size, and the amount of implementation effort required.
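As a rough illustration of what a multi-node learning algorithm can look like, the sketch below simulates synchronous data-parallel gradient descent for a linear model. It is not taken from the entry or from any of the systems it surveys: the function names (`partial_gradient`, `distributed_step`, `train`), the toy data, and the choice of a squared-error objective are all assumptions made for the example, and the "nodes" are simply list partitions processed in one process.

```python
from typing import List, Tuple

Example = Tuple[List[float], float]  # (features, label)


def partial_gradient(weights: List[float], partition: List[Example]) -> List[float]:
    """Sum of squared-error gradients over one partition (one simulated node)."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def distributed_step(weights: List[float],
                     partitions: List[List[Example]],
                     learning_rate: float) -> List[float]:
    """One synchronous step: combine per-partition gradients, then update."""
    total = [0.0] * len(weights)
    count = 0
    for partition in partitions:  # each iteration stands in for one worker
        local = partial_gradient(weights, partition)
        total = [t + g for t, g in zip(total, local)]
        count += len(partition)
    # Averaging over all examples makes the result independent of the split.
    return [w - learning_rate * t / count for w, t in zip(weights, total)]


def train(partitions: List[List[Example]], num_features: int,
          learning_rate: float = 0.1, steps: int = 500) -> List[float]:
    weights = [0.0] * num_features
    for _ in range(steps):
        weights = distributed_step(weights, partitions, learning_rate)
    return weights


# Hypothetical toy data: y = 2*x + 1, with a constant bias feature of 1.0.
data = [([1.0, float(x)], 2.0 * x + 1.0) for x in range(-2, 3)]
partitions = [data[:2], data[2:]]  # two simulated nodes
weights = train(partitions, num_features=2)
single_node_weights = train([data], num_features=2)
```

Because the per-partition gradients are summed before the update, splitting the data across two simulated nodes yields the same model, up to floating-point rounding, as training on a single node. A real system would add communication, fault tolerance, and possibly asynchronous updates, none of which this sketch attempts.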
Historical...
Recommended Reading
Apache Hadoop. http://hadoop.apache.org.
Apache Mahout. http://mahout.apache.org.
Crotty A, Galakatos A, Dursun K, Kraska T, Binnig C, Çetintemel U, Zdonik S. An architecture for compiling UDF-centric workflows. Proc VLDB Endow. 2015;8(12):1466–77.
Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation; 2004.
Feng X, Kumar A, Recht B, Ré C. Towards a unified architecture for in-RDBMS analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data; 2012. p. 325–36.
Forum MP. MPI: a message-passing interface standard. Technical report, Knoxville; 1994.
Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S. SystemML: declarative machine learning on MapReduce. In: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering; 2011. p. 231–42.
Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intell Syst. 2009;24(2):8–12.
Hellerstein JM, Ré C, Schoppmann F, Wang DZ, Fratkin E, Gorajek A, Ng KS, Welton C, Feng X, Li K, Kumar A. The MADlib analytics library: or MAD skills, the SQL. Proc VLDB Endow. 2012;5(12):1700–11.
Konda P, Kumar A, Ré C, Sashikanth V. Feature selection in enterprise analytics: a demonstration using an R-based data analytics system. Proc VLDB Endow. 2013;6(12):1306–9.
Kraska T, Talwalkar A, Duchi JC, Griffith R, Franklin MJ, Jordan MI. MLbase: a distributed machine-learning system. In: Proceedings of the 6th Biennial Conference on Innovative Data Systems Research; 2013.
Niu F, Recht B, Ré C, Wright SJ. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems 24, Proceedings of the 25th Annual Conference on Neural Information Processing Systems; 2011.
Sujeeth AK, Lee H, Brown KJ, Chafi H, Wu M, Atreya AR, Olukotun K, Rompf T, Odersky M. OptiML: an implicitly parallel domain-specific language for machine learning. In: Proceedings of the 28th International Conference on Machine Learning; 2011.
Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation; 2008. p. 1–14.
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation; 2012. p. 2.
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Galakatos, A., Crotty, A., Kraska, T. (2018). Distributed Machine Learning. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_80647
DOI: https://doi.org/10.1007/978-1-4614-8265-9_80647
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering