Distributed Machine Learning
Data mining; Large-scale learning; Machine learning
Distributed machine learning refers to multi-node machine learning algorithms and systems that are designed to improve performance, increase accuracy, and scale to larger input data sizes. For many algorithms, increasing the input data size can significantly reduce the learning error and can often be more effective than using more complex methods. Distributed machine learning allows companies, researchers, and individuals to make informed decisions and draw meaningful conclusions from large amounts of data.
Many systems exist for performing machine learning tasks in a distributed environment. These systems fall into three primary categories: database, general, and purpose-built systems. Each type of system has distinct advantages and disadvantages, but all are used in practice depending on the individual use case, performance requirements, input data size, and the amount of implementation effort required.
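A common pattern underlying many such systems is data parallelism: the training data is partitioned across worker nodes, each worker computes a gradient on its own shard, and a driver averages the gradients to update a shared model. The sketch below (not from the source; the function names and the sequential simulation of workers are illustrative assumptions) shows this pattern for least-squares linear regression:

```python
import numpy as np

# Illustrative sketch of data-parallel gradient descent for least-squares
# linear regression. Each "worker" computes a gradient on its own data
# shard; a driver averages the gradients and updates the shared model.
# In a real distributed system each shard lives on a separate node; here
# the workers run sequentially for clarity.

def shard_gradient(w, X, y):
    """Gradient of the mean squared error on one worker's data shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(shards, dim, lr=0.1, steps=200):
    w = np.zeros(dim)
    for _ in range(steps):
        # Each worker computes a local gradient on its shard.
        grads = [shard_gradient(w, X, y) for X, y in shards]
        # The driver averages the local gradients and applies the update.
        w -= lr * np.mean(grads, axis=0)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(400, 2))
y = X @ true_w
# Partition the rows evenly across four simulated workers.
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = train(shards, dim=2)
```

Because the shards are equally sized, averaging the shard gradients here reproduces the full-batch gradient exactly; real systems trade away this exactness (e.g., via asynchronous or lock-free updates, as in Hogwild) to reduce communication and synchronization costs.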
- 1. Apache Hadoop. http://hadoop.apache.org.
- 2. Apache Mahout. http://mahout.apache.org.
- 4. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation; 2004.
- 5. Feng X, Kumar A, Recht B, Ré C. Towards a unified architecture for in-RDBMS analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data; 2012. p. 325–36.
- 6. Message Passing Interface Forum. MPI: a message-passing interface standard. Technical report, Knoxville; 1994.
- 7. Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y, Vaithyanathan S. SystemML: declarative machine learning on MapReduce. In: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering; 2011. p. 231–42.
- 11. Kraska T, Talwalkar A, Duchi JC, Griffith R, Franklin MJ, Jordan MI. MLbase: a distributed machine-learning system. In: Proceedings of the 6th Biennial Conference on Innovative Data Systems Research; 2013.
- 12. Niu F, Recht B, Ré C, Wright SJ. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Information Processing Systems 24, Proceedings of the 25th Annual Conference on Neural Information Processing Systems; 2011.
- 13. Sujeeth AK, Lee H, Brown KJ, Chafi H, Wu M, Atreya AR, Olukotun K, Rompf T, Odersky M. OptiML: an implicitly parallel domain-specific language for machine learning. In: Proceedings of the 28th International Conference on Machine Learning; 2011.
- 14. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation; 2008. p. 1–14.
- 15. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation; 2012. p. 2.