A formally based parallelization of data mining algorithms for multi-core systems
We describe a novel, systematic approach to efficiently parallelizing data mining algorithms: starting from the representation of an algorithm as a sequential composition of functions, we formally transform it into a parallel form, using higher-order functions to specify parallelism. We implement the approach as an extension of the industrial-strength Java-based library Xelopes, and we illustrate its use by developing a multi-threaded Java program for the popular naive Bayes classification algorithm. In contrast to the popular MapReduce programming model, our approach yields programs that support not only data-parallel but also task-parallel implementations, as well as a combination of both. Our experiments demonstrate efficient parallelization and good scalability on multi-core processors.
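As a rough illustration of the idea, the sketch below parallelizes the counting step of a discrete naive Bayes classifier using plain `java.util.concurrent`. It is not the Xelopes API and not the authors' implementation: the class `ParallelCountSketch`, the higher-order function `parallelFold`, and the row layout `{attributeValue, classLabel}` are all assumptions made for this example. The sequential algorithm is a fold over the data; the parallel form applies the same fold to disjoint chunks in separate threads and merges the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

// Hypothetical sketch; names and data layout are assumptions, not Xelopes code.
final class ParallelCountSketch {

    // Sequential building block: fold rows [from, to) into a table
    // counts[classLabel][attributeValue]. Each row is {attributeValue, classLabel}.
    static int[][] countChunk(int[][] rows, int from, int to,
                              int numClasses, int numValues) {
        int[][] counts = new int[numClasses][numValues];
        for (int i = from; i < to; i++) {
            counts[rows[i][1]][rows[i][0]]++;
        }
        return counts;
    }

    // Combine two partial tables by element-wise addition into a new table.
    static int[][] merge(int[][] a, int[][] b) {
        int[][] sum = new int[a.length][];
        for (int c = 0; c < a.length; c++) {
            sum[c] = a[c].clone();
            for (int v = 0; v < b[c].length; v++) {
                sum[c][v] += b[c][v];
            }
        }
        return sum;
    }

    // Higher-order function (assumed for this sketch): splits the index range
    // [0, n) into `parts` chunks, applies `foldRange` to each chunk in its own
    // thread, and merges the partial results with `combine`.
    static <R> R parallelFold(int n, int parts,
                              BiFunction<Integer, Integer, R> foldRange,
                              BinaryOperator<R> combine) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        List<Callable<R>> tasks = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            final int from = (int) ((long) p * n / parts);
            final int to = (int) ((long) (p + 1) * n / parts);
            tasks.add(() -> foldRange.apply(from, to));
        }
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            R result = null;
            for (Future<R> f : pool.invokeAll(tasks)) {
                R part = f.get();
                result = (result == null) ? part : combine.apply(result, part);
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel fold failed", e);
        } finally {
            pool.shutdown();
        }
    }

    // Data-parallel version of the counting step.
    static int[][] countParallel(int[][] rows, int numClasses,
                                 int numValues, int threads) {
        return parallelFold(rows.length, threads,
                (from, to) -> countChunk(rows, from, to, numClasses, numValues),
                ParallelCountSketch::merge);
    }
}
```

Calling `ParallelCountSketch.countParallel(rows, numClasses, numValues, threads)` should return the same table as a single call to `countChunk` over the whole data set, because the merge is an element-wise sum and the chunks are disjoint. Each thread writes only to its own local table, so the sketch needs no locking; the cost is one extra table per chunk plus the final merge.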
Keywords: Parallel algorithms, Data mining, Parallel data mining, Program transformation, Functional programming, Parallel programming
This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research,” task 2.6113.2017/6.7, and by the German Ministry of Education and Research (BMBF) in the framework of the HPC2SE project at the University of Muenster.