
The Journal of Supercomputing, Volume 75, Issue 12, pp 7909–7920

A formally based parallelization of data mining algorithms for multi-core systems

  • Ivan Kholod
  • Andrey Shorov
  • Evgenii Titkov
  • Sergei Gorlatch

Abstract

We describe a novel, systematic approach to efficiently parallelizing data mining algorithms: starting with the representation of an algorithm as a sequential composition of functions, we formally transform it into a parallel form using higher-order functions for specifying parallelism. We implement the approach as an extension of the industrial-strength Java-based library Xelopes, and we illustrate its use by developing a multi-threaded Java program for the popular naive Bayes classification algorithm. In comparison with the popular MapReduce programming model, our resulting programs enable not only data-parallel, but also task-parallel implementation and a combination of both. Our experiments demonstrate an efficient parallelization and good scalability on multi-core processors.
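The idea in the abstract can be illustrated with a minimal sketch. It assumes a naive Bayes training phase that only counts class and attribute-value occurrences, and it shows the data-parallel case only. It is not the Xelopes API or the authors' implementation: the names `ParallelCompositionSketch`, `Model`, `update`, `sequential`, and `parallel` are hypothetical. The sequential form applies one step function to each vector in turn. The parallel form is a higher-order function that takes the same step, runs it on disjoint chunks in separate threads, and merges the partial models.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

// Illustrative only: all names here are hypothetical, not part of Xelopes.
class ParallelCompositionSketch {

    /** Occurrence counts that a naive Bayes classifier would later turn into probabilities. */
    static final class Model {
        final Map<String, Integer> counts = new HashMap<>();

        /** Adds the counts of another partial model into this one. */
        Model merge(Model other) {
            other.counts.forEach((k, v) -> counts.merge(k, v, Integer::sum));
            return this;
        }
    }

    /** One step of the composition: fold a single vector (last column = class) into the model. */
    static Model update(Model m, String[] row) {
        String cls = row[row.length - 1];
        m.counts.merge("class=" + cls, 1, Integer::sum);
        for (int a = 0; a < row.length - 1; a++) {
            m.counts.merge(cls + "|" + a + "=" + row[a], 1, Integer::sum);
        }
        return m;
    }

    /** Sequential form: the step is applied to every vector, one after another. */
    static Model sequential(List<String[]> data) {
        Model m = new Model();
        for (String[] row : data) {
            update(m, row);
        }
        return m;
    }

    /** Higher-order function: runs the given step on disjoint chunks in parallel, then merges. */
    static Model parallel(List<String[]> data, int threads,
                          BiFunction<Model, String[], Model> step) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.size() + threads - 1) / threads; // ceiling division
            List<Future<Model>> parts = new ArrayList<>();
            for (int start = 0; start < data.size(); start += chunk) {
                List<String[]> slice =
                        data.subList(start, Math.min(start + chunk, data.size()));
                parts.add(pool.submit(() -> {
                    Model local = new Model(); // thread-private, so no locking is needed
                    for (String[] row : slice) {
                        step.apply(local, row);
                    }
                    return local;
                }));
            }
            Model result = new Model();
            for (Future<Model> part : parts) {
                result.merge(part.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
                new String[]{"sunny", "hot", "no"},
                new String[]{"rainy", "mild", "yes"});
        Model seq = sequential(data);
        Model par = parallel(data, 2, ParallelCompositionSketch::update);
        System.out.println("Models agree: " + seq.counts.equals(par.counts));
    }
}
```

Merging is valid here because the counts are combined by addition, which is associative and commutative, so the order in which the chunks finish does not affect the result. A task-parallel variant, which the abstract also mentions, is not shown in this sketch.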

Keywords

Parallel algorithms, Data mining, Parallel data mining, Program transformation, Functional programming, Parallel programming

Notes

Acknowledgements

This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research,” task 2.6113.2017/6.7, and by the German Ministry of Education and Research (BMBF) in the framework of the HPC2SE project at the University of Muenster.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Saint Petersburg Electrotechnical University “LETI”, Saint Petersburg, Russia
  2. University of Muenster, Münster, Germany
