A Functional Approach to Parallelizing Data Mining Algorithms in Java

  • Ivan Kholod
  • Andrey Shorov
  • Sergei Gorlatch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10421)


We describe a new approach to parallelizing data mining algorithms. We use the representation of an algorithm as a sequence of functions and we use higher-order functions to express parallel execution. Our approach generalizes the popular MapReduce programming model by enabling not only data-parallel, but also task-parallel implementation and a combination of both. We implement our approach as an extension of the industrial-strength library Xelopes, and we illustrate it by developing a multi-threaded Java program for the 1R classification algorithm, with experiments on a multi-core processor.
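The idea of expressing parallel execution through higher-order functions can be sketched in plain Java. The following is a minimal illustration only, not the Xelopes extension described in the paper: the class `FunctionalParallelSketch` and the methods `parallelData`, `parallelTasks`, `runAll`, `count` and `mergeCounts` are hypothetical names chosen for this example. It assumes that each step of an algorithm is a function producing a partial model and that partial models can be combined by an associative merge function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the paper or from Xelopes.
class FunctionalParallelSketch {

    // Runs all tasks on a thread pool and folds their results with `merge`.
    // Returns null if there are no tasks.
    static <M> M runAll(List<Callable<M>> tasks, BinaryOperator<M> merge) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            M result = null;
            for (Future<M> future : pool.invokeAll(tasks)) {
                M partial = future.get();
                result = (result == null) ? partial : merge.apply(result, partial);
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Data parallelism: the same function f is applied to each chunk of the data.
    // Assumes parts >= 1.
    static <D, M> M parallelData(List<D> data, int parts,
                                 Function<List<D>, M> f, BinaryOperator<M> merge) {
        List<Callable<M>> tasks = new ArrayList<>();
        int chunk = (data.size() + parts - 1) / parts; // ceiling division
        for (int from = 0; from < data.size(); from += chunk) {
            List<D> slice = data.subList(from, Math.min(from + chunk, data.size()));
            tasks.add(() -> f.apply(slice));
        }
        return runAll(tasks, merge);
    }

    // Task parallelism: different functions are applied to the same data.
    static <D, M> M parallelTasks(D data, List<Function<D, M>> fs, BinaryOperator<M> merge) {
        List<Callable<M>> tasks = new ArrayList<>();
        for (Function<D, M> f : fs) {
            tasks.add(() -> f.apply(data));
        }
        return runAll(tasks, merge);
    }

    // Example step: count how often each class label occurs.
    static Map<String, Integer> count(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    // Example merge: add two count tables entry by entry.
    static Map<String, Integer> mergeCounts(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Integer::sum));
        return out;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("yes", "no", "yes", "yes", "no");
        Map<String, Integer> counts = FunctionalParallelSketch.<String, Map<String, Integer>>parallelData(
                labels, 2, FunctionalParallelSketch::count, FunctionalParallelSketch::mergeCounts);
        System.out.println(counts);
    }
}
```

In this sketch `parallelData` plays the role of a MapReduce-style step (map over chunks, then reduce with `mergeCounts`), while `parallelTasks` shows the task-parallel case that MapReduce alone does not cover. Both can be nested, which is one way to picture the combination of the two forms mentioned above.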


Parallel algorithms, Data mining, Parallel data mining, Multithreads, Multi-core processors, MapReduce, Homomorphisms



This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research”, task #2.6113.2017/BУ, and by the German Research Agency (DFG) in the framework of the Cluster of Excellence Cells-in-Motion at the University of Muenster.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Saint Petersburg Electrotechnical University “LETI”, Saint Petersburg, Russia
  2. University of Muenster, Muenster, Germany
