Abstract
We describe a new approach to parallelizing data mining algorithms. We represent an algorithm as a sequence of functions and use higher-order functions to express parallel execution. Our approach generalizes the popular MapReduce programming model by enabling not only data-parallel but also task-parallel implementations, as well as combinations of both. We implement our approach as an extension of the industrial-strength library Xelopes, and we illustrate it by developing a multi-threaded Java program for the 1R classification algorithm, with experiments on a multi-core processor.
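To make the idea more concrete, the following is a minimal sketch of how a higher-order function could express task-parallel execution of 1R in plain Java. It is not the Xelopes extension described in the chapter: the names `OneRSketch`, `Rule`, `parallel`, `buildRule`, `train`, and `sample` are hypothetical, and the tiny dataset is invented for illustration. The sketch assumes that each row is a `String[]` whose last column is the class label. Here `buildRule` is a sequence of two steps (counting, then choosing majority classes), and `parallel` runs it once per attribute in a thread pool and merges the results.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;
import java.util.stream.*;

class OneRSketch {
    /** Hypothetical result type: a one-attribute rule and its training error count. */
    static final class Rule {
        final int attribute;
        final Map<String, String> prediction; // attribute value -> predicted class
        final long errors;
        Rule(int attribute, Map<String, String> prediction, long errors) {
            this.attribute = attribute;
            this.prediction = prediction;
            this.errors = errors;
        }
    }

    /**
     * Hypothetical higher-order function: applies f to every task in a thread pool
     * and folds the results left to right with merge. Returns null for an empty list.
     */
    static <T, R> R parallel(List<T> tasks, Function<T, R> f,
                             BinaryOperator<R> merge, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T t : tasks) {
                Callable<R> call = () -> f.apply(t);
                futures.add(pool.submit(call));
            }
            R result = null;
            for (Future<R> future : futures) {
                R r = future.get();
                result = (result == null) ? r : merge.apply(result, r);
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    /** Builds the 1R rule for one attribute: count, then pick the majority class per value. */
    static Rule buildRule(List<String[]> data, int attr) {
        int cls = data.get(0).length - 1; // assumption: class label is the last column
        // Step 1: count occurrences of each (attribute value, class) pair.
        Map<String, Map<String, Long>> counts = new TreeMap<>();
        for (String[] row : data) {
            counts.computeIfAbsent(row[attr], k -> new TreeMap<>())
                  .merge(row[cls], 1L, Long::sum);
        }
        // Step 2: predict the majority class for each value and sum the misclassified rows.
        Map<String, String> prediction = new TreeMap<>();
        long errors = 0;
        for (Map.Entry<String, Map<String, Long>> e : counts.entrySet()) {
            Map.Entry<String, Long> best =
                Collections.max(e.getValue().entrySet(), Map.Entry.comparingByValue());
            prediction.put(e.getKey(), best.getKey());
            long total = e.getValue().values().stream().mapToLong(Long::longValue).sum();
            errors += total - best.getValue();
        }
        return new Rule(attr, prediction, errors);
    }

    /** Task-parallel 1R: one task per attribute; the merge keeps the rule with fewer errors. */
    static Rule train(List<String[]> data, int threads) throws Exception {
        int attrs = data.get(0).length - 1;
        List<Integer> indices = IntStream.range(0, attrs).boxed().collect(Collectors.toList());
        return OneRSketch.<Integer, Rule>parallel(indices, a -> buildRule(data, a),
                (x, y) -> y.errors < x.errors ? y : x, threads);
    }

    /** Invented toy dataset: outlook, windy, play (class). */
    static List<String[]> sample() {
        return Arrays.asList(
            new String[] {"sunny", "false", "no"},
            new String[] {"sunny", "true", "no"},
            new String[] {"overcast", "false", "yes"},
            new String[] {"overcast", "true", "yes"},
            new String[] {"rainy", "false", "yes"},
            new String[] {"rainy", "true", "no"});
    }

    public static void main(String[] args) throws Exception {
        Rule r = train(sample(), 2);
        System.out.println("attribute " + r.attribute + " " + r.prediction
                + " errors=" + r.errors);
    }
}
```

The same `parallel` function could, under the same assumptions, express a data-parallel variant by passing partitions of the rows as tasks and a merge that adds up the partial count tables; combining both would nest one call inside the other.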
Acknowledgments
This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research”, task #2.6113.2017/BУ, and by the German Research Agency (DFG) in the framework of the Cluster of Excellence Cells-in-Motion at the University of Muenster.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kholod, I., Shorov, A., Gorlatch, S. (2017). A Functional Approach to Parallelizing Data Mining Algorithms in Java. In: Malyshkin, V. (ed.) Parallel Computing Technologies. PaCT 2017. Lecture Notes in Computer Science, vol. 10421. Springer, Cham. https://doi.org/10.1007/978-3-319-62932-2_44
DOI: https://doi.org/10.1007/978-3-319-62932-2_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62931-5
Online ISBN: 978-3-319-62932-2
eBook Packages: Computer Science, Computer Science (R0)