A Functional Approach to Parallelizing Data Mining Algorithms in Java

Kholod, Ivan; Shorov, Andrey; Gorlatch, Sergei

doi:10.1007/978-3-319-62932-2_44

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10421)

Included in the following conference series:

International Conference on Parallel Computing Technologies

Abstract

We describe a new approach to parallelizing data mining algorithms. We represent an algorithm as a sequence of functions and use higher-order functions to express its parallel execution. Our approach generalizes the popular MapReduce programming model by enabling not only data-parallel but also task-parallel implementations, as well as combinations of both. We implement our approach as an extension of the industrial-strength library Xelopes and illustrate it by developing a multi-threaded Java program for the 1R classification algorithm, with experiments on a multi-core processor.
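
To make the idea in the abstract concrete, the following is a minimal sketch of 1R written with plain Java parallel streams. It is not the authors' implementation and does not use the Xelopes API. The names `OneRSketch`, `Rule`, `parallelTasks`, `ruleFor` and `train` are hypothetical, and the sketch assumes that all attributes are nominal strings. The higher-order function `parallelTasks` runs one task per attribute (task parallelism), while the counting inside each task runs over the rows in parallel (data parallelism).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration only; not the Xelopes extension described in the paper.
class OneRSketch {

    /** A 1R rule for one attribute: value -> predicted class, plus training errors. */
    static final class Rule {
        final int attribute;
        final Map<String, String> prediction;
        final long errors;

        Rule(int attribute, Map<String, String> prediction, long errors) {
            this.attribute = attribute;
            this.prediction = prediction;
            this.errors = errors;
        }
    }

    /** Higher-order function: evaluates task(0..n-1) in parallel, keeping index order. */
    static <R> List<R> parallelTasks(int n, IntFunction<R> task) {
        return IntStream.range(0, n).parallel().mapToObj(task).collect(Collectors.toList());
    }

    /** Builds the rule for one attribute; class counts are gathered data-parallel over rows. */
    static Rule ruleFor(int attr, int classIdx, List<String[]> rows) {
        Map<String, Map<String, Long>> counts = rows.parallelStream().collect(
                Collectors.groupingBy(r -> r[attr],
                        Collectors.groupingBy(r -> r[classIdx], Collectors.counting())));
        Map<String, String> prediction = new TreeMap<>();
        long errors = 0;
        for (Map.Entry<String, Map<String, Long>> e : counts.entrySet()) {
            long total = 0;
            long best = -1;
            String bestClass = null;
            for (Map.Entry<String, Long> c : e.getValue().entrySet()) {
                long n = c.getValue();
                total += n;
                // Majority class wins; ties go to the lexicographically smaller class name.
                boolean better = n > best || (n == best && c.getKey().compareTo(bestClass) < 0);
                if (better) {
                    best = n;
                    bestClass = c.getKey();
                }
            }
            prediction.put(e.getKey(), bestClass);
            errors += total - best; // rows not covered by the majority class
        }
        return new Rule(attr, prediction, errors);
    }

    /** Trains 1R over columns 0..classIdx-1 and returns the rule with the fewest errors. */
    static Rule train(List<String[]> rows, int classIdx) {
        List<Rule> candidates = parallelTasks(classIdx, a -> ruleFor(a, classIdx, rows));
        return candidates.stream().min(Comparator.comparingLong((Rule r) -> r.errors)).get();
    }

    public static void main(String[] args) {
        // Toy data (made up for illustration): outlook, windy, play
        List<String[]> rows = Arrays.asList(
                new String[] {"sunny", "false", "no"},
                new String[] {"sunny", "true", "no"},
                new String[] {"overcast", "false", "yes"},
                new String[] {"overcast", "true", "yes"},
                new String[] {"rainy", "false", "yes"},
                new String[] {"rainy", "true", "no"});
        Rule best = train(rows, 2);
        System.out.println("attribute " + best.attribute + ", errors " + best.errors
                + ", rule " + best.prediction);
    }
}
```

Both levels of parallelism here share Java's common fork-join pool. A MapReduce-style formulation would correspond to the row-level counting alone, whereas the per-attribute tasks illustrate the additional task-parallel dimension mentioned in the abstract.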

Acknowledgments

This work was supported by the Ministry of Education and Science of the Russian Federation in the framework of the state order “Organization of Scientific Research”, task #2.6113.2017/BУ, and by the German Research Agency (DFG) in the framework of the Cluster of Excellence Cells-in-Motion at the University of Muenster.

Author information

Authors and Affiliations

Saint Petersburg Electrotechnical University “LETI”, Saint Petersburg, Russia
Ivan Kholod & Andrey Shorov
University of Muenster, Muenster, Germany
Sergei Gorlatch

Corresponding author

Correspondence to Ivan Kholod.

Editor information

Editors and Affiliations

Russian Academy of Sciences, Novosibirsk, Russia
Victor Malyshkin

About this paper

Cite this paper

Kholod, I., Shorov, A., Gorlatch, S. (2017). A Functional Approach to Parallelizing Data Mining Algorithms in Java. In: Malyshkin, V. (eds) Parallel Computing Technologies. PaCT 2017. Lecture Notes in Computer Science, vol 10421. Springer, Cham. https://doi.org/10.1007/978-3-319-62932-2_44

DOI: https://doi.org/10.1007/978-3-319-62932-2_44
Published: 29 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62931-5
Online ISBN: 978-3-319-62932-2
eBook Packages: Computer Science, Computer Science (R0)