Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop

Nowicki, Marek; Ryczkowska, Magdalena; Górski, Łukasz; Bala, Piotr

doi:10.1007/978-3-319-78054-2_30

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10778)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

This article presents Big Data analytics using Java and the PCJ library. PCJ is an award-winning library for developing parallel codes in the PGAS programming paradigm. It allows easy implementation of a variety of algorithms, including those used for Big Data processing. In this paper, we present performance results for standard benchmarks covering different types of applications, from computationally intensive, through traditional map-reduce, up to communication intensive. The performance is compared to that achieved on the same hardware using Hadoop. The PCJ implementation has been used with both a local file system and HDFS. Code written with PCJ can be developed much faster because it depends on fewer libraries. Our results show that applications developed with the PCJ library are much faster than their Hadoop implementations.
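As an illustration of the "traditional map-reduce" category of benchmark mentioned in the abstract, the following is a minimal sketch of a partitioned word count in plain Java. It uses neither the PCJ nor the Hadoop API, and it is not taken from the paper: the class name `WordCountSketch`, its method names, and the choice of word count as the example workload are all assumptions made for illustration. Each worker counts words in its own slice of the input (the map step), and the partial results are then merged (the reduce step).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical illustration only: plain Java, not the PCJ or Hadoop API.
public class WordCountSketch {

    // Map step: count the words found in one partition of the input.
    public static Map<String, Long> countPartition(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    // Reduce step: merge the per-partition counts into a single map.
    public static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    // Splits the input into `workers` contiguous partitions, counts each
    // partition in parallel, then merges the partial results.
    public static Map<String, Long> wordCount(List<String> lines, int workers) {
        int chunk = (lines.size() + workers - 1) / workers; // ceiling division
        List<Map<String, Long>> partials = IntStream.range(0, workers)
                .parallel()
                .mapToObj(w -> {
                    int from = Math.min(w * chunk, lines.size());
                    int to = Math.min(from + chunk, lines.size());
                    return countPartition(lines.subList(from, to));
                })
                .collect(Collectors.toList());
        return merge(partials);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "big data in Java", "Java and big clusters", "data data");
        System.out.println(wordCount(lines, 2));
    }
}
```

In a PGAS-style program each worker would typically own its partition and its partial result, and the merge would be expressed as communication between workers; here the merge is a sequential loop purely to keep the sketch short.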

Acknowledgments

The authors would like to thank the CHIST-ERA consortium for financial support under the HPDCJ project. The Polish contribution is financed through NCN grant 2014/14/Z/ST6/00007. The performance tests were performed using the computational facilities of ICM, University of Warsaw.

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Chopina 12/18, 87-100, Torun, Poland
Marek Nowicki, Magdalena Ryczkowska & Łukasz Górski
Interdisciplinary Centre for Mathematical and Computational Modeling, University of Warsaw, Pawinskiego 5a, 02-106, Warsaw, Poland
Magdalena Ryczkowska, Łukasz Górski & Piotr Bala

Corresponding author

Correspondence to Piotr Bala.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra
University of Southern California, Marina Del Rey, California, USA
Ewa Deelman
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

Cite this paper

Nowicki, M., Ryczkowska, M., Górski, Ł., Bala, P. (2018). Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2017. Lecture Notes in Computer Science, vol 10778. Springer, Cham. https://doi.org/10.1007/978-3-319-78054-2_30

DOI: https://doi.org/10.1007/978-3-319-78054-2_30
Published: 23 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78053-5
Online ISBN: 978-3-319-78054-2
eBook Packages: Computer Science, Computer Science (R0)
