Abstract
This article presents Big Data analytics in Java using the PCJ library. PCJ is an award-winning library for developing parallel codes in the PGAS (Partitioned Global Address Space) programming paradigm. It allows easy implementation of a variety of algorithms, including those used for Big Data processing. We present performance results for standard benchmarks covering different types of applications, from computationally intensive, through traditional map-reduce, to communication intensive. The performance is compared with that achieved on the same hardware using Hadoop. The PCJ implementation has been used with both the local file system and HDFS. Code written with PCJ can be developed much faster, as it requires fewer libraries. Our results show that applications developed with the PCJ library are much faster than their Hadoop implementations.
Acknowledgments
The authors would like to thank the CHIST-ERA consortium for financial support under the HPDCJ project. The Polish contribution is financed through NCN grant 2014/14/Z/ST6/00007. The performance tests were carried out using the computational facilities of ICM, University of Warsaw.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Nowicki, M., Ryczkowska, M., Górski, Ł., Bala, P. (2018). Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2017. Lecture Notes in Computer Science, vol 10778. Springer, Cham. https://doi.org/10.1007/978-3-319-78054-2_30
DOI: https://doi.org/10.1007/978-3-319-78054-2_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78053-5
Online ISBN: 978-3-319-78054-2
eBook Packages: Computer Science, Computer Science (R0)