
Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10778)

Abstract

This article presents Big Data analytics using Java and the PCJ library. PCJ is an award-winning library for developing parallel codes in the PGAS (Partitioned Global Address Space) programming paradigm. It allows easy implementation of a variety of algorithms, including those used for Big Data processing. We present performance results for standard benchmarks covering different types of applications, from computationally intensive, through traditional map-reduce, up to communication intensive. The performance is compared with that achieved on the same hardware using Hadoop. The PCJ implementation has been used with both the local file system and HDFS. Code written with PCJ can be developed much faster because it requires fewer libraries. Our results show that applications developed with the PCJ library are much faster than their Hadoop implementations.



Acknowledgments

The authors would like to thank the CHIST-ERA consortium for financial support under the HPDCJ project. The Polish contribution is financed through NCN grant 2014/14/Z/ST6/00007. The performance tests were performed using the computational facilities of ICM, University of Warsaw.


Corresponding author

Correspondence to Piotr Bala.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Nowicki, M., Ryczkowska, M., Górski, Ł., Bala, P. (2018). Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2017. Lecture Notes in Computer Science, vol 10778. Springer, Cham. https://doi.org/10.1007/978-3-319-78054-2_30


  • DOI: https://doi.org/10.1007/978-3-319-78054-2_30


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-78053-5

  • Online ISBN: 978-3-319-78054-2

  • eBook Packages: Computer Science, Computer Science (R0)
