Optimum Parallelism in Spark Framework on Hadoop YARN for Maximum Cluster Resource Utilization

  • Conference paper
  • In: First International Conference on Sustainable Technologies for Computational Intelligence

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1045))

Abstract

Spark is widely used as a distributed computing framework for in-memory parallel processing. It distributes computation by splitting jobs into tasks and deploying them on executors on the nodes of a cluster. An executor is a JVM with a dedicated allocation of CPU cores and memory, and one or more executors are deployed on each node depending on resource availability. The number of tasks depends on the number of partitions of the input data; each task runs as an independent thread on an executor, occupying one or more of the executor's CPU cores. The performance advantage provided by distributed computing on the Spark framework therefore depends on the level of parallelism configured at three levels, namely node level, executor level, and task level, and the parallelism at each level should be configured to fully utilize the available computing resources. This paper recommends optimum parallelism configurations for the Apache Spark framework deployed on a Hadoop YARN cluster. The recommendations are based on experiments that evaluate how the parallelism at each of these levels affects the performance of Spark applications. For the evaluation, a CPU-intensive job and an I/O-intensive job are used, and performance is measured while varying the parallelism at each of the three levels. The results presented in this paper help Spark users select the optimum parallelism at each level, achieving maximum performance for Spark jobs through maximum resource utilization.
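The three levels described above map onto a handful of spark-submit settings: node-level parallelism to the number of executors placed per node, executor-level parallelism to `--executor-cores` and `--executor-memory`, and task-level parallelism to `spark.default.parallelism`. The sketch below shows one way such a configuration could be derived from cluster resources. The specific constants (one core and 1 GB reserved per node for OS and Hadoop daemons, five cores per executor, roughly 7% memory set aside for YARN overhead, two tasks per core) are common community rules of thumb assumed for illustration, not the measured recommendations of this paper, and the function name is hypothetical.

```python
# A minimal sizing sketch for Spark-on-YARN parallelism.
# All constants below are assumed rules of thumb, not this paper's results.

def suggest_spark_parallelism(nodes, cores_per_node, mem_per_node_gb,
                              cores_per_executor=5):
    """Return a hypothetical spark-submit configuration for a YARN cluster."""
    # Reserve one core (and 1 GB) per node for the OS and Hadoop daemons.
    usable_cores = nodes * (cores_per_node - 1)
    # Give up one executor slot to the YARN ApplicationMaster.
    num_executors = usable_cores // cores_per_executor - 1
    executors_per_node = max(1, num_executors // nodes)
    # Leave roughly 7% of each executor's memory for YARN memoryOverhead.
    executor_mem_gb = int((mem_per_node_gb - 1) // executors_per_node * 0.93)
    # Task-level parallelism: about two tasks per allocated core.
    default_parallelism = num_executors * cores_per_executor * 2
    return {
        "--num-executors": num_executors,
        "--executor-cores": cores_per_executor,
        "--executor-memory": f"{executor_mem_gb}g",
        "spark.default.parallelism": default_parallelism,
    }

# Example: a 10-node cluster with 16 cores and 64 GB RAM per node.
conf = suggest_spark_parallelism(nodes=10, cores_per_node=16, mem_per_node_gb=64)
print(conf)
```

For the example cluster this yields 29 executors of 5 cores each, so that node-, executor-, and task-level settings are chosen together rather than independently, which is the point the experiments in the paper quantify.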




Author information

Correspondence to P. S. Janardhanan.


Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Janardhanan, P.S., Samuel, P. (2020). Optimum Parallelism in Spark Framework on Hadoop YARN for Maximum Cluster Resource Utilization. In: Luhach, A., Kosa, J., Poonia, R., Gao, X.Z., Singh, D. (eds) First International Conference on Sustainable Technologies for Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1045. Springer, Singapore. https://doi.org/10.1007/978-981-15-0029-9_28
