Abstract
Spark is widely used as a distributed computing framework for in-memory parallel processing. It distributes computation by splitting jobs into tasks and deploying them on executors running on the nodes of a cluster. Executors are JVMs with dedicated allocations of CPU cores and memory, and one or more executors are deployed on each node depending on resource availability. The number of tasks depends on the number of partitions of the input data, and each task is assigned one or more of the CPU cores allocated to its executor. Tasks run as independent threads within the executor's JVM. The performance advantage of distributed computing on the Spark framework depends on the level of parallelism configured at three levels: node level, executor level, and task level. The parallelism at each level should be configured to fully utilize the available computing resources. This paper recommends optimum parallelism configurations for the Apache Spark framework deployed on a Hadoop YARN cluster. The recommendations are based on experiments conducted to evaluate how the parallelism at each of these levels affects the performance of Spark applications. For the evaluation, one CPU-intensive job and one I/O-intensive job are used, and performance is measured while varying the parallelism at each of the three levels. The results presented in this paper help Spark users select the optimum parallelism at each level to achieve maximum performance for Spark jobs through maximum resource utilization.
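The relationship between the three levels of parallelism described above can be sketched with simple arithmetic. The sketch below uses purely illustrative cluster numbers (not taken from the paper's experiments) to show how node-level, executor-level, and task-level settings combine into the number of concurrently runnable tasks:

```python
# Hypothetical cluster sizing; all numbers are illustrative assumptions.
nodes = 4
cores_per_node = 16            # node-level resource (cores available per node)
executors_per_node = 4         # executor-level parallelism (executors per node)
cores_per_executor = cores_per_node // executors_per_node   # 4 cores each
cores_per_task = 1             # task-level parallelism (cores given to one task)

# Maximum number of tasks that can run concurrently across the cluster:
# each executor can run (cores_per_executor // cores_per_task) tasks at once.
task_slots = nodes * executors_per_node * (cores_per_executor // cores_per_task)
print(task_slots)  # 64

# To keep all slots busy, the input data should be split into at least
# this many partitions, since the number of tasks follows the partitions.
min_partitions = task_slots
```

With these illustrative settings, fewer than 64 input partitions would leave some task slots idle, which is the kind of under-utilization the paper's recommendations aim to avoid.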
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Janardhanan, P.S., Samuel, P. (2020). Optimum Parallelism in Spark Framework on Hadoop YARN for Maximum Cluster Resource Utilization. In: Luhach, A., Kosa, J., Poonia, R., Gao, XZ., Singh, D. (eds) First International Conference on Sustainable Technologies for Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1045. Springer, Singapore. https://doi.org/10.1007/978-981-15-0029-9_28
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0028-2
Online ISBN: 978-981-15-0029-9
eBook Packages: Intelligent Technologies and Robotics (R0)