Abstract
Spark is widely used as a distributed computing framework for in-memory parallel processing. It distributes computation by splitting jobs into tasks and deploying them on executors running on the nodes of a cluster. Executors are JVMs with dedicated allocations of CPU cores and memory, and one or more executors are deployed on each node depending on resource availability. The number of tasks depends on the number of partitions of the input data, and each task is assigned one or more of the CPU cores allocated to its executor. Tasks run as independent threads within the executor's JVM. The performance advantage of distributed computing on the Spark framework depends on the level of parallelism configured at three levels: node level, executor level, and task level. The parallelism at each level should be configured to fully utilize the available computing resources. This paper recommends optimum parallelism configurations for the Apache Spark framework deployed on a Hadoop YARN cluster. The recommendations are based on experiments conducted to evaluate how the parallelism at each of these levels affects the performance of Spark applications. For the evaluation, one CPU-intensive job and one I/O-intensive job are used, and performance is measured while varying the parallelism at each of the three levels. The results presented in this paper help Spark users select the optimum parallelism at each level to achieve maximum performance for Spark jobs through maximum resource utilization.
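The relationship between the three levels of parallelism described above can be sketched with simple arithmetic. The sketch below uses purely illustrative cluster numbers (not taken from the paper's experiments) to show how node-level, executor-level, and task-level settings combine into the number of concurrently runnable tasks:

```python
# Hypothetical cluster sizing; all numbers are illustrative assumptions.
nodes = 4
cores_per_node = 16            # node-level resource (cores available per node)
executors_per_node = 4         # executor-level parallelism (executors per node)
cores_per_executor = cores_per_node // executors_per_node   # 4 cores each
cores_per_task = 1             # task-level parallelism (cores given to one task)

# Maximum number of tasks that can run concurrently across the cluster:
# each executor can run (cores_per_executor // cores_per_task) tasks at once.
task_slots = nodes * executors_per_node * (cores_per_executor // cores_per_task)
print(task_slots)  # 64

# To keep all slots busy, the input data should be split into at least
# this many partitions, since the number of tasks follows the partitions.
min_partitions = task_slots
```

With these illustrative settings, fewer than 64 input partitions would leave some task slots idle, which is the kind of under-utilization the paper's recommendations aim to avoid.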
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
Cite this paper
Janardhanan, P.S., Samuel, P. (2020). Optimum Parallelism in Spark Framework on Hadoop YARN for Maximum Cluster Resource Utilization. In: Luhach, A., Kosa, J., Poonia, R., Gao, XZ., Singh, D. (eds) First International Conference on Sustainable Technologies for Computational Intelligence. Advances in Intelligent Systems and Computing, vol 1045. Springer, Singapore. https://doi.org/10.1007/978-981-15-0029-9_28
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0028-2
Online ISBN: 978-981-15-0029-9
eBook Packages: Intelligent Technologies and Robotics (R0)