Abstract
This article presents an approach to optimizing Apache Spark programs by reducing the number of transformations with wide dependencies and, consequently, the number of data shuffles. This is achieved by chaining sequential data-processing algorithms on common key fields, and by grouping the data stored in resilient distributed structures (Spark SQL Datasets) according to the keys on which the processing takes place.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Popov, M., Drobintsev, P.D. (2020). Data Shuffling Minimizing Approach for Apache Spark Programs. In: Arseniev, D., Overmeyer, L., Kälviäinen, H., Katalinić, B. (eds) Cyber-Physical Systems and Control. CPS&C 2019. Lecture Notes in Networks and Systems, vol 95. Springer, Cham. https://doi.org/10.1007/978-3-030-34983-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34982-0
Online ISBN: 978-3-030-34983-7