Abstract
This article presents an approach to optimizing Apache Spark programs by reducing the number of transformations with wide dependencies and, consequently, the number of data shuffles. This is achieved by chaining sequential data-processing algorithms on common key fields, and by grouping the data stored in resilient distributed structures (Spark SQL Datasets) according to the keys on which the processing takes place.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Popov, M., Drobintsev, P.D. (2020). Data Shuffling Minimizing Approach for Apache Spark Programs. In: Arseniev, D., Overmeyer, L., Kälviäinen, H., Katalinić, B. (eds) Cyber-Physical Systems and Control. CPS&C 2019. Lecture Notes in Networks and Systems, vol 95. Springer, Cham. https://doi.org/10.1007/978-3-030-34983-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34982-0
Online ISBN: 978-3-030-34983-7