Abstract
In this chapter we give an introduction to Apache Spark, a Big Data programming framework. We describe the framework’s core aspects as well as some of the challenges that parallel and distributed computing entail.
Notes
- 1.
- 2.
Read, Eval, Print and Loop. See http://docs.scala-lang.org/overviews/repl/overview.html.
- 3.
See more options by running spark-shell --help.
- 4.
A more technical description of RDDs can be found in [4].
- 5.
Unless stated otherwise, Spark decides the number of partitions into which a file is divided based on the file block size: 32 MB on a local file system and 128 MB on YARN. The minimum number of partitions is 2, which is the case for small files such as README.md (3.8 KB).
- 6.
Immutability is a key concept in functional programming and an important aspect of reliable parallel programming.
- 7.
List of transformations: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations.
- 8.
Lineage is the official name given in the Spark documentation.
- 9.
List of actions: http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
Serialization occurs when data is sent over the network, e.g., when a shuffle operation takes place.
- 16.
- 17.
- 18.
- 19.
Weather from 2012: http://academictorrents.com/details/16be344abd95d58afd4860445f4a927b7eb1a89d.
- 20.
A DStream is a collection of RDDs, which are themselves collections of distributed elements. This might sound confusing, but bear with us.
- 21.
How to build self-contained applications: http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications.
- 22.
Library dependencies can be found in the Maven repository: https://mvnrepository.com/.
- 23.
- 24.
- 25.
A new feature called Continuous Processing, which will allow handling data elements as soon as they arrive, is being released.
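Note 5 above ties the number of partitions to the file block size and a minimum of 2. As a rough illustration of that rule only, the arithmetic can be sketched in plain Python. The function name `estimate_partitions`, the example file sizes, and the ceiling-division rule are assumptions made for this sketch; Spark's actual split computation is delegated to the underlying input format and may differ in detail.

```python
import math

MB = 1024 * 1024  # bytes in one mebibyte


def estimate_partitions(file_size_bytes, block_size_bytes, min_partitions=2):
    """Hypothetical helper: approximate the partition count for one file.

    Assumes one partition per (possibly partial) block, but never fewer
    than min_partitions. This is a simplification, not Spark's own code.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    return max(min_partitions, blocks)


# A small file (about 3.8 KB, taken here as 3,800 bytes) fits in a single
# block, so the minimum of 2 partitions applies.
readme = estimate_partitions(3_800, 32 * MB)

# A 100 MB file with 32 MB blocks spans four blocks.
large_local = estimate_partitions(100 * MB, 32 * MB)

# A 1 GB file with 128 MB blocks spans eight blocks.
large_yarn = estimate_partitions(1024 * MB, 128 * MB)

print(readme, large_local, large_yarn)
```

Under these assumptions, only files larger than two blocks get more than the minimum number of partitions.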
References
Jacobs, A. (2009). The pathologies of big data. Communications of the ACM, 52(8), 36–44.
Odersky, M., Spoon, L., & Venners, B. (2011). Programming in Scala (2nd ed.). Artima Press.
Torra, V. (2016). Scala: From a functional programming perspective: An introduction to the programming language. Cham, Switzerland: Springer.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., & Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI’12), p. 2. Berkeley, CA, USA: USENIX Association.
Copyright information
© 2019 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Ventocilla, E. (2019). Big Data programming with Apache Spark. In: Said, A., Torra, V. (eds) Data Science in Practice. Studies in Big Data, vol 46. Springer, Cham. https://doi.org/10.1007/978-3-319-97556-6_10
DOI: https://doi.org/10.1007/978-3-319-97556-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97555-9
Online ISBN: 978-3-319-97556-6
eBook Packages: Intelligent Technologies and Robotics (R0)