Getting Started with Spark

  • K. G. Srinivasa
  • Anil Kumar Muppalla
Part of the Computer Communications and Networks book series (CCN)


Cluster computing has seen the rise of improved and popular computing models in which clusters execute data-parallel computations on unreliable machines. This is enabled by software systems that provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce [1] pioneered this model, while systems such as Map-Reduce-Merge [2] and Dryad [3] have generalized it to other data-flow types. These systems are scalable and fault tolerant because they provide a programming model that enables users to create acyclic data-flow graphs that pass input data through a set of operators. This model lets the system schedule work and react to faults without any user intervention. While the model applies to many applications, there are problems that cannot be solved efficiently by acyclic data flows.
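The acyclic data-flow idea described above can be illustrated with plain Scala collections (Scala being the language Spark itself is implemented in [18]). The sketch below is a hypothetical word count, not actual MapReduce or Spark code: each stage (map, group, reduce) is a node in an acyclic graph, so if a stage's output is lost it can simply be recomputed from its inputs without user intervention.

```scala
// Word count expressed as an acyclic pipeline of data-parallel operators.
// Each stage depends only on the output of the previous one (no cycles),
// which is what lets a scheduler re-run any lost stage independently.
object WordCountFlow {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))              // "map" stage: emit one record per word
      .filter(_.nonEmpty)                    // drop empty tokens
      .groupBy(identity)                     // "shuffle" stage: group records by key
      .map { case (w, ws) => (w, ws.size) }  // "reduce" stage: aggregate per key

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark makes clusters fast", "spark is fast"))
    println(counts("spark")) // 2
  }
}
```

In a real distributed system each stage would run in parallel across machines, with the framework handling partitioning and fault recovery; the operator chain itself, however, has the same acyclic shape.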






References

  1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
  2. H. Yang, A. Dasdan, R. Hsiao, and D. S. Parker. Map-Reduce-Merge: Simplified relational data processing on large clusters. In SIGMOD '07, pages 1029–1040. ACM, 2007.
  3. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys 2007, pages 59–72, 2007.
  4. B. Hindman, A. Konwinski, M. Zaharia, and I. Stoica. A common substrate for cluster computing. In Workshop on Hot Topics in Cloud Computing (HotCloud) 2009, 2009.
  5. M. Zaharia et al. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2010.
  6. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD '08. ACM, 2008.
  7. Apache Hive.
  8. K. Li. IVY: A shared virtual memory system for parallel computing. In ICPP (2), 1988.
  9. B. Nitzberg and V. Lo. Distributed shared memory: A survey of issues and algorithms. Computer, 24(8):52–60, Aug. 1991.
  10. A.-M. Kermarrec, G. Cabillic, A. Gefflaut, C. Morin, and I. Puaut. A recoverable distributed shared memory integrating coherence and recoverability. In FTCS '95. IEEE Computer Society, 1995.
  11. R. Bose and J. Frew. Lineage retrieval for scientific data processing: A survey. ACM Computing Surveys, 37:1–28, 2005.
  12. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In SOSP '91. ACM, 1991.
  13. D. Gelernter. Generative communication in Linda. ACM Trans. Program. Lang. Syst., 7(1):80–112, 1985.
  14. B. Liskov, A. Adya, M. Castro, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, M. Day, and L. Shrira. Safe and efficient sharing of persistent objects in Thor. In SIGMOD '96, pages 318–329. ACM, 1996.
  15. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, pages 135–146, 2010.
  16. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC '10, 2010.
  17. Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow., 3:285–296, September 2010.
  18. Scala programming language.
  19. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. Technical Report UCB/EECS-2010-87, EECS Department, University of California, Berkeley, May 2010.
  20. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys 2010, April 2010.
  21. R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proc. OSDI 2010, 2010.
  22. Apache Spark. [Online].

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. M.S. Ramaiah Institute of Technology, Bangalore, India
