A DAG Refactor Based Automatic Execution Optimization Mechanism for Spark

  • Hang Zhao
  • Yu Rao
  • Donghua Li
  • Jie Tang (corresponding author)
  • Shaoshan Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11783)


In today’s big data era, the traditional disk-based MapReduce framework has run into bottlenecks because of its low memory utilization and inefficient orchestration of complex tasks. Spark makes full use of memory resources, provides a rich set of data manipulation operators, and uses a DAG to express the dependences among them. It splits an entire job into multiple stages according to the DAG and schedules them in a distributed execution environment, which is better suited to the new characteristics of big data processing. However, Spark does not consider the resource requirements of different operators and schedules them indiscriminately. This can cause load imbalance across the nodes of the cluster and turn some nodes into bottlenecks because of their excessive resource consumption. In the past, solving this problem required developers to have extensive Spark experience and to write sophisticated code. In this paper, we propose a DAG refactor based automatic execution optimization mechanism for Spark. The experimental results show that the DAG refactor mechanism can improve Spark performance by up to 8.8X without changing the semantics of the original program.
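
To make the stage-splitting idea above concrete, here is a minimal sketch in plain Python. It is not the mechanism proposed in the paper and not Spark's actual scheduler. It assumes a purely linear chain of operators and a hand-picked set of shuffle-producing ("wide") operators, and it simply starts a new stage at each of them. The names `WIDE_OPERATORS` and `split_into_stages` are hypothetical and exist only for this illustration.

```python
# Illustrative only: a linear operator pipeline, not Spark's real DAG scheduler.
# Assumption: these operators need a shuffle (a wide dependency), so each one
# marks the boundary between two stages.
WIDE_OPERATORS = {"groupByKey", "reduceByKey", "join", "sortByKey", "repartition"}


def split_into_stages(operators):
    """Group a linear chain of operator names into stages.

    A new stage begins whenever an operator requires a shuffle, because its
    input must be fully produced by the previous stage first.
    """
    stages = []
    current = []
    for op in operators:
        # Close the running stage before a shuffle-producing operator.
        if op in WIDE_OPERATORS and current:
            stages.append(current)
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


pipeline = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey", "collect"]
stages = split_into_stages(pipeline)
for index, stage in enumerate(stages):
    print(f"stage {index}: {stage}")
```

In this simplified view, operators inside one stage run back to back on the same partition, while every stage boundary implies data movement between nodes. That is where differences in the resource demands of individual operators can show up as load imbalance.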


Big data, Spark, Semantic DAG, DAG refactor


Copyright information

© IFIP International Federation for Information Processing 2019

Authors and Affiliations

  • Hang Zhao (1)
  • Yu Rao (1)
  • Donghua Li (1)
  • Jie Tang (1)
  • Shaoshan Liu (2)
  1. South China University of Technology, Guangzhou, People’s Republic of China
  2. PerceptIn, Fremont, USA
