How Good is Query Optimizer in Spark?

  • Zujie RenEmail author
  • Na Yun
  • Youhuizi Li
  • Jian Wan
  • Yuan Wang
  • Lihua Yu
  • Xinxin Fan
Conference paper
Part of the Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering book series (LNICST, volume 268)


In the big data community, Spark plays an important role and is used to process interactive queries. Spark employs a query optimizer, called Catalyst, to interpret SQL queries to optimized query execution plans. Catalyst contains a number of optimization rules and supports cost-based optimization. Although query optimization techniques have been well studied in the field of relational database systems, the effectiveness of Catalyst in Spark is still unclear. In this paper, we investigated the effectiveness of rule-based and cost-based optimization in Catalyst, meanwhile, we obtained a set of comparative experiments by varying the data volume and the number of nodes. It is found that even when applied query optimizations, the execution time of most TPC-H queries were slightly reduced. Some interesting observations were made on Catalyst, which can enable the community to have a better understanding and improvement of the query optimizer in Spark.


Spark SQL Catalyst Query optimization 



This work is supported by Key Research and Development Program of Zhejiang Province (No. 2018C01098), and the Natural Science Foundation of Zhejiang Province (NO. LY18F020014).


  1. 1.
    Taylor, R.C.: An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinform. 11, S1 (2010)MathSciNetCrossRefGoogle Scholar
  2. 2.
    Melnik, S., et al.: Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow. 3(1–2), 330–339 (2010)CrossRefGoogle Scholar
  3. 3.
    Ducarme, P., Rahman, M., Brasseur, R.: IMPALA: a simple restraint field to simulate the biological membrane in molecular structure studies. Proteins Struct. Funct. Bioinform. 30(4), 357–371 (1998)CrossRefGoogle Scholar
  4. 4.
    Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: USENIX Conference on Hot Topics in Cloud Computing, p. 10 (2010)Google Scholar
  5. 5.
    Salloum, S., Dautov, R., Chen, X., Peng, P.X., Huang, J.Z.: Big data analytics on apache spark. Int. J. Data Sci. Anal. 1(3–4), 145–164 (2016)CrossRefGoogle Scholar
  6. 6.
    Armbrust, M., et al.: Spark SQL: relational data processing in spark. In: SIGMOD 2015, pp. 1383–1394. ACM (2015)Google Scholar
  7. 7.
    Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)CrossRefGoogle Scholar
  8. 8.
    Ma, J., et al.: Logical query optimization for cloudera impala system. J. Syst. Softw. 125, 35–46 (2017)CrossRefGoogle Scholar
  9. 9.
    Naacke, H., Curé, O., Amann, B.: SPARQL query processing with apache spark. arXiv preprint arXiv:1604.08903 (2016)
  10. 10.
    Graefe, G.: The cascades framework for query optimization. IEEE Data Eng. Bull. 18(3), 19–29 (1995)Google Scholar
  11. 11.
    Esawi, A.M.K., Ashby, M.F.: Cost-based ranking for manufacturing process selection. In: Batoz, J.L., Chedmail, P., Cognet, G., Fortin, C. (eds.) Integrated Design and Manufacturing in Mechanical Engineering, pp. 603–610. Springer, Dordrecht (1999). Scholar
  12. 12.
    Wu, J.-M., Zhou, J.: Research of optimization rule of SQL based on oracle database. J. Shaanxi Univ. Technol. (2013)Google Scholar
  13. 13.
    Antoshenkov, G., Ziauddin, M.: Query processing and optimization in oracle RDB. VLDB J. Int. J. Very Large Data Bases 5(4), 229–237 (1996)CrossRefGoogle Scholar
  14. 14.
    Chaudhuri, S.: An overview of query optimization in relational systems. In: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 34–43. ACM (1998)Google Scholar
  15. 15.
    Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4(11), 1111–1122 (2011)Google Scholar
  16. 16.
    Chiba, T., Onodera, T.: Workload characterization and optimization of TPC-H queries on apache spark. In: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 112–121. IEEE (2016)Google Scholar
  17. 17.
    Liang, W., Zheng, Y.: TPC-H analysis and test tool design. Comput. Eng. Appl. (2007)Google Scholar
  18. 18.
    Transaction processing performance council.
  19. 19.
    Ioannidis, Y.E.: Query optimization. ACM Comput. Surv. (CSUR) 28(1), 121–123 (1996)CrossRefGoogle Scholar
  20. 20.
    Roy, P., Seshadri, S., Sudarshan, S., Bhobe, S.: Efficient and extensible algorithms for multi query optimization. ACM SIGMOD Rec. 29, 249–260 (2000)CrossRefGoogle Scholar
  21. 21.
    Graefe, G., DeWitt, D.J.: The EXODUS Optimizer Generator, vol. 16. ACM (1987)Google Scholar
  22. 22.
    Barbas, P.M.: Database query optimization, 21 January 2014. US Patent 8,635,206Google Scholar
  23. 23.
    Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., Neumann, T.: How good are query optimizers, really? Proc. VLDB Endow. 9(3), 204–215 (2015)CrossRefGoogle Scholar
  24. 24.
    Kocsis, Z.A., Drake, J.H., Carson, D., Swan, J.: Automatic improvement of apache spark queries using semantics-preserving program reduction. In: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, pp. 1141–1146. ACM (2016)Google Scholar
  25. 25.
    Liu, C.: Research on SparkSQL query optimization based on cost model (2016)Google Scholar
  26. 26.
    Zhang, L.: Research on query analysis and optimization based on spark system (2016)Google Scholar
  27. 27.

Copyright information

© ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering 2019

Authors and Affiliations

  • Zujie Ren
    • 1
    Email author
  • Na Yun
    • 1
  • Youhuizi Li
    • 1
  • Jian Wan
    • 2
  • Yuan Wang
    • 3
  • Lihua Yu
    • 3
  • Xinxin Fan
    • 3
  1. 1.School of Computer ScienceHangzhou Dianzi UniversityHangzhouChina
  2. 2.Department of Software EngineeringZhejiang University of Science and TechnologyHangzhouChina
  3. 3.Key Enterprise Research Institute of NetEase Big Data of Zhejiang ProvinceNetease Hangzhou, Network Co. Ltd.HangzhouChina

Personalised recommendations