Big SQL systems: an experimental evaluation

  • Victor Aluko
  • Sherif Sakr

Abstract

Recently, Big Data systems have become increasingly popular for handling the massive amounts of data that are continuously generated in our digital world. While the Hadoop framework pioneered the area of Big Data processing systems, it has clear performance limitations when processing massive amounts of structured data. In addition, many users of Big Data systems struggle with the APIs and low-level programming abstractions of these systems. They would prefer to use SQL, in which they are more proficient, as a high-level declarative language to express their tasks while leaving the execution optimization details to the backend engine. Thus, several systems have been designed and implemented to tackle these challenges by providing scalable query execution engines that process massive structured data while supporting SQL interfaces. In this article, we present an extensive experimental study of four popular systems in this domain, namely, Apache Hive, Spark SQL, Apache Impala and PrestoDB. In particular, we report and analyze the performance characteristics of these systems using three different benchmarks, namely, TPC-H, TPC-DS and TPCx-BB. Finally, we report a set of insights and important lessons that we have learned from conducting our experiments.

Keywords

Big data, Big SQL, Benchmarking

Notes

Acknowledgements

This work is funded by the European Regional Development Funds via the Mobilitas Plus Programme (Grant MOBTT75).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. University of Tartu, Tartu, Estonia