Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments

  • Conference paper
Performance Evaluation and Benchmarking for the Analytics Era (TPCTC 2017)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10661)

Abstract

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) that require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning. However, there is currently no widespread knowledge of the resource requirements and expected performance of each query, as there is for more established benchmarks. Moreover, over the last year, the Spark framework and APIs have evolved very rapidly, with major performance improvements and the stable release of v2. Our intent is to compare the current state of Spark to Hive’s base implementation, which can use the legacy M/R engine and Mahout or the current Tez and MLlib frameworks. At the same time, cloud providers now offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. This study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. It also compares popular PaaS offerings, namely Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc, in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries consume the most CPU, memory, and I/O. Scalability results show that most cloud providers need configuration tuning as data scale grows, especially regarding Spark’s memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. The results also show how Hive and Spark compare and what performance can be expected of each in PaaS.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493).

Author information

Corresponding author

Correspondence to Nicolas Poggi.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Poggi, N., Montero, A., Carrera, D. (2018). Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking for the Analytics Era. TPCTC 2017. Lecture Notes in Computer Science, vol 10661. Springer, Cham. https://doi.org/10.1007/978-3-319-72401-0_5

  • DOI: https://doi.org/10.1007/978-3-319-72401-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-72400-3

  • Online ISBN: 978-3-319-72401-0

  • eBook Packages: Computer Science, Computer Science (R0)
