Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments

Poggi, Nicolas; Montero, Alejandro; Carrera, David

doi:10.1007/978-3-319-72401-0_5

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 10661)

Included in the following conference series:

Technology Conference on Performance Evaluation and Benchmarking

Abstract

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) that require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning. However, there is currently no widespread knowledge of the resource requirements and expected performance of each query, as there is for more established benchmarks. Moreover, over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the stable release of v2. Our intent is to compare the current state of Spark to Hive’s base implementation, which can use the legacy M/R engine and Mahout or the current Tez and MLlib frameworks. At the same time, cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. The study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. It also compares popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries are the most resource consuming in terms of CPU, memory, and I/O. Scalability results show a need for configuration tuning in most cloud providers as data scale grows, especially regarding Spark’s memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. The results also show how Hive and Spark compare and what performance can be expected of each in PaaS.


Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493).

Author information

Authors and Affiliations

Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC-BarcelonaTech), Barcelona, Spain
Nicolas Poggi, Alejandro Montero & David Carrera

Corresponding author

Correspondence to Nicolas Poggi.

Editor information

Editors and Affiliations

Cisco Systems, Inc., San Jose, California, USA
Raghunath Nambiar
Server Technologies, Oracle Corporation, Redwood Shores, California, USA
Meikel Poess

About this paper

Cite this paper

Poggi, N., Montero, A., Carrera, D. (2018). Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments. In: Nambiar, R., Poess, M. (eds) Performance Evaluation and Benchmarking for the Analytics Era. TPCTC 2017. Lecture Notes in Computer Science, vol 10661. Springer, Cham. https://doi.org/10.1007/978-3-319-72401-0_5

DOI: https://doi.org/10.1007/978-3-319-72401-0_5
Published: 30 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72400-3
Online ISBN: 978-3-319-72401-0
eBook Packages: Computer Science, Computer Science (R0)
