Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation

  • Krishna Karthik Gadiraju
  • Karen C. Davis
  • Paul G. Talaga
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8823)

Abstract

Many organizations rely on relational database platforms for OLAP-style querying (aggregation and filtering) in small- to medium-sized applications. We investigate the impact of scaling up the data sizes for such queries, to illustrate the performance an organization could expect if it migrated current applications to a big data environment. This paper benchmarks the performance of Hive [20], a parallel data warehouse platform that is part of the Hadoop software stack. We set up a 4-node Hadoop cluster using Hortonworks HDP 1.3.2 [10] and use the data generator provided by the TPC-DS benchmark [3] to generate data at different scales. We take a representative query from the TPC-DS query set and run its SQL and Hive Query Language (HiveQL) versions on a relational database installation (MySQL) and on the Hive cluster, respectively. We measure the query-execution speedup at each dataset size resulting from the scale-up. Hive loads the large datasets faster than MySQL, but is marginally slower than MySQL when loading the smaller datasets.
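The abstract does not state how speedup is computed. One common convention, assumed here for illustration only, is the ratio of the baseline (MySQL) time to the parallel (Hive) time at each dataset scale. The function names and the timings in the usage lines are hypothetical placeholders, not measurements from the paper.

```python
from typing import Dict


def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Return baseline time divided by parallel time.

    Assumed convention: a value above 1.0 means the parallel system
    (e.g. Hive) finished faster than the baseline (e.g. MySQL).
    """
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / parallel_seconds


def speedup_by_scale(
    baseline: Dict[int, float], parallel: Dict[int, float]
) -> Dict[int, float]:
    """Compute speedup for every scale factor present in both timing tables."""
    shared_scales = sorted(baseline.keys() & parallel.keys())
    return {s: speedup(baseline[s], parallel[s]) for s in shared_scales}


# Placeholder timings in seconds, keyed by a hypothetical scale factor.
# These numbers are made up to show the call shape only.
example_baseline = {1: 10.0, 10: 100.0}
example_parallel = {1: 20.0, 10: 25.0}
example_result = speedup_by_scale(example_baseline, example_parallel)
```

With these placeholder inputs, the smaller scale yields a speedup below 1 and the larger scale a speedup above 1, which is the qualitative pattern the abstract reports for data loading.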

Keywords

Hive, Hadoop, benchmarking, big data, SQL queries

References

  1. Baru, C., Bhandarkar, M., Nambiar, R., Poess, M., Rabl, T.: Setting the direction for big data benchmark standards. In: Nambiar, R., Poess, M. (eds.) TPCTC 2012. LNCS, vol. 7755, pp. 197–208. Springer, Heidelberg (2013)
  2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
  3. DSGen v1.1.0, data generation tool for TPC-DS, http://www.tpc.org/tpcds/
  4. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics (2013)
  5. GridMix program. Available in Hadoop source distribution: src/benchmarks/gridmix
  6. Gruenheid, A., Omiecinski, E., Mark, L.: Query optimization using column statistics in hive. In: Proceedings of the 15th Symposium on International Database Engineering & Applications, pp. 97–105. ACM (2011)
  7. Hadoop TeraSort program. Available in Hadoop source distribution since 0.19 version: src/examples/org/apache/hadoop/examples/terasort
  8. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: A Self-tuning System for Big Data Analytics. In: CIDR, vol. 11, pp. 261–272 (2011)
  9.
  10.
  11. Hortonworks Stinger Initiative, http://hortonworks.com/labs/stinger/
  12. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp. 41–51. IEEE (2010)
  13. Nambiar, R.O., Poess, M.: The making of TPC-DS. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 1049–1058. VLDB Endowment (2006)
  14. Pansare, N., Cai, Z.: Using Hive to perform medium-scale data analysis (2010)
  15. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 165–178. ACM (2009)
  16.
  17. Sort program. Available in Hadoop source distribution: src/examples/org/apache/hadoop/examples/sort
  18. Shi, Y., Meng, X., Zhao, J., Hu, X., Liu, B., Wang, H.: Benchmarking cloud-based data management systems. In: Proceedings of the Second International Workshop on Cloud Data Management, pp. 47–54. ACM (2010)
  19. TPC-DS benchmarking standard, http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
  20. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2), 1626–1629 (2009)
  21. White, T.: Hadoop: The definitive guide. O’Reilly (2012)

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Krishna Karthik Gadiraju (1)
  • Karen C. Davis (1)
  • Paul G. Talaga (1)
  1. Electrical Engineering and Computing Systems, University of Cincinnati, Cincinnati, USA
