The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems

  • Conference paper
Specifying Big Data Benchmarks (WBDB 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8163)

Abstract

We live in an era of big data, and big data applications are becoming increasingly pervasive. How to benchmark data center computer systems running big data applications (big data systems for short) is a hot topic. In this paper, we focus on measuring the performance impact of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, our experiments yield two major results. First, data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as the data scale increases; hence, benchmarking big data systems must consider not only a variety of data sets but also a variety of applications.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Jia, Z. et al. (2014). The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems. In: Rabl, T., Poess, M., Baru, C., Jacobsen, HA. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_5

  • DOI: https://doi.org/10.1007/978-3-642-53974-9_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53973-2

  • Online ISBN: 978-3-642-53974-9

  • eBook Packages: Computer Science (R0)
