Performance Evaluation of Big Data Analysis

Encyclopedia of Big Data Technologies

Synonyms

Big Data performance characterization

Definitions

Evaluating the performance of Big Data systems is the usual way to estimate the expected execution time of analytics applications, which are generally used to extract meaningful information from very large input datasets. Many high-level frameworks exist for Big Data analysis, each oriented to a different field, such as machine learning and data mining in the case of Mahout (Apache Mahout 2009) or graph analytics in the case of Giraph (Avery 2011). These high-level frameworks allow users to define complex data processing pipelines that are later decomposed into finer-grained operations, which are executed by Big Data processing frameworks like Hadoop (Dean and Ghemawat 2008), Spark (Zaharia et al. 2016), and Flink (Apache Flink 2014). Therefore, evaluating the performance of these processing frameworks is key to determining their suitability for scalable Big Data analysis.
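As a rough illustration of measuring execution time, the sketch below times a toy word-count job over several runs and reports the median and minimum. It is a minimal single-machine example under stated assumptions, not a method taken from this entry: the names `word_count` and `measure` and the synthetic dataset are hypothetical, and a real evaluation would run the workload on an actual framework and cluster.

```python
import statistics
import time
from collections import Counter


def word_count(lines):
    """Toy analytics job: tokenize each line, then count word occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def measure(job, data, runs=5):
    """Run job(data) `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        job(data)
        timings.append(time.perf_counter() - start)
    return timings


# Synthetic input (hypothetical data, for illustration only).
dataset = ["big data analysis needs performance evaluation"] * 10_000
timings = measure(word_count, dataset)
summary = {
    "runs": len(timings),
    "median_s": statistics.median(timings),
    "min_s": min(timings),
}
print(summary)
```

Repeating the run and reporting a median rather than a single measurement reduces the influence of one-off effects such as cache warm-up or background activity.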

Big Data processing frameworks can be broken down...

References

  • Apache Flink (2014) Scalable batch and stream data processing. http://flink.apache.org/ [Last visited: Dec 2017]

  • Apache Mahout (2009) Scalable machine learning and data mining. http://mahout.apache.org/ [Last visited: Dec 2017]

  • Avery C (2011) Giraph: large-scale graph processing infrastructure on Hadoop. In: 2011 Hadoop summit, Santa Clara, pp 5–9

  • Browne S, Dongarra J, Garner N, Ho G, Mucci P (2000) A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl 14(3):189–204

  • Chen C, Li K, Ouyang A, Tang Z, Li K (2017) GPU-accelerated parallel hierarchical extreme learning machine on Flink for Big Data. IEEE Trans Syst Man Cybern Syst 47(10):2740–2753

  • Choi IS, Yang W, Kee YS (2015) Early experience with optimizing I/O performance using high-performance SSDs for in-memory cluster computing. In: 2015 IEEE international conference on Big Data (IEEE BigData 2015), Santa Clara, pp 1073–1083

  • Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  • Enes J, Expósito RR, Touriño J (2017) Big Data watchdog: real-time monitoring and profiling. http://bdwatchdog.dec.udc.es [Last visited: Dec 2017]

  • Fadika Z, Govindaraju M, Canon R, Ramakrishnan L (2012) Evaluating Hadoop for data-intensive scientific operations. In: 5th IEEE international conference on cloud computing (CLOUD’12), Honolulu, pp 67–74

  • Fadika Z, Dede E, Govindaraju M, Ramakrishnan L (2014) MARIANE: using MApReduce in HPC environments. Futur Gener Comput Syst 36:379–388

  • Fang W, He B, Luo Q, Govindaraju NK (2011) Mars: accelerating MapReduce with graphics processors. IEEE Trans Parallel Distrib Syst 22(4):608–620

  • Gog I, Giceva J, Schwarzkopf M, Vaswani K, Vytiniotis D, Ramalingam G, Costa M, Murray D, Hand S, Isard M (2015) Broom: sweeping out garbage collection from Big Data systems. In: 15th workshop on hot topics in operating systems (HotOS’15), Kartause Ittingen

  • González P, Pardo XC, Penas DR, Teijeiro D, Banga JR, Doallo R (2017) Using the cloud for parameter estimation problems: comparing Spark vs MPI with a case-study. In: 17th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid 2017), Madrid, pp 797–806

  • Huang S, Huang J, Dai J, Xie T, Huang B (2010) The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: 26th IEEE international conference on data engineering workshops (ICDEW’10), Long Beach, pp 41–51

  • Lee YS, Quero LC, Kim SH, Kim JS, Maeng S (2016) ActiveSort: efficient external sorting using active SSDs in the MapReduce framework. Futur Gener Comput Syst 65:76–89

  • Li Z, Shen H (2017) Measuring scale-up and scale-out Hadoop with remote and local file systems and selecting the best platform. IEEE Trans Parallel Distrib Syst 28(11):3201–3214

  • Li M, Tan J, Wang Y, Zhang L, Salapura V (2017) SparkBench: a Spark benchmarking suite characterizing large-scale in-memory data analytics. Clust Comput 20(3):2575–2589

  • Liang F, Feng C, Lu X, Xu Z (2014) Performance benefits of DataMPI: a case study with BigDataBench. In: 4th workshop on Big Data benchmarks, performance optimization and emerging hardware (BPOE’14), Salt Lake City, pp 111–123

  • Loghin D, Tudor BM, Zhang H, Ooi BC, Teo YM (2015) A performance study of Big Data on small nodes. Proc VLDB Endowment 8(7):762–773

  • Lu M, Liang Y, Huynh HP, Ong Z, He B, Goh RSM (2015) MrPhi: an optimized MapReduce framework on Intel Xeon Phi coprocessors. IEEE Trans Parallel Distrib Syst 26(11):3066–3078

  • Lu L, Shi X, Zhou Y, Zhang X, Jin H, Pei C, He L, Geng Y (2016a) Lifetime-based memory management for distributed data processing systems. Proc VLDB Endowment 9(12):936–947

  • Lu X, Shankar D, Gugnani S, Panda DK (2016b) High-performance design of Apache Spark with RDMA and its benefits on various workloads. In: 2016 IEEE international conference on Big Data (IEEE BigData 2016), Washington, DC, pp 253–262

  • Malik M, Rafatirad S, Sasan A, Homayoun H (2015) System and architecture level characterization of Big Data applications on big and little core server architectures. In: 2015 IEEE international conference on Big Data (IEEE BigData 2015), Santa Clara, pp 85–94

  • Moon S, Lee J, Kee YS (2014) Introducing SSDs to the Hadoop MapReduce framework. In: 7th IEEE international conference on cloud computing (CLOUD’14), Anchorage, pp 272–279

  • Neshatpour K, Malik M, Ghodrat MA, Sasan A, Homayoun H (2015) Energy-efficient acceleration of Big Data analytics applications using FPGAs. In: 2015 IEEE international conference on Big Data (IEEE BigData 2015), Santa Clara, pp 115–123

  • Nguyen K, Fang L, Xu GH, Demsky B, Lu S, Alamian S, Mutlu O (2016) Yak: a high-performance Big-Data-friendly garbage collector. In: 12th USENIX symposium on operating systems design and implementation (OSDI’16), Savannah, pp 349–365

  • Sangroya A, Serrano D, Bouchenak S (2012) MRBS: towards dependability benchmarking for Hadoop MapReduce. In: 18th international Euro-par conference on parallel processing workshops (Euro-Par’12), Rhodes Island, pp 3–12

  • Veiga J, Expósito RR, Taboada GL, Touriño J (2015) MREv: an automatic MapReduce evaluation tool for Big Data workloads. In: International conference on computational science (ICCS’15), Reykjavík, pp 80–89

  • Veiga J, Expósito RR, Pardo XC, Taboada GL, Touriño J (2016a) Performance evaluation of Big Data frameworks for large-scale data analytics. In: 2016 IEEE international conference on Big Data (IEEE BigData 2016), Washington, DC, pp 424–431

  • Veiga J, Expósito RR, Taboada GL, Touriño J (2016b) Analysis and evaluation of MapReduce solutions on an HPC cluster. Comput Electr Eng 50:200–216

  • Veiga J, Expósito RR, Taboada GL, Touriño J (2016c) Flame-MR: an event-driven architecture for MapReduce applications. Futur Gener Comput Syst 65:46–56

  • Wang Y, Que X, Yu W, Goldenberg D, Sehgal D (2011) Hadoop acceleration through network levitated merge. In: International conference for high performance computing, networking, storage and analysis (SC’11), Seattle, pp 57:1–57:10

  • Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) BigDataBench: a Big Data benchmark suite from Internet services. In: 20th IEEE international symposium on high-performance computer architecture (HPCA’14), Orlando, pp 488–499

  • Wasi-Ur-Rahman M, Islam NS, Lu X, Jose J, Subramoni H, Wang H, Panda DK (2013) High-performance RDMA-based design of Hadoop MapReduce over InfiniBand. In: 27th IEEE international parallel and distributed processing symposium workshops and PhD forum (IPDPSW’13), Boston, pp 1908–1917

  • Xuan P, Ligon WB, Srimani PK, Ge R, Luo F (2017) Accelerating Big Data analytics on HPC clusters using two-level storage. Parallel Comput 61:18–34

  • Yang D, Zhong X, Yan D, Dai F, Yin X, Lian C, Zhu Z, Jiang W, Wu G (2013) NativeTask: a Hadoop compatible framework for high performance. In: 2013 IEEE international conference on Big Data (IEEE BigData’13), Santa Clara, pp 94–101

  • Yoo T, Yim M, Jeong I, Lee Y, Chun ST (2016) Performance evaluation of in-memory computing on scale-up and scale-out cluster. In: 8th international conference on ubiquitous and future networks (ICUFN’16), Vienna, pp 456–461

  • Yuan Y, Salmi MF, Huai Y, Wang K, Lee R, Zhang X (2016) Spark-GPU: an accelerated in-memory data processing engine on clusters. In: 2016 IEEE international conference on Big Data (IEEE BigData’16), Washington, DC, pp 273–283

  • Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for Big Data processing. Commun ACM 59(11):56–65

Corresponding author

Correspondence to Jorge Veiga.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this entry

Veiga, J., Expósito, R.R., Touriño, J. (2019). Performance Evaluation of Big Data Analysis. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_143
