Performance Evaluation of Big Data Analysis

Veiga, Jorge; Expósito, Roberto R.; Touriño, Juan

doi:10.1007/978-3-319-77525-8_143

Synonyms

Big Data performance characterization

Definitions

Evaluating the performance of Big Data systems is the usual way of obtaining information about the expected execution time of analytics applications. These applications are generally used to extract meaningful information from very large input datasets. Many high-level frameworks for Big Data analysis exist, each oriented to a different field, such as machine learning and data mining in the case of Mahout (Apache Mahout 2009) or graph analytics in the case of Giraph (Avery 2011). These high-level frameworks allow users to define complex data processing pipelines, which are later decomposed into finer-grained operations that are executed by Big Data processing frameworks such as Hadoop (Dean and Ghemawat 2008), Spark (Zaharia et al. 2016), and Flink (Apache Flink 2014). The performance evaluation of these processing frameworks is therefore key to determining their suitability for scalable Big Data analysis.
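As a rough illustration of what measuring the execution time of an analytics workload involves, the following minimal Python sketch times a toy word-count job over synthetic inputs of increasing size. It is not taken from the entry and does not use any of the frameworks named above: the function names (word_count, evaluate, make_dataset), the dataset sizes, and the choice of the median over five runs are all assumptions made for illustration only.

```python
import statistics
import time
from collections import Counter


def word_count(lines):
    """Toy analytics workload (hypothetical stand-in for a real job)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def evaluate(workload, dataset, repetitions=5):
    """Run the workload several times; return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload(dataset)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def make_dataset(num_lines):
    """Synthetic input (assumption): num_lines identical three-word lines."""
    return ["big data analysis"] * num_lines


if __name__ == "__main__":
    # Report how execution time changes as the input grows.
    for size in (10_000, 100_000, 1_000_000):
        seconds = evaluate(word_count, make_dataset(size))
        print(f"{size:>9} lines: {seconds:.4f} s (median of 5 runs)")
```

Repeating each run and reporting the median is one common way to reduce the effect of outliers; a real evaluation of a distributed framework would additionally have to account for cluster size, configuration, and resource usage.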

Big Data processing frameworks can be broken down...

References

Apache Flink (2014) Scalable batch and stream data processing. http://flink.apache.org/, [Last visited: Dec 2017]
Apache Mahout (2009) Scalable machine learning and data mining. http://mahout.apache.org/, [Last visited: Dec 2017]
Avery C (2011) Giraph: large-scale graph processing infrastructure on Hadoop. In: 2011 Hadoop summit, Santa Clara, pp 5–9
Browne S, Dongarra J, Garner N, Ho G, Mucci P (2000) A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl 14(3):189–204
Chen C, Li K, Ouyang A, Tang Z, Li K (2017) GPU-accelerated parallel hierarchical extreme learning machine on Flink for Big Data. IEEE Trans Syst Man Cybern Syst 47(10):2740–2753
Choi IS, Yang W, Kee YS (2015) Early experience with optimizing I/O performance using high-performance SSDs for in-memory cluster computing. In: 2015 IEEE international conference on Big Data (IEEE BigData 2015), Santa Clara, pp 1073–1083
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Enes J, Expósito RR, Touriño J (2017) Big Data watchdog: real-time monitoring and profiling. http://bdwatchdog.dec.udc.es, [Last visited: Dec 2017]
Fadika Z, Govindaraju M, Canon R, Ramakrishnan L (2012) Evaluating Hadoop for data-intensive scientific operations. In: 5th IEEE international conference on cloud computing (CLOUD’12), Honolulu, pp 67–74
Fadika Z, Dede E, Govindaraju M, Ramakrishnan L (2014) MARIANE: using MApReduce in HPC environments. Futur Gener Comput Syst 36:379–388
Fang W, He B, Luo Q, Govindaraju NK (2011) Mars: accelerating MapReduce with graphics processors. IEEE Trans Parallel Distrib Syst 22(4):608–620
Gog I, Giceva J, Schwarzkopf M, Vaswani K, Vytiniotis D, Ramalingam G, Costa M, Murray D, Hand S, Isard M (2015) Broom: sweeping out garbage collection from Big Data systems. In: 15th workshop on hot topics in operating systems (HotOS’15), Kartause Ittingen
González P, Pardo XC, Penas DR, Teijeiro D, Banga JR, Doallo R (2017) Using the cloud for parameter estimation problems: comparing Spark vs MPI with a case-study. In: 17th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid 2017), Madrid, pp 797–806
Huang S, Huang J, Dai J, Xie T, Huang B (2010) The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: 26th IEEE international conference on data engineering workshops (ICDEW’10), Long Beach, pp 41–51
Lee YS, Quero LC, Kim SH, Kim JS, Maeng S (2016) ActiveSort: efficient external sorting using active SSDs in the MapReduce framework. Futur Gener Comput Syst 65:76–89
Li Z, Shen H (2017) Measuring scale-up and scale-out Hadoop with remote and local file systems and selecting the best platform. IEEE Trans Parallel Distrib Syst 28(11):3201–3214
Li M, Tan J, Wang Y, Zhang L, Salapura V (2017) SparkBench: a Spark benchmarking suite characterizing large-scale in-memory data analytics. Clust Comput 20(3):2575–2589
Liang F, Feng C, Lu X, Xu Z (2014) Performance benefits of DataMPI: a case study with BigDataBench. In: 4th workshop on Big Data benchmarks, performance optimization and emerging hardware (BPOE’14), Salt Lake City, pp 111–123
Loghin D, Tudor BM, Zhang H, Ooi BC, Teo YM (2015) A performance study of Big Data on small nodes. Proc VLDB Endowment 8(7):762–773
Lu M, Liang Y, Huynh HP, Ong Z, He B, Goh RSM (2015) MrPhi: an optimized MapReduce framework on Intel Xeon Phi coprocessors. IEEE Trans Parallel Distrib Syst 26(11):3066–3078
Lu L, Shi X, Zhou Y, Zhang X, Jin H, Pei C, He L, Geng Y (2016a) Lifetime-based memory management for distributed data processing systems. Proc VLDB Endowment 9(12):936–947
Lu X, Shankar D, Gugnani S, Panda DK (2016b) High-performance design of Apache Spark with RDMA and its benefits on various workloads. In: 2016 IEEE international conference on Big Data (IEEE BigData 2016), Washington, DC, pp 253–262
Malik M, Rafatirad S, Sasan A, Homayoun H (2015) System and architecture level characterization of Big Data applications on big and little core server architectures. In: 2015 IEEE international conference on Big Data (IEEE BigData 2015), Santa Clara, pp 85–94
Moon S, Lee J, Kee YS (2014) Introducing SSDs to the Hadoop MapReduce framework. In: 7th IEEE international conference on cloud computing (CLOUD’14), Anchorage, pp 272–279
Neshatpour K, Malik M, Ghodrat MA, Sasan A, Homayoun H (2015) Energy-efficient acceleration of Big Data analytics applications using FPGAs. In: 2015 IEEE international conference on Big Data (IEEE BigData 2015), Santa Clara, pp 115–123
Nguyen K, Fang L, Xu GH, Demsky B, Lu S, Alamian S, Mutlu O (2016) Yak: a high-performance Big-Data-friendly garbage collector. In: 12th USENIX symposium on operating systems design and implementation (OSDI’16), Savannah, pp 349–365
Sangroya A, Serrano D, Bouchenak S (2012) MRBS: towards dependability benchmarking for Hadoop MapReduce. In: 18th international Euro-par conference on parallel processing workshops (Euro-Par’12), Rhodes Island, pp 3–12
Veiga J, Expósito RR, Taboada GL, Touriño J (2015) MREv: an automatic MapReduce evaluation tool for Big Data workloads. In: International conference on computational science (ICCS’15), Reykjavík, pp 80–89
Veiga J, Expósito RR, Pardo XC, Taboada GL, Touriño J (2016a) Performance evaluation of Big Data frameworks for large-scale data analytics. In: 2016 IEEE international conference on Big Data (IEEE BigData 2016), Washington, DC, pp 424–431
Veiga J, Expósito RR, Taboada GL, Touriño J (2016b) Analysis and evaluation of MapReduce solutions on an HPC cluster. Comput Electr Eng 50:200–216
Veiga J, Expósito RR, Taboada GL, Touriño J (2016c) Flame-MR: an event-driven architecture for MapReduce applications. Futur Gener Comput Syst 65:46–56
Wang Y, Que X, Yu W, Goldenberg D, Sehgal D (2011) Hadoop acceleration through network levitated merge. In: International conference for high performance computing, networking, storage and analysis (SC’11), Seattle, pp 57:1–57:10
Wang L, Zhan J, Luo C, Zhu Y, Yang Q, He Y, Gao W, Jia Z, Shi Y, Zhang S, Zheng C, Lu G, Zhan K, Li X, Qiu B (2014) BigDataBench: a Big Data benchmark suite from Internet services. In: 20th IEEE international symposium on high-performance computer architecture (HPCA’14), Orlando, pp 488–499
Wasi-Ur-Rahman M, Islam NS, Lu X, Jose J, Subramoni H, Wang H, Panda DK (2013) High-performance RDMA-based design of Hadoop MapReduce over InfiniBand. In: 27th IEEE international parallel and distributed processing symposium workshops and PhD forum (IPDPSW’13), Boston, pp 1908–1917
Xuan P, Ligon WB, Srimani PK, Ge R, Luo F (2017) Accelerating Big Data analytics on HPC clusters using two-level storage. Parallel Comput 61:18–34
Yang D, Zhong X, Yan D, Dai F, Yin X, Lian C, Zhu Z, Jiang W, Wu G (2013) NativeTask: a Hadoop compatible framework for high performance. In: 2013 IEEE international conference on Big Data (IEEE BigData’13), Santa Clara, pp 94–101
Yoo T, Yim M, Jeong I, Lee Y, Chun ST (2016) Performance evaluation of in-memory computing on scale-up and scale-out cluster. In: 8th international conference on ubiquitous and future networks (ICUFN’16), Vienna, pp 456–461
Yuan Y, Salmi MF, Huai Y, Wang K, Lee R, Zhang X (2016) Spark-GPU: an accelerated in-memory data processing engine on clusters. In: 2016 IEEE international conference on Big Data (IEEE BigData’16), Washington, DC, pp 273–283
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for Big Data processing. Commun ACM 59(11):56–65

Author information

Authors and Affiliations

Computer Architecture Group, Universidade da Coruña, A Coruña, Spain
Jorge Veiga, Roberto R. Expósito & Juan Touriño

Corresponding author

Correspondence to Jorge Veiga.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

Cite this entry

Veiga, J., Expósito, R.R., Touriño, J. (2019). Performance Evaluation of Big Data Analysis. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_143

DOI: https://doi.org/10.1007/978-3-319-77525-8_143
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
