Machine-Learning Based Spark and Hadoop Workload Classification Using Container Performance Patterns

Genkin, Mikhail; Dehne, Frank; Navarro, Pablo; Zhou, Siyu

doi:10.1007/978-3-030-32813-9_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11459)

Included in the following conference series:

International Symposium on Benchmarking, Measuring and Optimization

Abstract

Big data Hadoop and Spark applications are deployed on infrastructure managed by resource managers such as Apache YARN, Mesos, and Kubernetes, and run in constructs called containers. These applications often require extensive manual tuning to achieve acceptable levels of performance. While there have been several promising attempts to develop automatic tuning systems, none are currently robust enough to handle realistic workload conditions. Big data workload analysis research performed to date has focused mostly on system-level parameters, such as CPU and memory utilization, rather than higher-level container metrics. In this paper we present the first detailed experimental analysis of container performance metrics in Hadoop and Spark workloads. We demonstrate that big data workloads show unique patterns of container creation, completion, response-time and relative standard deviation of response-time. Based on these observations, we built a machine-learning-based workload classifier with a workload classification accuracy of 83% and a workload change detection accuracy of 74%. Our observed experimental results are an important step towards developing automatically tuned, fully autonomous cloud infrastructure for big data analytics.
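
The four container metrics named in the abstract (creation, completion, response time, and relative standard deviation of response time) can be illustrated with a minimal sketch. The abstract does not describe the authors' feature extraction, so everything below is an assumption made for illustration: the function name `window_features`, the `(created_at, completed_at)` tuple layout, the fixed time window, and the use of the population standard deviation.

```python
from statistics import mean, pstdev


def window_features(containers, start, end):
    """Summarize container activity inside the half-open window [start, end).

    `containers` is a hypothetical list of (created_at, completed_at) pairs
    in seconds; completed_at is None for containers that are still running.
    """
    # Containers whose creation time falls inside the window.
    created = sum(1 for c, _ in containers if start <= c < end)

    # Containers that finished inside the window.
    finished = [(c, f) for c, f in containers
                if f is not None and start <= f < end]

    # Response time is taken here as completion time minus creation time.
    response = [f - c for c, f in finished]
    avg = mean(response) if response else 0.0

    # Relative standard deviation: standard deviation divided by the mean.
    rsd = pstdev(response) / avg if avg > 0 else 0.0

    return {
        "created": created,
        "completed": len(finished),
        "mean_response": avg,
        "rsd_response": rsd,
    }


if __name__ == "__main__":
    sample = [(0, 10), (0, 30), (40, None)]
    print(window_features(sample, 0, 35))
```

A vector of such per-window values is the kind of input a workload classifier could be trained on; which windows, features, and models the authors actually used is not stated in the abstract.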

Author information

Authors and Affiliations

School of Computer Science, Carleton University, Ottawa, Canada
Mikhail Genkin, Frank Dehne, Pablo Navarro & Siyu Zhou

Corresponding author

Correspondence to Mikhail Genkin.

Editor information

Editors and Affiliations

Chinese Academy of Sciences, Beijing, China
Chen Zheng
Chinese Academy of Sciences, Beijing, China
Jianfeng Zhan

About this paper

Cite this paper

Genkin, M., Dehne, F., Navarro, P., Zhou, S. (2019). Machine-Learning Based Spark and Hadoop Workload Classification Using Container Performance Patterns. In: Zheng, C., Zhan, J. (eds) Benchmarking, Measuring, and Optimizing. Bench 2018. Lecture Notes in Computer Science, vol 11459. Springer, Cham. https://doi.org/10.1007/978-3-030-32813-9_11

DOI: https://doi.org/10.1007/978-3-030-32813-9_11
Published: 08 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32812-2
Online ISBN: 978-3-030-32813-9
eBook Packages: Computer Science, Computer Science (R0)
