Abstract
MapReduce has emerged as a very popular programming model for large-scale data analytics. Despite its industry-wide acceptance, the open source Apache™ Hadoop™ framework for MapReduce remains difficult to optimize, particularly in large-scale production environments. The vast search space defined by the hundreds of MapReduce configuration parameters, together with the complex interactions between them, makes manual tuning time consuming, so an automated approach is needed. In this paper we evaluate approaches to the automatic tuning of Hadoop MapReduce, including those based on cost-based and machine learning models. We determine that they are inadequate and instead propose a search-based approach called Gunther for Hadoop MapReduce optimization. Gunther uses a Genetic Algorithm that is specially designed to aggressively identify parameter settings that result in near-optimal job execution time. We evaluate Gunther on two types of clusters with different resource characteristics. Our experiments demonstrate that Gunther can obtain near-optimal performance within a small number of trials (fewer than 30), outperforming existing auto-tuning solutions and industry-recommended configurations. We also describe a methodology for reducing the dimensionality of the auto-tuning problem, further improving search efficiency without sacrificing performance improvement.
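To make the idea of search-based tuning concrete, the following is a minimal sketch of a generic genetic algorithm searching a small configuration space under a fixed trial budget. It is not Gunther's algorithm, whose specific design is not given in the abstract. The parameter names are real Hadoop 1.x settings, but their ranges, the `synthetic_runtime` cost function (a stand-in for actually running a job), and all helper names are assumptions made purely for illustration.

```python
import random

# Hypothetical search space: the ranges are illustrative assumptions,
# not values taken from the paper.
SPACE = {
    "io.sort.mb": (50, 500),
    "io.sort.factor": (10, 100),
    "mapred.reduce.tasks": (1, 64),
}


def synthetic_runtime(cfg):
    """Made-up runtime in seconds; stands in for executing a real job."""
    return (100.0
            + abs(cfg["io.sort.mb"] - 300) * 0.2
            + abs(cfg["io.sort.factor"] - 60) * 0.5
            + abs(cfg["mapred.reduce.tasks"] - 24) * 1.5)


def random_config(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each parameter is copied from one parent at random.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}


def mutate(cfg, rng, rate=0.3):
    # Re-sample each parameter independently with probability `rate`.
    out = dict(cfg)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out


def ga_search(evaluate, pop_size=6, generations=4, elite=2, seed=0):
    """Return (best_cfg, best_time, trials, history of best time per generation)."""
    rng = random.Random(seed)
    trials = 0
    scored = []
    for _ in range(pop_size):
        cfg = random_config(rng)
        scored.append((evaluate(cfg), cfg))
        trials += 1
    history = [min(t for t, _ in scored)]
    for _ in range(generations):
        scored.sort(key=lambda pair: pair[0])
        parents = [cfg for _, cfg in scored[:max(elite, 2)]]
        next_gen = scored[:elite]  # elites survive without being re-evaluated
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            child = mutate(crossover(a, b, rng), rng)
            next_gen.append((evaluate(child), child))
            trials += 1
        scored = next_gen
        history.append(min(t for t, _ in scored))
    best_time, best_cfg = min(scored, key=lambda pair: pair[0])
    return best_cfg, best_time, trials, history


if __name__ == "__main__":
    cfg, runtime, trials, history = ga_search(synthetic_runtime)
    print(f"best config after {trials} trials: {cfg} ({runtime:.1f} s)")
```

With the default settings the sketch spends 6 + 4 × 4 = 22 evaluations, which keeps it within a budget of fewer than 30 trials. Because the two best configurations are carried over unchanged, the best runtime never gets worse from one generation to the next. In a real setting, `evaluate` would launch the job with the candidate configuration and return its measured execution time.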
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liao, G., Datta, K., Willke, T.L. (2013). Gunther: Search-Based Auto-Tuning of MapReduce. In: Wolf, F., Mohr, B., an Mey, D. (eds) Euro-Par 2013 Parallel Processing. Euro-Par 2013. Lecture Notes in Computer Science, vol 8097. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40047-6_42
DOI: https://doi.org/10.1007/978-3-642-40047-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40046-9
Online ISBN: 978-3-642-40047-6
eBook Packages: Computer Science, Computer Science (R0)