Abstract
Hadoop is a distributed data processing platform that supports the MapReduce parallel computing framework. Because Hadoop is designed for general-purpose workloads, it often needs to be accelerated for specific cases such as Hive jobs. By writing the current time to the logs at carefully selected points, we traced the workflow of a typical MapReduce job generated by Hive and collected timing statistics for every phase of the job. Using different data volumes, we compared the proportion of time spent in each phase and located the bottlenecks of Hadoop. We make two major optimization recommendations: (1) for large jobs that produce many intermediate results, focus on using a combiner and on optimizing network and disk I/O; (2) for short jobs, focus on optimizing the map function and disk I/O.
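The timing approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' instrumentation: the phase names (`map`, `shuffle`, `reduce`), the helper functions, and the timestamps are all hypothetical, chosen only to show how checkpoint times turn into per-phase proportions. It assumes each checkpoint label is unique and that the last checkpoint only closes the final phase.

```python
def phase_durations(marks):
    """Turn ordered (label, timestamp) checkpoints into per-phase durations.

    Each phase runs from its own checkpoint to the next one, so the last
    checkpoint only marks the end of the job. Labels are assumed unique.
    """
    return {
        label: next_ts - ts
        for (label, ts), (_, next_ts) in zip(marks, marks[1:])
    }


def phase_proportions(durations):
    """Express each phase duration as a fraction of the total job time."""
    total = sum(durations.values())
    if total == 0:
        return {label: 0.0 for label in durations}
    return {label: d / total for label, d in durations.items()}


# Illustrative checkpoints in seconds. These are made-up values,
# not measurements from the paper.
marks = [("map", 0.0), ("shuffle", 6.0), ("reduce", 8.0), ("job_end", 10.0)]

durations = phase_durations(marks)
proportions = phase_proportions(durations)

for label, share in proportions.items():
    print(f"{label}: {durations[label]:.1f}s ({share:.0%} of job time)")
```

Repeating the same calculation for several input sizes would give the kind of per-phase comparison the abstract describes; the phase with the largest share at a given size is the candidate bottleneck.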
Acknowledgement
This paper was partially supported by the National Sci-Tech Support Plan (2015BAH10F01), NSFC grants U1509216, 61472099, and 61133002, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (LC2016026).
Copyright information
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Li, J., Shi, S., Wang, H. (2016). Optimization Analysis of Hadoop. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 623. Springer, Singapore. https://doi.org/10.1007/978-981-10-2053-7_46
DOI: https://doi.org/10.1007/978-981-10-2053-7_46
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2052-0
Online ISBN: 978-981-10-2053-7
eBook Packages: Computer Science, Computer Science (R0)