An Efficient Solution for Processing Skewed MapReduce Jobs

Akbarinia, Reza; Liroz-Gistau, Miguel; Agrawal, Divyakant; Valduriez, Patrick

doi:10.1007/978-3-319-22852-5_35

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9262)

Abstract

Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized on several points, in particular its poor performance in the presence of data skew. There are important cases where a high percentage of the reduce-side processing is done by a few nodes, or even a single node, while the others remain idle. There have been some attempts to address data skew, but only for specific cases. In particular, no solution has been proposed for the cases where most of the intermediate values correspond to a single key, or where the number of keys is smaller than the number of reduce workers.
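The skew scenario can be made concrete with a small simulation. The following is a minimal sketch, not code from the paper: `partition` is a hypothetical, deterministic stand-in for a hash partitioner (Hadoop's default partitioner similarly takes the key's hash modulo the number of reduce tasks), and the keys, counts, and number of reducers are invented for illustration.

```python
from collections import Counter

def partition(key: str, num_reducers: int) -> int:
    # Hypothetical stand-in for a hash partitioner: every value of a given
    # key is sent to the same reducer. The byte sum keeps it deterministic.
    return sum(key.encode("utf-8")) % num_reducers

def reducer_loads(pairs, num_reducers):
    # Count how many intermediate values each reducer receives.
    loads = Counter()
    for key, _value in pairs:
        loads[partition(key, num_reducers)] += 1
    return [loads.get(r, 0) for r in range(num_reducers)]

# Assumed workload: one hot key and two small keys, with 8 reduce workers.
pairs = [("en", 1)] * 1000 + [("fr", 1)] * 10 + [("de", 1)] * 10
loads = reducer_loads(pairs, num_reducers=8)
idle = loads.count(0)
print(loads, "idle reducers:", idle)
```

With three keys and eight reducers, at most three reducers can receive any work, and the reducer that owns the hot key receives almost all of it.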

In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and deals efficiently with reduce-side data skew. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which dynamically constructed blocks of intermediate values are processed in parallel by intermediate reduce workers according to a scheduling strategy. With the IR phase, even if all intermediate values belong to a single key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. speedups of more than 10 times in reduce time and 5 times in total execution time.
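The idea of the IR phase can be illustrated with a short sketch. This is not FP-Hadoop's implementation or API: the function names, the fixed block size, and the thread pool standing in for intermediate reduce workers are all assumptions made for illustration, and the example assumes a reduce function (here, a sum) whose partial results can be combined in any order.

```python
from concurrent.futures import ThreadPoolExecutor

def make_blocks(values, block_size):
    # Split the intermediate values of one key into blocks.
    # (Assumption: fixed-size blocks; the paper builds blocks dynamically.)
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]

def intermediate_reduce(block):
    # Partial aggregation over one block, run by an intermediate worker.
    return sum(block)

def final_reduce(partials):
    # The final reducer only combines the much smaller list of partials.
    return sum(partials)

def reduce_with_ir(values, block_size, num_workers):
    blocks = make_blocks(values, block_size)
    # The pool stands in for the intermediate reduce workers; it shows the
    # structure of the computation, not real cluster-level parallelism.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(intermediate_reduce, blocks))
    return final_reduce(partials), len(blocks)

# Assumed workload: all 1000 intermediate values belong to a single key.
values = [1] * 1000
total, num_blocks = reduce_with_ir(values, block_size=100, num_workers=4)
print(total, num_blocks)
```

Even though every value belongs to one key, the work is spread over ten blocks, and the final reduce step handles only ten partial results instead of a thousand values.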

Notes

1. http://dumps.wikimedia.org/other/pagecounts-raw/
2. This program is just for illustration; in practice, more efficient code can be written by leveraging the sorting mechanisms of MapReduce.
3. http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
4. This mechanism is used for communication between the master and workers.
5. http://dumps.wikimedia.org/other/pagecounts-raw/
6. http://dumps.wikimedia.org/enwiki/latest/
7. http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html
8. http://webdatacommons.org/hyperlinkgraph/
9. https://code.google.com/p/skewtune/

Acknowledgments

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

Author information

Authors and Affiliations

INRIA and LIRMM, Montpellier, France
Reza Akbarinia, Miguel Liroz-Gistau & Patrick Valduriez
University of California, Santa Barbara, USA
Divyakant Agrawal

Corresponding author

Correspondence to Reza Akbarinia.

Editor information

Editors and Affiliations

Hewlett-Packard Enterprise, Sunnyvale, California, USA
Qiming Chen
Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
Blaise Pascal University, Aubiere, France
Farouk Toumani
University of Linz, Linz, Austria
Roland Wagner
Universidad Politécnica de Valencia, Valencia, Spain
Hendrik Decker

About this paper

Cite this paper

Akbarinia, R., Liroz-Gistau, M., Agrawal, D., Valduriez, P. (2015). An Efficient Solution for Processing Skewed MapReduce Jobs. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9262. Springer, Cham. https://doi.org/10.1007/978-3-319-22852-5_35

DOI: https://doi.org/10.1007/978-3-319-22852-5_35
Published: 11 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22851-8
Online ISBN: 978-3-319-22852-5
eBook Packages: Computer Science, Computer Science (R0)
