An Efficient Solution for Processing Skewed MapReduce Jobs

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9262)

Abstract

Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized on several points, in particular its poor performance in the presence of data skew. In important cases, a high percentage of the reduce-side processing is done by a few nodes, or even a single node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, no solution has been proposed for the cases where most of the intermediate values correspond to a single key, or where the number of keys is smaller than the number of reduce workers.

In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and deals efficiently with reduce-side data skew. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which dynamically constructed blocks of intermediate values are processed in parallel by intermediate reduce workers, according to a scheduling strategy. With the IR phase, even if all intermediate values belong to a single key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g., gains of more than 10 times in reduce time and 5 times in total execution time.
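
The idea behind the IR phase can be illustrated with a minimal Python sketch. This is not FP-Hadoop's implementation or API: the function names (`make_blocks`, `intermediate_reduce`, `final_reduce`), the fixed block size, and the thread pool standing in for intermediate reduce workers are all assumptions made for illustration. The sketch also assumes the reduce function can be applied to partial results (as with a sum), so that reducing blocks first and then reducing the partial results gives the same answer as a single reduce over all values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def make_blocks(values: Sequence[int], block_size: int) -> List[Sequence[int]]:
    """Split the intermediate values of one key into fixed-size blocks.

    Hypothetical helper: a fixed block size is used here only for simplicity.
    """
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def intermediate_reduce(
    values: Sequence[int],
    reduce_fn: Callable[[Sequence[int]], int],
    block_size: int,
    workers: int,
) -> List[int]:
    """Reduce each block independently, in parallel, yielding partial results."""
    blocks = make_blocks(values, block_size)
    # The thread pool stands in for a set of intermediate reduce workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reduce_fn, blocks))


def final_reduce(partials: Sequence[int], reduce_fn: Callable[[Sequence[int]], int]) -> int:
    """Combine the partial results; only this small step runs on one worker."""
    return reduce_fn(partials)


# Extreme skew: every intermediate value belongs to the same key.
values = list(range(1, 1001))
partials = intermediate_reduce(values, sum, block_size=100, workers=4)
total = final_reduce(partials, sum)
```

Here the 1000 values of the single key are split into 10 blocks that are reduced in parallel, and the final step only has to combine 10 partial sums instead of processing every value on one worker.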

Notes

  1. http://dumps.wikimedia.org/other/pagecounts-raw/.

  2. This program is just for illustration; it is actually possible to write more efficient code by leveraging the sorting mechanisms of MapReduce.

  3. http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia.

  4. This mechanism is used for communication between the master and workers.

  5. http://dumps.wikimedia.org/other/pagecounts-raw/.

  6. http://dumps.wikimedia.org/enwiki/latest/.

  7. http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html.

  8. http://webdatacommons.org/hyperlinkgraph/.

  9. https://code.google.com/p/skewtune/

Acknowledgments

Experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several universities as well as other funding bodies (see https://www.grid5000.fr).

Author information

Corresponding author

Correspondence to Reza Akbarinia.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Akbarinia, R., Liroz-Gistau, M., Agrawal, D., Valduriez, P. (2015). An Efficient Solution for Processing Skewed MapReduce Jobs. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9262. Springer, Cham. https://doi.org/10.1007/978-3-319-22852-5_35

  • DOI: https://doi.org/10.1007/978-3-319-22852-5_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22851-8

  • Online ISBN: 978-3-319-22852-5

  • eBook Packages: Computer Science, Computer Science (R0)
