MapReduce – The Scalable Distributed Data Processing Solution

Topics in Parallel and Distributed Computing

Abstract

MapReduce is a programming paradigm for processing massive data sets in a scalable, parallel fashion on a cluster of distributed compute nodes. This chapter provides background on the MapReduce programming paradigm and framework, highlighting its significance and its use for large-scale data processing today. Along the way, students are introduced to important concepts such as Big Data, scalability, parallelization, and divide and conquer. The chapter provides ample examples, at both beginner and advanced levels, so that students become proficient in recognizing problems suited to a MapReduce solution and in defining efficient Map and Reduce functions for those data sets.
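
To make the idea of Map and Reduce functions concrete, here is a minimal single-process sketch of the classic word-count problem. It is not taken from the chapter. The names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are illustrative assumptions, and the in-memory grouping step only stands in for the shuffle that a real framework such as Hadoop would perform across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key. A real framework does this across nodes;
    here it is a plain in-memory dictionary (an assumption for the sketch)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all counts emitted for one word into a total."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a small input."""
    pairs = (pair for line in lines for pair in map_fn(line))
    grouped = shuffle(pairs)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data crunching"]))
```

Each call to `map_fn` depends only on its own line, and each call to `reduce_fn` only on its own key, so on a cluster both phases could in principle run in parallel over separate partitions of the data. This is the divide and conquer structure the abstract refers to.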

Corresponding author

Correspondence to Bushra Anjum.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Anjum, B. (2018). MapReduce – The Scalable Distributed Data Processing Solution. In: Prasad, S., Gupta, A., Rosenberg, A., Sussman, A., Weems, C. (eds) Topics in Parallel and Distributed Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-93109-8_7

  • DOI: https://doi.org/10.1007/978-3-319-93109-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93108-1

  • Online ISBN: 978-3-319-93109-8

  • eBook Packages: Computer Science (R0)
