MapReduce – The Scalable Distributed Data Processing Solution

Anjum, Bushra

doi:10.1007/978-3-319-93109-8_7

Abstract

MapReduce is a programming paradigm for processing massive data sets in a scalable, parallel manner on a cluster of distributed compute nodes. This chapter provides background on the MapReduce programming paradigm and framework, highlighting its significance and its current use for large-scale data processing. Along the way, students are introduced to important concepts such as Big Data, scalability, parallelization, and divide and conquer. The chapter provides ample examples, at both beginner and advanced levels, so that students become proficient in recognizing problems suited to a MapReduce solution and in defining efficient Map and Reduce functions for those problems.
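
To make the idea of Map and Reduce functions concrete, here is a minimal single-process sketch of a word count, a common introductory MapReduce problem. It is not taken from the chapter: the names `map_fn`, `shuffle`, `reduce_fn`, and `word_count` are hypothetical, and the sequential `shuffle` step only stands in for the grouping a real framework would perform across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (assumed stand-in for the framework)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine all counts emitted for one word into a total."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the map, shuffle, and reduce phases sequentially in one process."""
    intermediate = (pair for line in lines for pair in map_fn(line))
    grouped = shuffle(intermediate)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())
```

Because each call to `map_fn` depends only on its own line, and each call to `reduce_fn` only on its own key, both phases could in principle be spread over many nodes, which is the divide and conquer structure the abstract refers to.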

Author information

Authors and Affiliations

Technical Lead and Senior Software Engineer, Amazon, San Luis Obispo, CA, USA
Bushra Anjum

Corresponding author

Correspondence to Bushra Anjum.

Editor information

Editors and Affiliations

Georgia State University, Atlanta, GA, USA
Sushil K. Prasad
IBM Research AI, Yorktown Heights, NY, USA
Anshul Gupta
University of Massachusetts Amherst, Amherst, MA, USA
Arnold Rosenberg
University of Maryland, College Park, MD, USA
Alan Sussman
University of Massachusetts Amherst, Amherst, MA, USA
Charles Weems

About this chapter

Cite this chapter

Anjum, B. (2018). MapReduce – The Scalable Distributed Data Processing Solution. In: Prasad, S., Gupta, A., Rosenberg, A., Sussman, A., Weems, C. (eds) Topics in Parallel and Distributed Computing. Springer, Cham. https://doi.org/10.1007/978-3-319-93109-8_7

DOI: https://doi.org/10.1007/978-3-319-93109-8_7
Published: 30 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93108-1
Online ISBN: 978-3-319-93109-8
eBook Packages: Computer Science, Computer Science (R0)
