Abstract
The era of Big Data, which we entered several years ago, has changed our expectations about the types and volumes of data that can be processed, as well as the value that data holds. Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data today, and how much important and valuable information we can extract from it. Today, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. In this chapter, we briefly describe the Hadoop ecosystem, focusing on two of its elements: Apache Hadoop and Apache Spark. We provide details of the MapReduce processing model and the differences between MapReduce 1.0 and MapReduce 2.0. The concepts defined here are important for understanding the complex systems presented in the following chapters of this part of the book.
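To make the MapReduce processing model mentioned above concrete, the following single-process sketch mimics its three phases (map, shuffle, reduce) with the classic word-count example. This is an illustration only, not code from the chapter: the function names are made up, and a real Hadoop job would distribute these phases across cluster nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all counts emitted for a given word.
    return key, sum(values)

documents = ["big data big value", "data analytics"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 2, 'data': 2, 'value': 1, 'analytics': 1}
```

Because each map and reduce call depends only on its own inputs, the framework can run many of them in parallel on different machines, which is the key idea behind MapReduce's scalability.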
The ability to collect, analyze, triangulate, and visualize vast amounts of data in real time is something the human race has never had before. This new set of tools, often referred to by the lofty term 'Big Data,' has begun to emerge as a new approach to addressing some of the biggest challenges facing our planet.
Rick Smolan
© 2018 Springer Nature Switzerland AG
Mrozek, D. (2018). Foundations of the Hadoop Ecosystem. In: Scalable Big Data Analytics for Protein Bioinformatics. Computational Biology, vol 28. Springer, Cham. https://doi.org/10.1007/978-3-319-98839-9_6
Print ISBN: 978-3-319-98838-2
Online ISBN: 978-3-319-98839-9
eBook Packages: Computer Science (R0)