Abstract
The era of Big Data, which we entered several years ago, has changed our expectations about the types and volumes of data that can be processed, as well as the value that data holds. Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data today, and how much important and valuable information we can extract from it. Today, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. In this chapter, we briefly describe the Hadoop ecosystem, focusing on two of its elements: Apache Hadoop and Apache Spark. We provide details of the MapReduce processing model and the differences between MapReduce 1.0 and MapReduce 2.0. The concepts defined here are important for understanding the complex systems presented in the following chapters of this part of the book.
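To make the MapReduce processing model mentioned above concrete, the following single-process sketch mimics its three phases (map, shuffle, reduce) with the classic word-count example. This is an illustration only, not code from the chapter: the function names are made up, and a real Hadoop job would distribute these phases across cluster nodes.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all counts emitted for a given word.
    return key, sum(values)

documents = ["big data big value", "data analytics"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {'big': 2, 'data': 2, 'value': 1, 'analytics': 1}
```

Because each map and reduce call depends only on its own inputs, the framework can run many of them in parallel on different machines, which is the key idea behind MapReduce's scalability.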
The ability to collect, analyze, triangulate, and visualize vast amounts of data in real time is something the human race has never had before. This new set of tools, often referred to by the lofty term 'Big Data,' has begun to emerge as a new approach to addressing some of the biggest challenges facing our planet.
Rick Smolan
© 2018 Springer Nature Switzerland AG
Mrozek, D. (2018). Foundations of the Hadoop Ecosystem. In: Scalable Big Data Analytics for Protein Bioinformatics. Computational Biology, vol 28. Springer, Cham. https://doi.org/10.1007/978-3-319-98839-9_6
Print ISBN: 978-3-319-98838-2
Online ISBN: 978-3-319-98839-9
eBook Packages: Computer Science (R0)