Foundations of the Hadoop Ecosystem

Part of the book series: Computational Biology (COBO, volume 28)

Abstract

The era of Big Data, which we entered several years ago, has changed our perception of the type and volume of data that can be processed, as well as of the value of that data. Hadoop and the MapReduce processing model have revolutionized the way we process and analyze data and the amount of important and valuable information we can extract from it. Today, the Hadoop ecosystem covers a broad collection of platforms, frameworks, tools, libraries, and other services for fast, reliable, and scalable data analytics. In this chapter, we briefly describe the Hadoop ecosystem and focus on two of its elements: Apache Hadoop and Apache Spark. We also provide details of the MapReduce processing model and the differences between MapReduce 1.0 and MapReduce 2.0. The concepts defined here are important for understanding the complex systems presented in the following chapters of this part of the book.

The ability to collect, analyze, triangulate, and visualize vast amounts of data in real time is something the human race has never had before. This new set of tools, often referred to by the lofty term ‘Big Data,’ has begun to emerge as a new approach to addressing some of the biggest challenges facing our planet.

Rick Smolan

Author information

Correspondence to Dariusz Mrozek.

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Mrozek, D. (2018). Foundations of the Hadoop Ecosystem. In: Scalable Big Data Analytics for Protein Bioinformatics. Computational Biology, vol 28. Springer, Cham. https://doi.org/10.1007/978-3-319-98839-9_6

  • DOI: https://doi.org/10.1007/978-3-319-98839-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98838-2

  • Online ISBN: 978-3-319-98839-9

  • eBook Packages: Computer Science (R0)
