The Hitchhiker’s Guide to Big Data

Abstract

By the time you get to the end of this paragraph, you will have processed 1,700 bytes of data. This number will grow to 500,000 bytes by the end of this book. Taking that as the average size of a book and multiplying it by the total number of books in the world (according to a Google estimate, there were 130 million books in the world in 2010; see note 1) gives 65 TB. That is a staggering amount of data that would require 130 standard, off-the-shelf 500 GB hard drives to store. Now imagine you are a book publisher and you want to translate all of these books into multiple languages (for simplicity, let’s assume all these books are in English). You would like to translate each line as soon as it is written by the author; that is, you want to perform the translation in real time using a stream of lines rather than waiting for the book to be finished. The average number of characters or bytes per line is 80 (this also includes spaces). Let’s assume the author of each book can churn out 4 lines per minute (320 bytes per minute), and all the authors are writing concurrently and nonstop. Across all 130 million books, that adds up to 41,600,000,000 bytes, or 41.6 GB, per minute. This is well beyond the processing capabilities of a single machine and requires a multi-node cluster. Atop this cluster, you also need a real-time data-processing framework to run your translation application. Enter Spark Streaming. Appropriately, this book will teach you to architect and implement applications that can process data at scale and at line rate.
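The arithmetic in the abstract can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the abstract's own figures, while the constant and function names are made up for illustration, and decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) are assumed because they are what make the quoted totals come out exactly.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Names are illustrative; decimal units (1 GB = 10**9 bytes) are an assumption.

BYTES_PER_BOOK = 500_000            # average book size used in the abstract
BOOKS_IN_WORLD = 130_000_000        # Google's 2010 estimate, rounded
DRIVE_CAPACITY_BYTES = 500 * 10**9  # one 500 GB hard drive
BYTES_PER_LINE = 80                 # average line length, spaces included
LINES_PER_MINUTE = 4                # assumed writing speed per author


def corpus_size_bytes() -> int:
    """Total size of every book in the world, in bytes."""
    return BYTES_PER_BOOK * BOOKS_IN_WORLD


def drives_needed() -> int:
    """Number of 500 GB drives required to hold the corpus (rounded up)."""
    return -(-corpus_size_bytes() // DRIVE_CAPACITY_BYTES)  # ceiling division


def stream_rate_bytes_per_minute() -> int:
    """Aggregate bytes written per minute if all authors write concurrently."""
    return BYTES_PER_LINE * LINES_PER_MINUTE * BOOKS_IN_WORLD


if __name__ == "__main__":
    print(f"Corpus size: {corpus_size_bytes() / 10**12:.1f} TB")
    print(f"Drives needed: {drives_needed()}")
    print(f"Stream rate: {stream_rate_bytes_per_minute() / 10**9:.1f} GB per minute")
```

Changing any one constant, such as the number of lines per minute, shows how sensitive the required throughput is to the assumptions in the thought experiment.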

Notes

  1. Leonid Taycher, “Books of the world, stand up and be counted! All 129,864,880 of you,” Google Books Search, 2010, http://booksearch.blogspot.com/2010/08/books-of-world-stand-up-and-be-counted.html.
  2. IEEE Computer Society, “Web Search for a Planet: The Google Cluster Architecture,” 2003, http://static.googleusercontent.com/media/research.google.com/en//archive/googlecluster-ieee.pdf.
  3. Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (Morgan & Claypool, 2009), www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006.
  4. Jeffrey Dean and Luiz André Barroso, “The Tail at Scale,” Commun. ACM 56, no. 2 (February 2013), 74-80.
  5. First described by Eric Brewer, the Chief Scientist of Inktomi, one of the earliest web giants in the 1990s.
  6. Werner Vogels, “Eventually Consistent – Revisited,” All Things Distributed, 2008, www.allthingsdistributed.com/2008/12/eventually_consistent.html.
  7. ACID and BASE are not binary choices, though. There is a continuum between the two, with many design points.
  8. This attention span is getting shorter because most users now consume these services on the go on mobile devices.
  9. Grzegorz Malewicz et al., “Pregel: A System for Large-Scale Graph Processing,” Proceedings of SIGMOD ’10 (ACM, 2010), 135-146.
  10. Sergey Melnik et al., “Dremel: Interactive Analysis of Web-Scale Datasets,” Proc. VLDB Endow. 3, no. 1-2 (September 2010), 330-339.
  11. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Proceedings of OSDI ’04, vol. 6 (USENIX Association, 2004), 10.
  12. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” Proceedings of SOSP ’03 (ACM, 2003), 29-43.
  13. Tyson Condie et al., “MapReduce Online,” Proceedings of NSDI ’10 (USENIX, 2010), 21.
  14. This is now known as the Internet of Things (IoT).
  15. http://cs.brown.edu/research/aurora/.
  16. http://cs.brown.edu/research/borealis/.
  17. http://nms.csail.mit.edu/projects/medusa/.
  18. http://telegraph.cs.berkeley.edu/.
  19. www.streambase.com/.
  20. http://www-03.ibm.com/software/products/en/infosphere-streams.
  21. www.softwareag.com/corporate/products/apama_webmethods/analytics/overview/default.asp.
  22. Leonardo Neumeyer et al., “S4: Distributed Stream Computing Platform,” Proceedings of ICDMW ’10 (IEEE, 2010), 170-177.
  23. http://incubator.apache.org/s4/.
  24. http://samza.apache.org/.
  25. Chapter 4 describes Kafka in detail when we analyze the various external sources from which to ingest data.
  26. https://storm.apache.org/.
  27. http://aws.amazon.com/kinesis/.
  28. http://azure.microsoft.com/en-us/services/event-hubs/.
  29. https://cloud.google.com/dataflow/.
  30. Tyler Akidau et al., “MillWheel: Fault-Tolerant Stream Processing at Internet Scale,” Proc. VLDB Endow. 6, no. 11 (August 2013), 1033-1044.
  31. Craig Chambers et al., “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” SIGPLAN Not. 45, no. 6 (June 2010), 363-375.

Copyright information

© 2016 Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). The Hitchhiker’s Guide to Big Data. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_1
