Real-Time ETL and Analytics Magic

Pro Spark Streaming

Abstract

Data, big or otherwise, is woven into the fabric of most businesses, and Big Data now directly drives corporate strategy. To maintain a competitive edge, most businesses try to run their analytics pipelines in near real time. Although this approach captures the behavior of a large class of applications that rely on unstructured data, it is not exhaustive: a significant share of data sources is structured, and the applications that analyze them require data-warehousing capabilities. One way to handle these requirements is to blend the existing Spark API with an external warehousing solution such as Hive, but this is a marriage of convenience rather than a natural fit: data must be copied back and forth, and two different APIs must be maintained. A better solution is Spark SQL.

© 2016 Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). Real-Time ETL and Analytics Magic. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_8