Real-Time ETL and Analytics Magic

Nabi, Zubair

doi:10.1007/978-1-4842-1479-4_8

Abstract

Data (big or otherwise) has been woven into the fabric of most businesses. The world is at a stage where Big Data directly drives corporate strategy. To maintain a competitive edge, most businesses try to run their analytics pipeline in near real-time. Although this captures the behavior of a large class of applications that rely on unstructured data, it is not exhaustive: a significant chunk of data sources are structured, and their analysis applications require data-warehousing capabilities. One way to handle these requirements is to blend the existing Spark API with an external warehousing solution such as Hive, but this is a marriage of convenience rather than a natural fit: data must be copied back and forth, not to mention the burden of maintaining two different APIs. A better solution is Spark SQL.

Notes

1. Charles Clover, “Urban Population to Exceed 50 Percent,” The Telegraph, June 27, 2007, www.telegraph.co.uk/news/earth/earthnews/3298527/Urban-population-to-exceed-50-per-cent.html.
2. Joshua Pramis, “Number of Mobile Phones to Exceed World Population by 2014,” Digital Trends, February 28, 2013, www.digitaltrends.com/mobile/mobile-phone-world-population-2014/.
3. Open Big Data, Dandelion, https://dandelion.eu/datamine/open-big-data/.
4. http://geojson.org/.
5. A Parquet table is typically made up of more than one file, which may be located at multiple locations.
6. http://spark-packages.org/package/databricks/spark-csv.
7. https://issues.apache.org/jira/browse/SPARK-8368.
8. https://cran.r-project.org/web/views/HighPerformanceComputing.html.
9. https://db.apache.org/derby/releases/release-10.10.1.1.html.

Author information

Authors and Affiliations

Lahore, Pakistan
Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). Real-Time ETL and Analytics Magic. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_8

DOI: https://doi.org/10.1007/978-1-4842-1479-4_8
Published: 14 June 2016
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-1480-0
Online ISBN: 978-1-4842-1479-4
eBook Packages: Professional and Applied Computing; Apress Access Books; Professional and Applied Computing (R0)
