ETL

  • Reference work entry

Synonyms

ELT; Extract-Transform-Load

Definitions

ETL is short for Extract-Transform-Load. The ETL process extracts data from operational source systems, transforms the data, and loads the data into a target. The transformations can involve many different activities, e.g., filtering, normalization or de-normalization to a desired form, joins, conversion, and cleansing to remove bad or dirty data. In the ELT variant, the data is extracted from the source systems, loaded in its raw form into the target, and then transformed within the target.
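The difference between the two orderings can be illustrated with a minimal sketch. It is not taken from the entry: the source rows, the table names (sales, raw_sales), and the function names are hypothetical, and an in-memory SQLite database merely stands in for the target. The ETL path cleanses and converts the data in application code before loading, whereas the ELT path loads the raw text first and then transforms it with SQL inside the target.

```python
import sqlite3

# Hypothetical raw records standing in for an operational source system.
SOURCE_ROWS = [
    ("1", " Alice ", "12.50"),
    ("2", "Bob", "not-a-number"),  # dirty: amount cannot be converted
    ("3", "carol", "7"),
    ("4", "", "3.25"),             # dirty: name is missing
]


def extract():
    """Extract: read the raw records from the (simulated) source."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: convert types, normalize names, drop dirty rows."""
    clean = []
    for raw_id, raw_name, raw_amount in rows:
        name = raw_name.strip().upper()
        try:
            amount = float(raw_amount)
        except ValueError:
            continue  # cleansing: amount is not numeric
        if not name:
            continue  # cleansing: name is empty
        clean.append((int(raw_id), name, amount))
    return clean


def load(conn, rows):
    """Load: write the already transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """ETL: extract, transform outside the target, then load."""
    load(conn, transform(extract()))


def run_elt(conn):
    """ELT: extract, load the raw data, then transform inside the target."""
    conn.execute("CREATE TABLE raw_sales (id TEXT, name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", extract())
    # The transformation is expressed in SQL and executed by the target.
    # The GLOB tests keep only amounts made of digits and dots.
    conn.execute(
        """
        CREATE TABLE sales AS
        SELECT CAST(id AS INTEGER) AS id,
               UPPER(TRIM(name)) AS name,
               CAST(amount AS REAL) AS amount
        FROM raw_sales
        WHERE TRIM(name) <> ''
          AND amount GLOB '*[0-9]*'
          AND amount NOT GLOB '*[^0-9.]*'
        """
    )
    conn.commit()


def read_sales(conn):
    """Return the contents of the target table ordered by id."""
    return conn.execute(
        "SELECT id, name, amount FROM sales ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    for runner in (run_etl, run_elt):
        target = sqlite3.connect(":memory:")
        runner(target)
        print(runner.__name__, read_sales(target))
```

Both paths end with the same cleansed rows in the sales table; what differs is where the transformation work is done and whether the raw data is also kept in the target.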

Overview

The term ETL process has traditionally been used for a process that populates a data warehouse (DW) managed by a relational database management system (RDBMS). As pointed out by Simitsis and Vassiliadis (2017), the basic concept of populating a data store with data reshaped from another data store is, however, older than data warehousing. The ETL process can be hand-coded or made with a designated ETL tool where the developer...

Notes

  1. http://sqoop.apache.org/

  2. https://beam.apache.org/

References

  • Akidau T, Bradshaw R, Chambers C, Chernyak S, Fernández-Moctezuma RJ, Lax R, McVeety S, Mills D, Perry F, Schmidt E et al (2015) The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc VLDB Endow 8(12):1792–1803

  • Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A et al (2015) Spark SQL: relational data processing in spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1383–1394

  • Dageville B, Cruanes T, Zukowski M, Antonov V, Avanes A, Bock J, Claybaugh J, Engovatov D, Hentschel M, Huang J, Lee AW, Motivala A, Munir AQ, Pelley S, Povinec P, Rahn G, Triantafyllis S, Unterbrunner P (2016) The snowflake elastic data warehouse. In: Proceedings of the 2016 international conference on management of data, SIGMOD’16, New York. ACM, pp 215–226. ISBN:978-1-4503-3531-7. http://doi.acm.org/10.1145/2882903.2903741

  • Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: 6th symposium on operating system design and implementation (OSDI 2004), San Francisco, 6–8 Dec 2004, pp 137–150. http://www.usenix.org/events/osdi04/tech/dean.html

  • Gupta A, Agarwal D, Tan D, Kulesza J, Pathak R, Stefani S, Srinivasan V (2015) Amazon redshift and the case for simpler data warehouses. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, SIGMOD’15, New York. ACM, pp 1917–1923. ISBN:978-1-4503-2758-9. http://doi.acm.org/10.1145/2723372.2742795

  • Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 3363–3372

  • Khayyat Z, Ilyas IF, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J-A, Tang N, Yin S (2015) Bigdansing: a system for big data cleansing. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 1215–1230

  • Kimball R (2008) The data warehouse lifecycle toolkit. Wiley, Hoboken

  • Liu X, Thomsen C, Pedersen TB (2013) Etlmr: a highly scalable dimensional ETL framework based on mapreduce. In: Hameurlain A, Küng J, Wagner RR (eds) Transactions on large-scale data-and knowledge-centered systems VIII. Springer, Heidelberg/New York, pp 1–31

  • Liu X, Thomsen C, Pedersen TB (2014) Cloudetl: scalable dimensional ETL for hive. In: Proceedings of the 18th international database engineering & applications symposium. ACM, pp 195–206

  • Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig latin: a not-so-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 1099–1110

  • Özcan F, Hoa D, Beyer KS, Balmin A, Liu CJ, Li Y (2011) Emerging trends in the enterprise data analytics: connecting hadoop and db2 warehouse. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1161–1164

  • Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: Parallel analysis with Sawzall. Sci Program 13(4):277–298

  • Simitsis A, Vassiliadis P (2017) Extraction, transformation, and loading. Springer, New York, pp 1–9. ISBN 978-1-4899-7993-3. https://doi.org/10.1007/978-1-4899-7993-3_158-3

  • Stonebraker M, Bruckner D, Ilyas IF, Beskales G, Cherniack M, Zdonik SB, Pagan A, Xu S (2013) Data curation at scale: the data tamer system. In: CIDR

  • Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endow 2(2):1626–1629

  • Tigani J, Naidu S (2014) Google BigQuery analytics. Wiley, Indianapolis

  • Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

Author information

Corresponding author

Correspondence to Christian Thomsen.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this entry

Thomsen, C. (2019). ETL. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_11
